Page MenuHomePhabricator

Newlines after </li> cause dirty diffs with Parsoid
Closed, ResolvedPublic

Description

We would like to revert the fix for bug 39617. That bug should have been fixed via the 'whitespace' CSS property.

The 'fix' to bug 39617 caused Parsoid to begin creating dirty diffs (and failing its test suite) -- see https://gerrit.wikimedia.org/r/92978


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=39617

Details

Reference
bz56809

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 2:22 AM
bzimport added a project: Parsoid.
bzimport set Reference to bz56809.

Change 94443 had a related patch set uploaded by Cscott:
Revert "Have list items occupy their own line"

https://gerrit.wikimedia.org/r/94443

Why does this cause dirty diffs? Can I get a link to such diff?

Matma(In reply to comment #3)

Why does this cause dirty diffs? Can I get a link to such diff?

See the gerrit patch cscott linked to in the bug description (https://bugzilla.wikimedia.org/show_bug.cgi?id=56809#c0)

I've already seen the patch and sadly it is not telling me anything. What does a "dirty diff" mean here? A dirty diff during after editing a page with VisualEditor? Because that's the only significant definition I know.

Also, see https://gerrit.wikimedia.org/r/#/c/93626/ where I started messing with Parsoid to prevent these dirty diffs, but it seems a lot more pain than worth it currently -- it is mostly a distraction from other critical issues that I just left it in WIP stage. This wont be an issue in production necessarily right now, but if the CSS fix is not employed, then Parsoid will also have to emit the same kind of HTML that the core php parser emits and then this could be an issue in production with occasional dirty diffs without this WIP patch being worked out to completion. So, seems like a lot of pain for unclear benefit.

(In reply to comment #5)

I've already seen the patch and sadly it is not telling me anything. What
does
a "dirty diff" mean here? A dirty diff during after editing a page with
VisualEditor? Because that's the only significant definition I know.

Parsoid is failing the parser tests in html2wt and html2html mode since the same parser tests are used by core and Parsoid. I blacklisted the failures for now. But this blacklisting is not ideal since we have to be a bit more careful about failures in the future when inspecting changes.

That aside, if VE and Parsoid emits the same HTML that the core parser does for lists (emitting newlines around <ul> tags), then we have to deal with it.

Oh for crying...! Please provide a COMPLETE working set of CSS for horizontal lists that supposedly fixes all the bugs I have spent TWO YEARS trying to fix! I have tested EVERY scenario, including inline-block, which breaks nested lists in *ALL* browsers.

Unless you come up with a working CSS solution (which I doubt you will), I insist my patch be reinstated and Parsoid be fixed instead.

(In reply to comment #8)

Unless you come up with a working CSS solution (which I doubt you will), I
insist my patch be reinstated and Parsoid be fixed instead.

(Just a note that the patch was not reverted, but a revert has been submitted for review / consideration.)

Is there a demo of hlist set up somewhere? Or a list of things it does, and a list of cases it supports? Any of these things would be very useful here.

https://test.wikipedia.org/wiki/MediaWiki:Common.css contain current CSS for hlist.
https://test.wikipedia.org/wiki/User:Edokter/sandbox contains the test cases.

Any suggested changes will break one or more testcases.

(In reply to comment bug:39617#27)

The Tidy bug is what caused hlists to work in the first place, according to
what you're saying :)

Why is that even considered a bug? list items are block level elements, and as such, should be on their own line per unwritten code style conventions.

(In reply to comment #11)

(In reply to bug 39617 comment #27)

The Tidy bug is what caused hlists to work in the first place, according to
what you're saying :)

Why is that even considered a bug? list items are block level elements, and
as such, should be on their own line per unwritten code style conventions.

Tidy's overzealous newline insertion has caused problems in the past already, see bug 260 (which is still open only because of the same Tidy issue) and bug 38800 (which is basically what Gabriel reported).

Just to guide the discussion, let's try to keep this bug concentrated on the parsertest regressions in Parsoid and other extensions, and possible fixes for that. Discussion of the hlist issue and the CSS fix (or lack thereof) goes over on bug 39617.

Since most of the discussion has already happened on bug 39617, I'd say that, if in doubt, let's continue the discussion there. I'm just trying to prevent the discussion from being scattered across gerrit and a bunch of different bugzilla items, and people's good points being lost in the shuffle. (In particular, I'm going to copy comment 10 over to bug 39617.)

That is a scoping problem; Tidy should simply not touch anything that comes out of GeSHi (pre & code). This is simply two external extentions clashing.

I also have a problem understanding 'dirty diffs'. Where does parsoid get its HTML from? I thought it converted wikitext to HTML and then back. So how does the parser even come into play? Is Tidy also involved? Sorry to ask al this, but I simply need a basic understanding of parsoid.

FYI, the nowrap has been completely removed from hlist, so there are no CSS hacks to deal with. But any locally applied nowrap may cause errors, so the linebreaks must remain.

I can't help but feel that this issue is a major blocker for any parser improvement in the future, so I want to understand exactly where the problem exists and come up with a solution that will prevent such issues in the future.

Parsoid converts wikitext to HTML and back. Tidy does not enter the picture at all in the Parsoid pipeline.

There are two separate issues:

  1. Parsoid uses the same set of PHP parser tests to make sure that Parsoid generates HTML that doesn't break existing wikitext. It uses PHP HTML to serialize back to wikitext. The extra newline that PHP parser emits introduces extra newlines in wikitext. "a\n\n*b" instead of "a\n*b". So, the extra newline is a "dirty diff". This now obscures our regular testing because we blacklist these failing tests and then try to compare changes in HTML output. So, not entirely unworkable, but requires greater diligence to monitor regressions during development. Ideally, we wouldn't have these failures. Alternatively, we have to fix Parsoid's HTML to wikitext conversion to ignore the extra newlines when they come from the parser's insertion rather than from source -- that means extra bookkeeping.
  1. If you want Parsoid HTML to have the same CSS semantics like PHP HTML, Parsoid would also have to start adding these newlines after <ul> and before </ul>. These extra newline characters would now have to be accounted for in the DSR calculations (DSR maps a range to wikitext source to a DOM subtree, ex: wikitext substring [3,9] generate <p>foobar</p> HTML) which is also additional bookkeeping. DSR calculations has been carefully tweaked to account for every single character of wikitext so that Parsoid can faithfully output original wikitext on unedited portions of the HTML. So, Parsoid would have to ignore the extra inserted newlines in its calculations. Without accounting for them, Parsoid will generate different wikitext on conversion which manifests as "dirty diffs". Again, all of this extra bookkeeping can be done.

But, if we can avoid the extra newlines, the extra work can be avoided.

(In reply to Erwin Dokter from comment #15)

I can't help but feel that this issue is a major blocker for any parser
improvement in the future, so I want to understand exactly where the problem
exists and come up with a solution that will prevent such issues in the
future.

This is not a general blocker. Parser changes can be done and have been done, but changes have to be considered carefully since there are a lot of dependent users, and that is what you are seeing here in this bug report discussion.

Re: "a\n\n*b"

I don't understand where the extra newline comes from, as the parser outputs only one. Regardless... perhaps we need to go about this another way.

There should be a simple set of rules that govern how parser and parsoid interact, for example:

  • block level elements (including list itema) must be terminated with e newline
  • all other (inline) elements should not

Would that ease the situation?

(In reply to Erwin Dokter from comment #18)

  • block level elements (including list itema) must be terminated with e

newline

  • all other (inline) elements should not

Would that ease the situation?

This would in fact complicate the situation. We don't insert newlines in HTML where there were none in the original wikitext. This preserves the semantics independent of the CSS whitespace styling (remember that 'pre' is a possible value) and helps us to round-trip wikitext correctly without introducing spurious newlines.

As discussed in bug 39617, the newlines don't seem to be necessary any more now that the nowrap requirement is gone. Taking into account the cost they have for Parsoid development, I'd recommend removing the code that inserts them from the PHP parser.

OK, this discussion is not going to end in the next couple of hundred years, partly because I will never understand what is really going on. I still don't know where parsoid is getting its HTML from. Can't be from the browser DOM; that has gone through Tidy.

Revert the patch.

I just want the parser (as long as it lasts) to emit some sanely formatted lists, just as it outputs sanely formatted tables. Is that too much to ask?

If I can make the parser to make a clean break *only* between </li> and <li>, without any linebreak before or after <ul> and </ul>, would that make parsoid happy?

(In reply to Erwin Dokter from comment #20)

OK, this discussion is not going to end in the next couple of hundred years,
partly because I will never understand what is really going on.

Do you want to pop onto IRC (#mediawiki-parsoid) sometime and chat with us about this? We can then post the summary here after that. It might be simpler to figure this out in a more interactive discussion than is possible via bugzilla comments. I am going to be signing off for the day shortly, but the IRC channel is usually active between 9 am PST - 4 pm PST when most of us are around.

I still
don't know where parsoid is getting its HTML from. Can't be from the browser
DOM; that has gone through Tidy.

No, Parsoid generates the HTML as I indicated in Comment 16. In the case of Visual Editor use, VE then sends that HTML (alongwith any user edits) to Parsoid to convert back to wikitext.

Since Parsoid HTML will replace PHP parser HTML at some point in the foreseeable future (this year possibly), if anything requires these newlines to be present for functionality, Parsoid will also have to add them and that is what we are complaining about so far.

Ah, so it is only the *tests* that are affected. I think I understand now.

Still my question remains: can I make the parser make a clean break *only* between </li> and <li> (as that is the core issue affecting nowrap), without any linebreak before or after <ul> and </ul>?

(In reply to Erwin Dokter from comment #22)

Ah, so it is only the *tests* that are affected. I think I understand now.

Not just tests. See part (2) in Comment 16.

Still my question remains: can I make the parser make a clean break *only*
between </li> and <li> (as that is the core issue affecting nowrap), without
any linebreak before or after <ul> and </ul>?

That is already the case since wikitext has newlines between list items. See below for Parsoid's output.

[subbu@earth tests] echo "*a\n*b\n*c" | node parse
<body data-parsoid='{"dsr":[0,9,0,0]}'><ul data-parsoid='{"dsr":[0,8,0,0]}'><li data-parsoid='{"dsr":[0,2,1,0]}'>a</li>
<li data-parsoid='{"dsr":[3,5,1,0]}'>b</li>
<li data-parsoid='{"dsr":[6,8,1,0]}'>c</li></ul>
</body>

Sorry, but it isn't. The *old* PHP parser output is:

<ul><li>a
</li><li>b
</li><li>c
</li></ul>

No linebreak *between* list items. Parsoid goes (if I understand above correctly):

<ul><li>a</li>
<li>b</li>
<li>c</li></ul>

I want the PHP parser to behave the smae.

Parsoid does not insert newlines where there are none:

<ul><li>foo</li><li>bar</li></ul>

in wikitext becomes

<ul><li>foo</li><li>bar</li></ul>

It does preserve newlines from wikitext though:

  • foo
  • bar
  • baz

in wikitext becomes

<ul><li> foo</li>
<li> bar</li>
<li> baz</li></ul>

PHP + tidy inserts newlines in the first case too, which is broken.

(In reply to Gabriel Wicke from comment #25)

PHP + tidy inserts newlines in the first case too, which is broken.

I'm pretty sure that this is the case only if Tidy is enabled, and that kind of behavior is one of the many reasons why I want to kill it. (I really do not know why no one from the Parsoid team shares my sentiment about this.)

Change 134380 had a related patch set uploaded by GWicke:
Update list item newline handling to follow Parsoid's model

https://gerrit.wikimedia.org/r/134380

(In reply to Bartosz Dziewoński from comment #26)

(In reply to Gabriel Wicke from comment #25)

PHP + tidy inserts newlines in the first case too, which is broken.

I'm pretty sure that this is the case only if Tidy is enabled, and that kind
of behavior is one of the many reasons why I want to kill it. (I really do
not know why no one from the Parsoid team shares my sentiment about this.)

Believe me, we want to kill it just as much as you do. And we have been working hard on it for 2+ years now.

Change 134380 merged by jenkins-bot:
Update list item newline handling to follow Parsoid's model

https://gerrit.wikimedia.org/r/134380

Change 94443 abandoned by Cscott:
Revert "Have list items occupy their own line".

Reason:
Abandoned in favor of Ib7aa9449bbd994cb23b83b3f23cff944b1cddadf

https://gerrit.wikimedia.org/r/94443

This has since been fixed in core.