Page MenuHomePhabricator

Ignore imperfect table syntax, or warn loudly about it getting surfaced if VEditing
Open, MediumPublic

Description

This table https://en.wikipedia.org/wiki/Dancing_with_the_Stars_(U.S._TV_series)#Professional_partners was broken.
This https://en.wikipedia.org/w/index.php?title=Dancing_with_the_Stars_%28U.S._TV_series%29&diff=598301735&oldid=598141704 happened when the user just tried to add a wikilink (you can find it at the very bottom of the diff).
This edit https://en.wikipedia.org/w/index.php?title=Dancing_with_the_Stars_%28U.S._TV_series%29&diff=598390953&oldid=598363952 seems to have fixed the table (https://en.wikipedia.org/w/index.php?title=User%3AElitre_%28WMF%29%2FSandbox&diff=598393459&oldid=598393277 ).

Users are not happy, and demand that "When a "bad" syntax doesn't break the table while editing on wikitext it shouldn't break it while editing with VE either.",
"Maybe there should be some sort of Edit Notice or similar when Parsoid detects broken html."


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=46705

Details

Reference
bz62323

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 3:05 AM
bzimport added a project: Parsoid.
bzimport set Reference to bz62323.

Bug 61011 is another problem when VE/Parsoid have different behaviour and causes problems with incorrect wiki-syntax.

We'll investigate why it bombed so spectacularly. We have had various fixes over time that should have prevented this from happening, but there will still be some corner cases that will get through. The current parser and Parsoid parse wikitext using very different techniques. Because of that, it is not possible to replicate the PHP parser behavior in all broken wikitext cases.

In addition, check the bug report I linked to here. As part of GSoC (and possibly OPW), we are going to have 1/2 students work on surfacing all the broken wikitext information for fixups (either by bots or editors via a tool). See the linked bug report for more info.

But, we could also consider exposing some of it to VE, but we are not sure if that is a direction we want to go yet since that adds additional complexity to VE, especially when Parsoid won't know a priori what kind of broken wikitext will cause these kind of failures to happen. It might be better to put the energy into getting this wikitext fixup tool in place. The hope is that we'll have something usable in 6 months.

(In reply to Helder from comment #4)

Here is another example:
https://pt.wikipedia.org/wiki/Special:Diff/38334798

See below for a reduced test case on which this can be reproduced.

{|

- > <math>x</math>
}

Two issues here. The ">" char throws off the recent accept crap in attributes (https://gerrit.wikimedia.org/r/#/c/115856/) patch. The <math> tag in turn seems to throw off our fostered content marking algorithm in dom.markFosteredContent.js. So, this is very much an edge case in that sense, but we'll at least make our fostering algorithm more robust so selser can more reliably ignore it.

(In reply to Gabriel Wicke from comment #3)

Verified to not round-trip correctly:
http://parsoid-lb.eqiad.wikimedia.org/_rtselser/enwiki/
Dancing_with_the_Stars_(U.S._TV_series)?oldid=598141704

As for this, our zero-out-dsr-range-for-fostered-content code seems to have stumbled on fostered templates for some reason. See the non-zero (6295,22379) range for the fostered p-tag below which also happens to be a transclusion.

<p data-parsoid="{"fostered":true,"autoInsertedEnd":true,"dsr":[6295,22379,null,null]}" typeof="mw:Transclusion" about="#mwt191" ...>..</p>

Possibly another edge case -- probably the dsr computation code implicitly zeroes this. Instead, we should probably have a pre/post pass do this explicitly and zero out these ranges.

BTW, someone suggested the idea of a notice if the HTML was detected to be bad. I like that idea, but I would start with a console.log message or something, instead of a banner. Might help debugging pages easier.

(In reply to Derk-Jan Hartman from comment #8)

BTW, someone suggested the idea of a notice if the HTML was detected to be
bad. I like that idea, but I would start with a console.log message or
something, instead of a banner. Might help debugging pages easier.

We already have this in the form of round-trip testing.

Change 117642 had a related patch set uploaded by GWicke:
Bug 62323: Eat > and [ in table / tr attribute names too

https://gerrit.wikimedia.org/r/117642

Change 117642 merged by jenkins-bot:
Bug 62323: Eat > and [ in table / tr attribute names too

https://gerrit.wikimedia.org/r/117642

(In reply to Gabriel Wicke from comment #9)

(In reply to Derk-Jan Hartman from comment #8)

BTW, someone suggested the idea of a notice if the HTML was detected to be
bad. I like that idea, but I would start with a console.log message or
something, instead of a banner. Might help debugging pages easier.

We already have this in the form of round-trip testing.

How does round-trip testing help the end user. For example take
https://en.wikipedia.org/w/index.php?title=1983_San_Miguel_Beermen_season&oldid=597926082
among other problems this has broken tags with a line

Carlson Samlani <sup> Acquired from defunct U-Tex Wranglers ,/sup>

I'm not seeing anything in the console.

That page actually had a worse bug, with a broken bit of markup in its infobox template. I've created a testpage using a sandboxed copy of the broken template

https://en.wikipedia.org/wiki/User:Salix_alba/1983_San_Miguel_Beermen_season

In the normal editor everything looks fine. In VE when you edit it the whole page is included in a single template and virtually impossible to edit.

A normal editor coming to this page and trying to use VE is going to get very confused. A warning message might help. A console message might have helped with the debugging of the template.

I've now fixed the template so things are fine now
https://en.wikipedia.org/w/index.php?title=Template%3AInfobox_PBA_season&diff=598663098&oldid=575403049

(In reply to Richard Morris from comment #12)

How does round-trip testing help the end user.

It doesn't, but the same is arguably true for console output.

We are mentoring a GSoC project to systematically collect and expose issues in Wikitext that Parsoid detects, see bug 46705. This will help end-users and bot authors to systematically fix classes of issues. We hope to integrate with the existing WikiCleaner project which already highlights a large class of issues. Parsoid can detect another class of issues (mis-nesting / unclosed tags for example), so should make that project even more useful.

Change 117933 had a related patch set uploaded by Subramanya Sastry:
WIP: Bug 62323: Tweaks to fostering code

https://gerrit.wikimedia.org/r/117933

Change 117933 merged by jenkins-bot:
Bug 62323: Several tweaks to DOM passes to handle fostered content.

https://gerrit.wikimedia.org/r/117933

The immediately actionable bits of this bug report have been addressed in Parsoid: (a) accept more of broken wikitext in table/tr attributes (gerrit 117642) (b) better handling of fosterable content (gerrit 177933).

Can this be resolved or do we still want to discuss passing warnings to VE about broken wikitext? Here is my take on it to expand on Comment #2:

  • If Parsoid can detect broken wikitext, it can "handle" it automatically and protect against it -- either discard it as in (a) above or protect it against dirty diffs as in (b) above. Note that over tiem, a lot of protections have been put in place into Parsoid to detect and gracefully recover from broken wikitext. But, in short, if something broken is getting through to VE or is getting serialized to broken wikitext, it is because Parsoid hasn't detected it yet in which case VE cannot be warned either. So, the larger request of this bug report is not really addressable (unless I've misunderstood the request). What we could expose is what we've already fixed up or "protected" or maybe mark fixed up portions uneditable (ex: fostered content), etc. as I mentioned in Comment #2. But, I am not sure that is necessarily useful.
  • We also have a parallel project underway (right now, in the form of a GSoC project proposal) for "online" detection of broken wikitext and recording them for automatic / manual fixup. It is online in the sense that whenever Parsoid detects and recovers from broken wikitext, that information is also emitted to be recorded in a db for fixup. Where applicable, Parsoid will also generate possible automatic fixup information. So, the idea (or hope, if you will :)) is that this might lead to less broken wikitext. Remains to be seen how this will pan out.

Other ideas welcome.

Arlolra set Security to None.