Page MenuHomePhabricator

Parsoid outputs <small> as child of <figure>
Closed, ResolvedPublic

Description

On http://parsoid.wmflabs.org/enwiki/User:Tamravidhir?oldid=603142295 , I see two images that look like:

<figure><small><a href="..."><img src="..." /></a><figcaption>Caption text</figcaption></small></figure>

You can find them by opening that link in a browser and typing document.querySelectorAll('figure>small') in a console, then clicking on the result (Firebug) or right-clicking the result and choosing "Reveal in elements panel" (Chrome).

Putting a <small> inside a <figure> tag but wrapped around a <figcaption> tag is a violation of the image spec (in addition to just being insane), which causes VisualEditor to get very confused and corrupt the image, which causes Parsoid to output [[undefined|NaNxNaNpx|right|link=]], see https://en.wikipedia.org/w/index.php?title=User:Tamravidhir&diff=603142575&oldid=603142295 . This is reminiscent of bug 62805 (NaNxNaNpx).

The wikitext that triggers this is fairly complicated, and I haven't been able to isolate a test case locally. It seems to be something like:

  • '''Text'''

<small>
[[File:Foo|right|200px|Caption]]

Text

[[File:Foo|right|200px|Caption]]
</small>

  • '''Text'''

<small>
etc.


Version: unspecified
Severity: major

Details

Reference
bz63642

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 3:21 AM
bzimport added a project: Parsoid.
bzimport set Reference to bz63642.

I'm fairly certain that LinkHandler never outputs <small>, so my guess is that this is a foster-parenting bug.

(02:34:55 PM) RoanKattouw: This structure is in a table cell BTW
(02:35:00 PM) RoanKattouw: It's like the 104th child of a <td>
(02:35:08 PM) RoanKattouw: Mismatched tags seems plausible
(02:35:15 PM) RoanKattouw: There's a +</div> elsewhere in the dirty diff

Reduced test case: Every newline in there is important to reproduce this.

{|
|
<small>
[[File:Foo.jpg|right|Test]]
</small>
|}

node parse --trace html --dump dom:post-builder reveals that Parsoid generates a <p><small></p><figure>..</figure></small> (imagine them to be html tokens input to the HTML5 treebuilder) which the treebuilder helpfully fixes up to <p><small></small></p><figure><small>..</small></figure>

But, the odd thing is that the tree builder does this only when this is nested inside the table. Not sure if that is a feature or a bug. To be tested in Firefox and Chrome.

Replace <small> with <b>/<i>/.. and you see the same behavior. But, I think this is the same class of bugs around formatting tags wrapping certain kind of block tags and getting fixed up by HTML5 parsing semantics. In the case of unclosed tags, we have bug reports where a <s> or a <ul> formatting tag continues to the rest of the document (bug 50536, bug 57196). I am recording this info here for now. Unclear what the exact solution is for this here, but clearly, we need a generic solution for misnested formatting tags.

Reports from Roan on IRC:

  • Not very common, but seen on a few user pages.
  • This corrupts wikitext on save in pages where it is seen.
  • Community liasion wants this fixed in some fashion.
  • More important to protect content than provide editability.

If possible, we should probably wrap such images with uneditable protection (mw:Placeholder) and deal with the larger issue later.

Firefox fixes this up as follows:

div=document.createElement('div')
div.innerHTML = "<p><small></p><figure><figcaption>foo</figcaption></figure</small>"
div.outerHTML

"<div><p><small></small></p><figure><figcaption><small>foo</small></figcaption></figure></div>"

But:

div.innerHTML = "<p><small></p><figure><a href='xxx'><img></a><figcaption>foo</figcaption></figure></small>"
div.outerHTML

"<div><p><small></small></p><figure><small><a href="xxx"><img></a><figcaption>foo</figcaption></small></figure></div>"

So it seems to be the <a> tag in particular which causes the treebuilder to resurface the <small> tag. The same thing happens if <small> is <b> or <i> or <s> or <u>, but not <a>:

div.innerHTML = "<p><a href=yyy></p><figure><a href='xxx'><img></a><figcaption>foo</figcaption></figure></a>"
div.outerHTML

"<div><p><a href="yyy"></a></p><figure><a href="xxx"><img></a><figcaption>foo</figcaption></figure></div>"

We might be able to handle this with a focused post-processing step which looks for <small>/<b>/<i>/<s>/<u> as the direct child of a <figure> and hoists the formatting safely inside the <figcaption>. In theory we should have a similar check in WTS in case we get invalid HTML as input.

For the WTS case, it's probably sufficient to tweak the code that looks for the various bits of <figure> and ensure it skips over any extraneous <small>/<b>/<i>/<s>/<u> as direct children of <figure>. That part wouldn't require a separate fixup pass.

Change 129818 had a related patch set uploaded by Cscott:
WIP: WTS: Handle tag fixup that floats inside <figure>.

https://gerrit.wikimedia.org/r/129818

Change 130109 had a related patch set uploaded by Subramanya Sastry:
WIP: (Bug 63642): Cleanup treebuilder fixup of formatting elts.

https://gerrit.wikimedia.org/r/130109

Change 130109 merged by jenkins-bot:
(Bug 63642): Cleanup treebuilder fixup of formatting elts.

https://gerrit.wikimedia.org/r/130109

This will be deployed on Wednesday.

Will close this .. Please reopen if this is not fixed post deploy tomorrow.

See bug 66749 which appears to be a dup?

Also, the WTS side of this ( https://gerrit.wikimedia.org/r/129818 ) never got finished/merged.

This is still causing problems. See https://ru.wikipedia.org/?diff=66125825 for instance.

$ echo "<small>[[File:Foo.jpg|right|300px]]</small>" | node tests/parse.js --wt2wt

<small>[[File:Foo.jpg|right|300px|link=]]

This seems to happen because of something similar to what I described in bug 66749 comment 7: there is a <small> inside a <figure>, which violates Parsoid's HTML spec, and causes a round-trip failure. Additionally, there's a frivolous <p><small></small></p> before and <p></p> after the image.

(In reply to ssastry from comment #15)

Any thoughts about https://gerrit.wikimedia.org/r/#/c/162816/?

I don't understand the patch very well, but the changes to the test cases look promising.

That patch also fixes your example from Comment 14:

[subbu@earth lib] echo "<small>[[File:Foo.jpg|right|300px]]</small>" | node parse.js --wt2wt
<small>[[File:Foo.jpg|right|300px]]</small>

Okay, will revive it and try to get a resolution on it soon enough. I need to study behavior on more convoluted test cases and see what Tidy does and what we do now and see if I can fold that into this patch.

Arlolra set Security to None.
Arlolra subscribed.

Change 215058 had a related patch set uploaded (by Subramanya Sastry):
T65642: Additional scenarios dealing with treebuilder fixup

https://gerrit.wikimedia.org/r/215058

ssastry claimed this task.
ssastry removed a project: Patch-For-Review.

This is now fixed and will be deployed on Monday. Please reopen if you see this not fixed in production.