Page MenuHomePhabricator

Parsoid changes </tr> to </tr in HTML tables with </tr> but no <tr>
Closed, ResolvedPublic

Description

When the wikitext contains an HTML table that includes </tr> tags but no corresponding <tr> tags, the table renders fine in viewing mode and in VE.
e.g.

<table> <tr> <th>H1</th> <th>H2</th> <th>H3</th> </tr> <td>[[Fish]]</td> <td>[[Cake]]</td> <td>[[Scone]]</td> </tr> <td>[[Goldfish]]</td> <td>[[Golden syrup]]</td> <td>[[Cream]]</td> </tr> </table>

Renders as intended.

However, when any edit to the table is made, Parsoid (presumably) strips the < from all the </tr> tags.
Sandbox: https://en.wikipedia.org/w/index.php?title=User:Thryduulf/sandbox&diff=570513680&oldid=570513611
Reported dif: https://en.wikipedia.org/w/index.php?title=2005%E2%80%9306_Polish_Basketball_League&diff=570501618&oldid=569248952

This isn't something that Parsoid can expect to encounter frequently, but it shouldn't make a mess of it when it does.


Version: unspecified
Severity: minor

Details

Reference
bz53464

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 1:57 AM
bzimport added a project: Parsoid.
bzimport set Reference to bz53464.

Unrelated thing we could also fix. the "|-" could instead be serialized as <tr>..</tr> based on HTML syntax rest of the table.

[subbu@earth tests] node parse --wt2wt --fetchConfig false < /tmp/tbl

<table> <tr> <th>H1</th> <th>H2</th> <th>H3</th> </tr> |-<td>[[Fish]]</td> <td>[[Cake]]</td> <td>[[Scone]]</td> |-<td>[[Goldfish]]</td> <td>[[Golden syrup]]</td> <td>[[Cream]]</td> </table>

This sounds a lot like a slightly wrong DSR.

There are two different pieces of this:

  1. With the missing HTML <tr> tag, Parsoid incorrectly assumes that the missing <tr> tag inserted by the tree builder comes from wikitext (which can be seen by the |- output in the non-selser serialization above). This confusion is the reason for the off-by-1 DSR error. I can fix that in the parser (and have a fix for it) with an additional check to determine whether a <tr> comes from wikitext.
  1. However, this also needs a fix in the serializer. Where the content of the <tr> is unchanged, selser accurated spits back the original wikitext. However, the serializer also needs a fix so that it attempts HTML-tag serialization instead of wikitext-tag serialization on the <tr>. That will fix the second issue.

As for 1. the problem is in migrateTrailingNLs. I fixed it, but I think we added this dom pass to deal with some issues in the WTS during refactoring to eliminate spurious diffs around trailing newlines and separators. I am tempted to simply rip out that pass and see what happens (which as a side effect, also fixes part 1. of this bug). For now, maybe I will push a fix for both, but I think we should consider getting rid of that pass altogether, if possible since it may not be necessary anymore.

You know, I can fix this, but it will be kind of hacky ... I am going to push my changes into a local branch and revisit this after I look at removing the migrateTrailingNLs handler. Deleting code is better than adding code. :-) This is not a critical bug anyway.

I believe we added migrateTrailingNLs at some point in the past to accomodate missing trailing WS handling in VE. We tried to keep all trailing whitespace outside of paragraphs etc. Afaik VE can handle that now, so there should not be a need for the migration any more.

Change 81600 had a related patch set uploaded by Subramanya Sastry:
(Bug 53464): Improved detection of missing opening HTML tags

https://gerrit.wikimedia.org/r/81600

Never mind. I found a simple and easy non-hacky fix which might apply in other situations as well and fixes all issues reported in this bug. I'll look at getting rid of migrateTrailingNLs separately tomorrow/next week.

Change 81600 merged by jenkins-bot:
(Bug 53464): Improved detection of missing opening HTML tags

https://gerrit.wikimedia.org/r/81600

Subbu / Gabriel: Patch was merged month ago - is there more work left here, or can you close this ticket as RESOLVED FIXED?