Page MenuHomePhabricator

Fix parsing of infobox templates used in table attribute position
Closed, ResolvedPublic

Description

Wikitext of this form:

{| {{Infobox aircraft begin
 .. 
}} ... 
|}

is used on several pages (ex: [[en:B757]]).

This essentially generates table attrs. and also table content and is currently not parsed properly by Parsoid. The tokenizer adds the template to the table's attribute array, and it gets stuck there for good through the entire parse which means the entire tpl content is "lost".

There are a couple different ways of handling this:

(1) Fix AttributeExpander so that it hoists tokens seen after a NlTk out of the attribute, and after the token. Inelegant, but might do the trick.

(2) Fix this template (and any other template using this technique) so that it is self-contained rather than generate pieces of the table. At the very least, move "{|" inside the "Infobox aircraft begin" template. Since it is not feasible to edit all pages that transclude the template, we might have to provide support for the extra "{|" table tag found outside the template.

Just recording all this information here for now, along with a test case below:

{|{{Infobox aircraft begin
  |name= Boeing 757
  |image= File:Icelandair Boeing 757-256 Wedelstaedt.jpg<!-- Flight images are preferred for aircraft. Do not change image without a talk page discussion first, thanks. -->
  |caption= [[Icelandair]] Boeing 757-200 on final approach
  |alt=A mostly white Boeing 757 with blue and yellow trim preparing for landing against a blue sky. Landing gear and flaps are extended in final approach configuration.
}}{{Infobox aircraft type
  |type= [[Narrow-body aircraft|Narrow-body]] [[jet airliner]]
  |national origin= United States
  |manufacturer= [[Boeing Commercial Airplanes]]
  |designer=
  |first flight= February 19, 1982
  |introduction= January 1, 1983 with [[Eastern Air Lines]]
  |retired=
  |status= In service
  |primary user= [[Delta Air Lines]] <!--Limit one (1) primary user. Top 4 users listed in 'primary user' and 'more users' fields based on the number in their fleets. -->
  |more users= [[United Airlines]] <br>[[American Airlines]] <br>[[UPS Airlines]] <!-- Limit is three (3) in 'more users' field, four (4) total users with primary user. Please separate with <br/>.-->
  |produced= 1981–2004
  |number built= 1,050<ref name="last757built">{{cite web|url=http://www.boeing.com/news/releases/2004/q4/nr_041028g.html |title=Boeing Marks Completion of its 757 Commercial Airplane Program |date=October 28, 2004 |publisher=Boeing |accessdate=January 26, 2011}}</ref>
  |unit cost= 757-200: US$65&nbsp;million (2002) <br>757-300: US$80&nbsp;million (2002)
  |variants with their own articles= [[Boeing C-32]]
}}
|}

Version: unspecified
Severity: normal
URL: http://parsoid-prod.wmflabs.org/testwiki/46811?oldid=218081

Details

Reference
bz46811

Related Objects

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 1:37 AM
bzimport added a project: Parsoid.
bzimport set Reference to bz46811.

Variant 2) (extra table start tag inside the template) is supported by the PHP parser, but currently not by Parsoid. There might be other broken template transclusions that are currently broken in Parsoid for this reason, so we should probably add support for it in any case.

[Parsoid component reorg by merging JS/General and General. See bug 50685 for more information. Filter bugmail on this comment. parsoidreorg20130704]

Smaller test case than the one in the Description.

{|{{Infobox Aircraft Begin

name=Mitsubishi 1MT
image=Mitsubishi 1MT.jpg
caption=

}}{{Infobox Aircraft Type

type=[[Triplane]] [[torpedo bomber]]
national origin=Japan
manufacturer=[[Mitsubishi]]
designer=Herbert Smith
first flight={{avyear1922}}
introduced=
retired=
status=
primary user=[[Imperial Japanese Navy Air Service]]
more users=
produced=
number built=20
variants with their own articles=

}}

  • Bug 53139 has been marked as a duplicate of this bug. ***
  • Bug 52889 has been marked as a duplicate of this bug. ***

I guess one hacky solution to this would be to add another 'in table attribute position' flag to the transclusion processing pipeline. We could then pre-pend '{|' to the returned transclusion source in order to make it parse correctly. The tricky bit is that we'll likely have to also strip the extra table token to avoid foster-parenting of the entire table.

These templates seem to be quite common and won't all be fixed up soon, so it might be worth adding such a hack.

The soln in Comment 6 could work but it has to be able to continue to deal with (currently supported) scenarios like:

{| style='color:red;' {{echo|class='test'}}\n|foo\n|}
{| {{echo|class='test'}} style='color:red;'\n|foo\n|}

So, in general, it has to handle the following scenario.

{| <0 or more attrs> {{tpl here}} <0 or more attrs>
..

}

In addition, this has to also deal with:

{| <0 or more attrs> {{tpl-that-generates-more-attrs-and-content}}
..

}

So, this has the potential to get complex / fragile. I think the alternative -- Variant (1) in the description -- might potentially work out simpler and more robust since it can handle all of these scenarios identically.

So, this is what the AttributeExpander sees in @ line 150 for newK.

["class=\"infobox \" style=\"float: right; clear: right; width: 315px; border-spacing: 2px; text-align: left; font-size: 90%;\"",{"type":"NlTk","dataAttribs":{}},{"type":"TagTk","name":"th","attribs": ... }]

Hoisting everything from NlTk after the table token could work. But, you still need to clean up after this fixup since the template expansion would have been wrapped in meta-tags. So, you would have to hoist the tpl-start meta-tag to a position before the table-start ...

So, instead of returning: cb( { tokens: [token] } ) .. you might fix this to return [meta-start, token-with-fixed-attrs, NlTk-and-all-other-tokens-found-in-attr]

Lot of hand-waving going on ... but, in either scenario, the template wrapping will need fixing up in some fashion.

Change 173834 had a related patch set uploaded by Subramanya Sastry:
WIP: Bug 46811: Quick test hack

https://gerrit.wikimedia.org/r/173834

ssastry set Security to None.
ssastry raised the priority of this task from Medium to High.

Change 173834 had a related patch set uploaded (by Subramanya Sastry):
WIP: T48811: Quick test hack

https://gerrit.wikimedia.org/r/173834

Patch-For-Review

I think, this is an other occurence of this bug:
http://parsoid-lb.eqiad.wikimedia.org/idwiki/Kota_Batu?oldid=8180803

No, I don't think this is related. It looks like mismatched quotes since I see this in the output:

<td style="background:#FFFFFF; align="center" colspan="2">

Notice that the closing quote around style attribute is missing. This is a known behavior mismatch between parsoid and php parser. If the page / template is fixed up, that page should render properly (I haven't gone looking where the problem is).

Light at the end of the tunnel. The patch is now ready for final reviews. I've handled a bunch fo scenarios in this patch and added corresponding tests. Pages like enwiki:Boeing_757 now render identically and also RT the infobox without dirty diffs.

The unhandled scenarios (both of them documented in the commit message) are because our DOM spec doesn't have a representation for them. However, till we determine that those other use cases are common and need support, let us continue to leave them unsupported.

But, this patch should get us a long way forward on the largish subset of pages that used to render incorrectly with parsoid.

Change 173834 merged by jenkins-bot:
T48811: Fit a square peg in a round hole

https://gerrit.wikimedia.org/r/173834

There are a lot of these kind of templates out there.

Ex: https://fr.wikipedia.org/w/index.php?title=Mod%C3%A8le:TableauM%C3%A9daillesTriable&action=edit

I imagine other wikis have similar templates.

@Esanders, @Catrope could you test VE with the latest master and we should discuss how to improve the editing experience in these scenarios -- right now, the whole table will effectively get wrapped as a unit which makes for poor editing experience.

The other thing to do is update our data-mw spec for expanded attributes since multiple attributes from a template seems quite common.

Ex: https://hi.wikipedia.org/wiki/%E0%A4%B8%E0%A4%BE%E0%A4%81%E0%A4%9A%E0%A4%BE:Yearheader

This patch that was merged fixes the broken rendering for all these pages but it still doesn't do anything for RT-ability.

This has now been deployed and everything looks okay. I am going to open new tasks for followup work.