Parsoid is too aggressive about marking content surrounding templates as template-generated
Closed, Resolved (Public)

Description

This comes from hewiki, but seems to be unrelated to the language (?)

Go to this page: https://he.wikipedia.org/w/index.php?title=%D7%90%D7%99_%D7%94%D7%97%D7%96%D7%99%D7%A8&oldid=15683898

The first sentence is not part of the template, but VisualEditor (VE) marks it as if it were.

The template is here: https://he.wikipedia.org/wiki/%D7%AA%D7%91%D7%A0%D7%99%D7%AA:%D7%90%D7%99


Version: unspecified
Severity: normal

Details

Reference
bz67554

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:36 AM
bzimport added a project: Parsoid.
bzimport set Reference to bz67554.

I have observed this kind of behavior pretty frequently. It appears that if a template outputs a newline at the end (which is common, and easy to do by accident), Parsoid will consider the entire paragraph to be template-generated.

In this particular case, it looks like there was no newline between the infobox template and the start of the content. The template caused a paragraph break between itself and the content, and it also inserted a category that, for whatever reason, was placed inside the paragraph instead of just before it. The paragraph therefore got marked as template-generated, because it contained a category that came from the template.

See DOM of http://parsoid-lb.eqiad.wikimedia.org/hewiki/%D7%90%D7%99_%D7%94%D7%97%D7%96%D7%99%D7%A8?oldid=15683898
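The effect described above can be illustrated with a toy model. This is only a sketch, not Parsoid's actual algorithm: the `Node` class, its `from_template` flag, and `is_marked_template_generated` are hypothetical names invented for illustration. It assumes a simple rule where a wrapper counts as template-generated as soon as anything nested inside it came from a template.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical DOM node; not a Parsoid data structure."""
    name: str                     # e.g. "link", "#text", "p"
    from_template: bool = False   # True if a template expansion produced it
    children: List["Node"] = field(default_factory=list)


def is_marked_template_generated(node: Node) -> bool:
    """Toy rule (an assumption): a node is treated as template-generated
    if it, or anything nested inside it, came from a template."""
    return node.from_template or any(
        is_marked_template_generated(child) for child in node.children
    )


# The category link is emitted by the template; the sentence is page content.
category = Node("link", from_template=True)
sentence = Node("#text", from_template=False)

# Case 1: the category ends up inside the paragraph with the sentence.
wrapped = Node("p", children=[category, sentence])

# Case 2: the category stays outside, just before the paragraph.
separate = [category, Node("p", children=[sentence])]

print(is_marked_template_generated(wrapped))
print([is_marked_template_generated(n) for n in separate])
```

Under this assumed rule, the paragraph in case 1 is flagged even though its text was written on the page, while in case 2 only the category link is flagged.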

(In reply to Roan Kattouw from comment #1)

> I have observed this kind of behavior pretty frequently. It appears that if
> a template outputs a newline at the end (which is common, and easy to do by
> accident), Parsoid will consider the entire paragraph to be
> template-generated.

I take that back; that only happens for double newlines. It does happen for single newlines if you're in a list, though.

This is related to how paragraph wrapping is done in Parsoid.

[subbu@earth tests] echo "[[Category:Foo]]abc" | node parse --normalize

<p><link href="Category:Foo"/>abc</p>

I recently made a number of fixes so that unrelated content at the extremities like this is left out of paragraphs (https://gerrit.wikimedia.org/r/#/c/166891/ - bug 71361), but it looks like I missed some cases.

Arlolra claimed this task.

Closing. That WIP was merged.

echo "[[Category:Foo]]abc" | node parse --normalize

<link href="Category:Foo"/>
<p>abc</p>

and the example above on hewiki is fixed as well.
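The difference between the two outputs quoted in this thread can be sketched roughly as follows. This is a simplified illustration, not the code from the Gerrit change: `tokenize`, `wrap_naive`, and `wrap_trimmed` are hypothetical helpers, and the sketch assumes the input is a single line containing only plain text and `[[Category:...]]` links.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, value) where kind is "link" or "text"

# Assumption: only simple [[Category:Name]] links, no sort keys.
CATEGORY_RE = re.compile(r"\[\[(Category:[^\]|]+)\]\]")


def tokenize(wikitext: str) -> List[Token]:
    """Split one line of wikitext into category-link and text tokens."""
    tokens: List[Token] = []
    pos = 0
    for match in CATEGORY_RE.finditer(wikitext):
        if match.start() > pos:
            tokens.append(("text", wikitext[pos:match.start()]))
        tokens.append(("link", match.group(1)))
        pos = match.end()
    if pos < len(wikitext):
        tokens.append(("text", wikitext[pos:]))
    return tokens


def render(token: Token) -> str:
    kind, value = token
    return f'<link href="{value}"/>' if kind == "link" else value


def wrap_naive(tokens: List[Token]) -> List[str]:
    """Earlier behavior: everything on the line goes into one paragraph."""
    return ["<p>" + "".join(render(t) for t in tokens) + "</p>"]


def wrap_trimmed(tokens: List[Token]) -> List[str]:
    """Later behavior: category links at either end stay outside the <p>."""
    start, end = 0, len(tokens)
    while start < end and tokens[start][0] == "link":
        start += 1
    while end > start and tokens[end - 1][0] == "link":
        end -= 1
    out = [render(t) for t in tokens[:start]]
    if start < end:
        out.append("<p>" + "".join(render(t) for t in tokens[start:end]) + "</p>")
    out.extend(render(t) for t in tokens[end:])
    return out


tokens = tokenize("[[Category:Foo]]abc")
print("\n".join(wrap_naive(tokens)))
print("\n".join(wrap_trimmed(tokens)))
```

With the category link kept outside the paragraph, a link that came from a template no longer shares a wrapper with the page's own text, so the text does not get marked as template-generated.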