
Unhandled <pre> tokenizing scenarios in tokenizer
Closed, Resolved (Public)

Description

See the test case below. The combination of a <pre> inside a <blockquote> and a p-tag before the blockquote (all of those conditions are necessary to reproduce the bug) causes the content after the blockquote to not be wrapped in p-tags. See the output below. This is probably an edge case in the paragraph-wrapping code.

Based on bug report here:
https://en.wikipedia.org/w/index.php?title=Wikipedia:VisualEditor/Feedback&oldid=575631753#VE_removing_paragraph_gaps

[subbu@earth lib] cat /tmp/x
a

<blockquote><pre>
b
</pre></blockquote>

c

d
[subbu@earth lib] node parse < /tmp/x
<body data-parsoid='{"dsr":[0,49,0,0]}'><p data-parsoid='{"dsr":[0,1,0,0]}'>a</p>

<blockquote data-parsoid='{"stx":"html","dsr":[3,42,12,13]}'><pre data-parsoid='{"stx":"html","autoInsertedEnd":true,"strippedNL":"\n","dsr":[15,29,5,0]}'>

b
&lt;/pre&gt;</pre></blockquote>

c

d
</body>


Version: unspecified
Severity: normal

Details

Reference
bz54946

Event Timeline

bzimport raised the priority of this task from to High. Nov 22 2014, 2:38 AM
bzimport set Reference to bz54946.

Changing that to "<blockquote><pre> b </pre></blockquote>" does not trigger the bug. So a p-tag before the blockquote, combined with an HTML <pre> inside the blockquote whose content starts on a new line, puts the p-wrapper in a state where p-tags are not added.

This is actually a tokenizer bug. The closing </pre> is not recognized as an end tag when an HTML <pre> follows another literal HTML tag on the same line.
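
To make the expected behaviour concrete, here is a minimal toy sketch in Python of the token stream one would hope to see for such input. It is not Parsoid's tokenizer (which is grammar-based and far more involved); the `tokenize` function and its token tuples are hypothetical names used only for illustration, and it only handles attribute-less tags.

```python
import re

# Hypothetical toy pattern: matches attribute-less start/end tags like <pre> or </pre>.
TAG = re.compile(r"</?([a-zA-Z]+)>")

def tokenize(src):
    """Split src into ("text", str), ("start", name) and ("end", name) tokens."""
    tokens = []
    pos = 0
    for m in TAG.finditer(src):
        if m.start() > pos:
            # Emit any plain text that precedes this tag.
            tokens.append(("text", src[pos:m.start()]))
        kind = "end" if m.group(0).startswith("</") else "start"
        tokens.append((kind, m.group(1).lower()))
        pos = m.end()
    if pos < len(src):
        # Trailing text after the last tag.
        tokens.append(("text", src[pos:]))
    return tokens
```

In this sketch, the closing </pre> is emitted as an end-tag token regardless of whether <pre> is preceded by another HTML tag on the same line, which is the property the buggy tokenizer lacks.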

Relevant snippet of output for: "a\n\n<span><pre>\nb\n</pre></span>"
...
<span data-parsoid='{"stx":"html","dsr":[3,28,5,6]}'><pre data-parsoid='{"stx":"html","autoInsertedEnd":true,"strippedNL":"\n","dsr":[8,22,5,0]}'>

b
&lt;/pre&gt;</pre></span>
...
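
The symptom in the output above (a <pre> marked "autoInsertedEnd":true whose content contains an escaped &lt;/pre&gt;) can be checked mechanically. The following is a rough Python sketch of such a check, assuming the serialized HTML has the same shape as the snippets in this report; the function name `has_swallowed_pre_end` is hypothetical and not part of Parsoid or its test suite.

```python
import re

# Matches a <pre> start tag and captures its single-quoted data-parsoid value,
# in the form shown in the output snippets of this report.
PRE_OPEN = re.compile(r"<pre data-parsoid='([^']*)'>")

def has_swallowed_pre_end(html):
    """Return True if some <pre> has an auto-inserted end tag and an escaped </pre> in its body."""
    for m in PRE_OPEN.finditer(html):
        attrs = m.group(1)
        end = html.find("</pre>", m.end())
        if end == -1:
            continue  # no closing tag at all; not the pattern we look for
        body = html[m.end():end]
        if '"autoInsertedEnd":true' in attrs and "&lt;/pre&gt;" in body:
            return True
    return False
```

Run against the snippet above, this check should flag the output; once the end tag is tokenized correctly, neither marker should be present.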

Change 87632 had a related patch set uploaded by Subramanya Sastry:
(Bug 54946) Fix unhandle <pre> tokenizing scenarios

https://gerrit.wikimedia.org/r/87632

Change 92469 had a related patch set uploaded by GWicke:
WIP Bug 54946: Alternative solution for <pre> tokenization

https://gerrit.wikimedia.org/r/92469

Change 92469 merged by jenkins-bot:
Bug 54946: Alternative solution for <pre> tokenization

https://gerrit.wikimedia.org/r/92469

Change 87632 abandoned by Subramanya Sastry:
(Bug 54946) Fixed unhandled <pre> tokenizing scenarios

Reason:
Old and rusty and I am not going to look at this as I originally thought.

https://gerrit.wikimedia.org/r/87632