Page MenuHomePhabricator

Extension tag text (ex: <ref>) with opening/closing tags in different DOM nodes are not nowiki-escaped
Closed, ResolvedPublic

Description

Input HTML

<p>Foo<ref>Bar</ref>Baz</p>

Expected wikitext

Foo<nowiki><ref>Bar</ref></nowiki>Baz

Actual wikitext

Foo<ref>Bar</ref>Baz

This is rather problematic if a user is documenting MediaWiki operations. :-(


Version: unspecified
Severity: normal

Details

Reference
bz57469

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 2:39 AM
bzimport set Reference to bz57469.

Hmm ... I guess somehow this one fell through the cracks or we have had regressions (which means our wt-escaping tests have holes in them). I was sure this was being handled already, and a cursory glance through the escaping code shows that there are checks for extensions, etc. Anyway, we will investigate next week.

So, actually, this is a slightly different use case than what is currently supported.

Try entering any of the extension tags in VE and Review changes. Parsoid escapes the ext-tag like text as long as VE sends them back to Parsoid as text, i.e. <p>Foo&ot;ref&gt;Bar&lt;/ref&gt;Baz</p>. I verified this behavior in my local install of VE. This is similar to entering <div> or any other HTML tag as text in VE. If VE does not escape those tag strings to text, Parsoid will not be able to escape them.

Currently, Domino parses "<ref>foo</ref>" as a valid XML tree which is the same behavior as the browser (verified in the web console). So, while Parsoid could check the node name against the set of valid HTML5 tags and nowiki the others (since we are not serializing a generic XML domtree), why would VE not entity-escape "<" and ">" chars entered as text before sending them to Parsoid, i.e. is this a valid use case in the context of VE editing?

Sounds a lot like a VE bug then. James, can you verify that you are sending those tags over the wire with correct escaping?

(In reply to comment #2)

So, actually, this is a slightly different use case than what is currently
supported.

Try entering any of the extension tags in VE and Review changes. Parsoid
escapes the ext-tag like text as long as VE sends them back to Parsoid as
text,
i.e. <p>Foo&ot;ref&gt;Bar&lt;/ref&gt;Baz</p>. I verified this behavior in my
local install of VE. This is similar to entering <div> or any other HTML tag
as
text in VE. If VE does not escape those tag strings to text, Parsoid will not
be able to escape them.

That's how it should work, but on current master I get:

$ echo "<p>&lt;ref&gt;</p>" | node js/tests/parse.js --html2wt
<ref>
$ echo "<p>&lt;ref&gt; <code>&lt;ref&gt;</code></p>" | node js/tests/parse.js --html2wt
<ref> <code><ref></code>

Ah, I see .. The first example is correct behavior since isolated "<ref>" or "</ref>" are always parsed back as text and hence there is no need to nowiki-escape them.

However, the escaping code does need an update to deal with the second example. It currently only handles matching opening/closing extension tags in the same text node.

[subbu@earth tests] echo "&lt;ref&gt;foo&lt;/ref&gt;" | node parse --html2wt
<nowiki><ref>foo</ref></nowiki>
[subbu@earth tests] echo "&lt;ref&gt;foo" | node parse --html2wt
<ref>foo
[subbu@earth tests] echo "&lt;ref&gt;foo<p>&lt;/ref&gt;</p>" | node parse --html2wt
<ref>foo

</ref>

Unlike escaping for link/heading tags, where we only need to worry about wikitext on a single-line, the opening and closing tags for extensions can be split anywhere in the DOM. So, a conservative approach would be to always escape individual <ref> or </ref> tags which, in some cases, might lead to excess nowiki escaping, but will always be safe.

(In reply to comment #5)

So, a conservative approach would be to always
escape individual <ref> or </ref> tags which, in some cases, might lead to
excess nowiki escaping, but will always be safe.

That seems like a sane approach to me.

Change 103598 had a related patch set uploaded by Subramanya Sastry:
(Bug 57469) Fix for nowiki escaping of ext-tag like text

https://gerrit.wikimedia.org/r/103598

Change 103598 merged by jenkins-bot:
(Bug 57469) Fix for nowiki escaping of ext-tag like text

https://gerrit.wikimedia.org/r/103598

Stray </ref> tags on these pages leads to dirty diffs (but acceptable, IMO) because of the fix:

and a bunch of other pages.

Adding this info here for the record in case we need to dig up pages later.

Looks like stray </ref> tags are not that uncommon .. A few tens of dirty diffs in rt-testing are all probably because of this.

Another source of nowiki-ed dirty diffs seem to be because of buggy ref tags. That is captured in bug 59598

Together, these two categories of buggy ref-wikitext account for ~50 dirty diffs in rt-testing (out of about 160K pages tested). Still a small number overall, and selser will hide most of this in production, but could show up once in a while, but nothing that warrants more complex code in Parsoid to hide them.

(In reply to comment #10)

Still a small number
overall, and selser will hide most of this in production, but could show up
once in a while, but nothing that warrants more complex code in Parsoid to
hide
them.

Agreed. These diffs are also classified as syntactical in our rt test setup, so should not distract too much from serious semantic diffs.