
WTS: 5 quotes
Closed, Resolved
Public

Description

This is the unfixed part of bug 62569.

Good:
$ echo "'''''" | tests/parse.js --wt2wt
'''''

Bad:
$ echo "<p><b><i></i></b></p>" | tests/parse.js --html2wt
''''''''''
$ echo "<p><b><i></i></b></p>" | tests/parse.js --html2html --normalize
<body><p>'''''<b><i></i></b></p></body>

This is the "5 quotes, code coverage +1 line" test case in parserTests.txt.

It seems like there's still a quote-handling bug in WTS if we don't have data-parsoid to guide us.


Version: unspecified
Severity: normal

Details

Reference
bz63119

Event Timeline

bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 3:04 AM
bzimport added a project: Parsoid.
bzimport set Reference to bz63119.

This seems like a bug in the front-end tokenizer, not the serializer (which serializes the html just fine).

[subbu@earth lib] echo "<p><b><i></i></b></p>" | node parse --html2wt | node parse --trace peg-tokens
trace/peg-tokens : TOKS: ["'''''",{"type":"SelfclosingTagTk","name":"mw-quote","attribs":[],"dataAttribs":{"tsr":[5,10]},"value":"'''''"}]

The first 5 quotes are tokenized as a plain string rather than as a mw-quote token.
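
For context, here is a minimal sketch of the apostrophe-run rule as I understand it from the PHP parser's quote handling: a run of four is one literal apostrophe plus bold, and a run longer than five is literal apostrophes followed by a final five. The function name splitQuoteRun is hypothetical and this is not Parsoid's tokenizer code. It only illustrates why a 10-quote run can come out as a 5-quote string followed by a single mw-quote token.

```javascript
// Hypothetical helper, not part of Parsoid: splits a run of apostrophes
// into leading literal text and trailing quote markup.
// Assumed rule: 4 quotes = 1 literal + bold (3); more than 5 = extras
// are literal and the last 5 are markup; a lone apostrophe is plain text.
function splitQuoteRun(run) {
  if (!/^'+$/.test(run)) {
    throw new Error('expected a non-empty run of apostrophes');
  }
  const n = run.length;
  let markupLen;
  if (n < 2) {
    markupLen = 0;      // a single apostrophe is never markup
  } else if (n === 4) {
    markupLen = 3;      // first apostrophe is text, the rest is bold
  } else if (n > 5) {
    markupLen = 5;      // everything before the last five is text
  } else {
    markupLen = n;      // 2 (italic), 3 (bold), 5 (bold + italic)
  }
  return {
    text: "'".repeat(n - markupLen),
    markup: "'".repeat(markupLen),
  };
}

console.log(splitQuoteRun("''''''''''"));
```

Under that assumed rule, the 10-quote output of --html2wt splits into five literal apostrophes plus one five-quote markup token, which matches the trace above.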

Another case:
$ echo "''foo'''''" | tests/parse.js --normalize=parsoid
<body><p><i>foo</i><b></b></p></body>
$ echo "<p><i>foo</i><b></b></p>" | parse.js --html2html --normalize=parsoid
<body><p><i>foo'''</i><b></b></p></body>

This is the "Italics and bold: 2-quote opening sequence: (2,5+3)" test case.

I am not sure how much effort we should invest in preserving html2html for empty quote nodes as in these examples.

That said, one way to fix "<b><i></i></b>" is to insert a <nowiki/> in the empty node to break the quote block, giving '''''<nowiki/>'''''. This still does not preserve html2html exactly, but it does preserve semantics.

In the case in comment 2:
$ echo "<p><i>foo</i><b></b></p>" | tests/parse.js --html2wt
''foo''''''''
$ echo "''foo''''''''" | tests/parse.js --normalize
<body><p><i>foo'''</i><b></b></p></body>
$ echo "''foo''''''''" | php maintenance/parse.php
<p><i>foo'''</i>
</p>

But:
$ echo "''foo'''''<nowiki/>'''" | tests/parse.js --normalize
<body><p><i>foo</i><b><meta/></b></p></body>
$ echo "''foo'''''<nowiki/>'''" | php maintenance/parse.php
<p><i>foo</i>
</p>

So it does seem like our WTS should insert the <nowiki/> node there to preserve the semantics of the HTML.
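
A rough sketch of the idea, assuming a toy tree of { name, children } objects rather than the real DOM. The serialize function and QUOTES table are hypothetical names for illustration, not Parsoid's actual WTS code:

```javascript
// Toy serializer, not Parsoid's WTS: nodes are { name, children } objects
// and text is represented as plain strings.
const QUOTES = { i: "''", b: "'''" };

function serialize(node) {
  if (typeof node === 'string') {
    return node;
  }
  const inner = node.children.map(serialize).join('');
  const quote = QUOTES[node.name];
  if (quote === undefined) {
    return inner; // non-quote wrappers such as <p> add nothing here
  }
  // An empty quote node would otherwise yield an unbroken run of
  // apostrophes, so put <nowiki/> between the opening and closing quotes.
  return quote + (inner === '' ? '<nowiki/>' : inner) + quote;
}

const emptyBoldItalic = {
  name: 'p',
  children: [{ name: 'b', children: [{ name: 'i', children: [] }] }],
};
const italicThenEmptyBold = {
  name: 'p',
  children: [
    { name: 'i', children: ['foo'] },
    { name: 'b', children: [] },
  ],
};

console.log(serialize(emptyBoldItalic));     // '''''<nowiki/>'''''
console.log(serialize(italicThenEmptyBold)); // ''foo'''''<nowiki/>'''
```

The second output is the same wikitext as the <nowiki/> example above, which both parsers read as italic "foo" followed by an empty bold.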

Change 121141 had a related patch set uploaded by Cscott:
Fix WTS of empty quote nodes.

https://gerrit.wikimedia.org/r/121141

Change 121141 merged by jenkins-bot:
Fix WTS of empty quote nodes.

https://gerrit.wikimedia.org/r/121141