Page MenuHomePhabricator

Apply tag minimization on incoming HTML (ex: merge identical adjacent i/b tags)
Closed, ResolvedPublic

Description

The normalize_document algo that rewrites dom to deal with nested inline tags does not work in the presence of mw:StartTag and mw:EndTag meta-tags. The algo needs fixup to handle that. Plus, the reordered nodes also need their tsr and other attrs. patched up.

longest_linear_path function is one that checks for children.length > 1 -- and this check is not valid anymore. Look for other such checks and update.


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=49873
https://bugzilla.wikimedia.org/show_bug.cgi?id=53208

Details

Reference
bz42803

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 12:54 AM
bzimport added a project: Parsoid-DOM.
bzimport set Reference to bz42803.

CSA notes: see http://www.mediawiki.org/wiki/Parsoid/PostProcessor:DOM_Tag_Minimization

Also -- we might need to handle <a> tags in this algorithm, to ensure that <a> doesn't get broken up by reordering the other tags. In particular, look at the wikitext snippet:

''Something [http://example.com mixed''''', even bold]'''

which should not break up the <a> tag.

Related URL: https://gerrit.wikimedia.org/r/60617 (Gerrit Change I25711292536b66d33c8228ca01c526ccfa93875b)

The tag minimization is running out of stack space. The algorithm needs to be rewritten with an external stack.

See https://gerrit.wikimedia.org/r/60617

Disabling this (predictably) breaks our serializer handling of adjacent i/b tags, which was reported in bug 48194. We should write an iterative version of longest_linear_path.

Why does disabling the DOM pass (on wt2html) break our serializer handling of i/b tags on (html2wt)?

Something like <i>foo</i><i>bar</i> *could* be serialized as ''foo''<nowiki/>''bar'', but that is not what we are currently doing (and nor is it very desirable IMO).

Minimization avoids this issue by merging adjacent italic/bold ranges. With it enabled, the result will be ''foobar''.

But, <i>foo</i><<i>bar</i> could be fresh content and not be from original source. Are you suggesting we run this on the dom we get from VE before serialization?

Yes, that was the original idea. Minimization on the way out is still nice for round-tripping (and HTML sanity), but on the way in it is actually important for serialization.

For on the way in, meta tags are not an issue since. So, it can already be applied if we fix the stack space issue.

var dpp = require('./mediawiki.DOMPostProcessor.js'), d = require('./domino.js')

undefined

var doc = d.createDocument('<i>foo</i><i>bar</i>')

undefined

dpp.normalizeDocument(doc)

undefined

doc.outerHTML

'<html><head></head><body><i>foobar</i></body></html>'

Anyway, but yes, that code is quite stale and requires upgrade and testing and can be enabled on the DOM on the way in. Not sure if scott is still interested in this ticket. Otherwise, I can tackle this one of these days.

Commit 1c80e2d7f06403a49c80abfbd69b8b1fad039786 has some test DOMs.

  • Bug 49478 has been marked as a duplicate of this bug. ***
  • Bug 49766 has been marked as a duplicate of this bug. ***
  • Bug 49873 has been marked as a duplicate of this bug. ***

Related URL: https://gerrit.wikimedia.org/r/69852 (Gerrit Change Ibb18b4f653c664e8ab7876498dc8395d878f7aaa)

Just recapping an earlier IRC discussion: my opinion is that the general form of this bug is too aggressive (globally rewrites the input too drastically) and unnecessary. I prefer to see local fixes such as the one in comment 16 that address the straightforward local cases (like merging adjacent <span> or other tags with identical attributes) rather than trying to find globally-optimal tag nestings (which may create surprising long-distance mutations which can also be costly to search for).

My opinion may well be wrong. Hopefully the bugs linked as dups can be evaluated in this light; one of them may well be the killer case which simply can't be solved with a local transform and instead needs the resurrection of the global tag minimization code. I look forward to being thus shown incorrect. ;)

(In reply to comment #17)

Just recapping an earlier IRC discussion: my opinion is that the general form
of this bug is too aggressive (globally rewrites the input too drastically)
and
unnecessary.

The incoming normalization will only apply to ranges where at least a part is new or modified content. So no global rewrites.

A proper solution should not be thrown off by other unrelated formatting elements, so only looking for siblings won't quite cut it. This is easier in VE as they are using flat annotations instead of elements (see comment 16). Our equivalent needs to flatten nested elements into ranges, which is basically what this algo already implements.

With this change: https://gerrit.wikimedia.org/r/#/c/73523/ it is now possible that VE will send back adjacent annotations, e.g. by loading a document with two bolds separated by text, and then deleting the text between them, we will send back <b>Foo</b><b>Bar</b>, which currently serialises to '''Foo''''''Bar''' (which goes back to <b>Foo'<i>'</i>Foo</b>) so there is potential for corruption.

Given all the discussion we've had, I think we should implement a quick and simple solution for now. Here is my proposal.

As far as I can tell, the only place where tag reordering and minimization matters is where wikitext I and B tags end up as siblings. So, just targeting that situation should give what we want.

  1. ..</i><i> ..
  2. ..</b><b> ..
  3. ..</i></b><i> ..
  4. ..</b></i><b> ..

These 4 scenarios boil down to two cases which can be handled fairly easily in the serializer.

While HTML like this <s><i><b>a</b></i></s><b>c</b> (and other variants) could also be minimized (and would indeed be handled by the algorithm under discussion that I implemented), serializing this to <s>'''''a'''''</s>'''c''' should be sufficient since there is no ambiguity in the parse in the front end and hence no corruption.

  • Bug 52876 has been marked as a duplicate of this bug. ***

(In reply to comment #20)

Given all the discussion we've had, I think we should implement a quick and
simple solution for now. Here is my proposal.

As far as I can tell, the only place where tag reordering and minimization
matters is where wikitext I and B tags end up as siblings. So, just
targeting
that situation should give what we want.

  1. ..</i><i> ..
  2. ..</b><b> ..
  3. ..</i></b><i> ..
  4. ..</b></i><b> ..

These 4 scenarios boil down to two cases which can be handled fairly easily
in
the serializer.

We also need to consider variants where other tags are in the mix. Example:

<i><ins><b>bold-italic</b></ins></i><b>bold</b>

This happens to be a case that is handled by our minimization algorithm. The main deficiency is that the current recursive implementation can run out of stack space: https://gerrit.wikimedia.org/r/#/c/60617/

The other issue is that it currently applies to both old and new content. For incoming content it should only apply to b/i pairs in new/modified content.

I still think the simplest solution is to fix the algorithm to be iterative, and to enable it only on new / modified incoming content on the way in.

I think for that example: ''<ins>'''bold-italic'''</ins>'''''bold''' is an acceptable serialization which parses back to the same HTML.

But, the minimized HTML <b><i><ins>bold-italic</ins></i>bold</b> can be serialized to '''''<ins>bold-italic></ins>''bold'''. While shorter, this does not parse back to the original HTML, but to <i><b><ins>bold-italic</ins></b></i><b>bold</b>.

I am no longer convinced that we need a generic minimization solution rather than a special-case solution just for b-i tags. This will just let us strip out a big part of our codebase.

So, I am going to tackle this bug in the next couple days and first attempt the simplified solution and fall back to fixing the existing algorithm otherwise.

That said, a small part of me will be sad at not using it, because that algorithm was pretty much one of my first commits to the codebase ;-) and I think it is a neat general-purpose algorithm for rewriting interleaved tags to a minimal form.

(In reply to comment #23)

I think for that example: ''<ins>'''bold-italic'''</ins>'''''bold''' is an
acceptable serialization which parses back to the same HTML.

But, the minimized HTML <b><i><ins>bold-italic</ins></i>bold</b> can be
serialized to '''''<ins>bold-italic></ins>''bold'''. While shorter, this
does
not parse back to the original HTML, but to
<i><b><ins>bold-italic</ins></b></i><b>bold</b>.

Which then serializes to

'''''<ins>bold-italic</ins>''bold'''

This is semantically identical, and round-tripping is no issue for new or modified content. Do you see any issue in simply doing this apart from the need to fix up the recursion issue?

That (I think that function can easily be converted to tail-recursive form, if it is not already, and then iteration) + feels like using a very heavyweight technique for a relatively simple problem + deleting code is almost always a good thing. It would be more of a no-brainer if there were other scenarios that required this generic solution.

Change 80335 had a related patch set uploaded by Subramanya Sastry:
(Bug 42803) Apply I/B minimization to incoming HTML

https://gerrit.wikimedia.org/r/80335

  • Bug 48194 has been marked as a duplicate of this bug. ***

Change 80335 merged by jenkins-bot:
(Bug 42803) Apply I/B minimization to incoming HTML

https://gerrit.wikimedia.org/r/80335

  • Bug 53208 has been marked as a duplicate of this bug. ***