
Serialize HTML DOM according to polyglot markup spec so that it can be parsed with HTML and XML parsers
Closed, Resolved (Public)

Description

To make it easy to process our output using both XML and HTML tools, we should serialize our output to so-called 'polyglot markup'. This means that our output will be valid XML *and* HTML5 at the same time (effectively XHTML).

The spec for polyglot markup is at http://dev.w3.org/html5/html-xhtml-author-guide/. The relevant differences from our current HTML5 serialization should be:

  • void elements are serialized with a trailing slash, as in <br/>
  • only a small set of named entities is used; all other entities are converted to numeric character references (&nbsp; becomes &#xA0;)
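
As a rough illustration of the entity rule, here is a minimal Python sketch that rewrites named entities as hexadecimal character references. It assumes the "small set" that stays named is XML's five predefined entities, and it uses Python's HTML 4 entity table (`html.entities.name2codepoint`) as the lookup; the function name `to_numeric_references` is made up for this example and is not part of Parsoid or Domino.

```python
import re
from html.entities import name2codepoint

# Assumption: the entities that stay named are the five predefined by XML.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches a named entity reference such as &nbsp; or &eacute;
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_references(markup: str) -> str:
    """Replace named entities outside the XML set with &#xHEX; references."""

    def replace(match):
        name = match.group(1)
        # Leave XML's own entities and unknown names untouched.
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#x{:X};".format(name2codepoint[name])

    return _ENTITY_RE.sub(replace, markup)


print(to_numeric_references("a&nbsp;b &amp; c"))
```

A real serializer would not need this string pass at all, since it would start from DOM text nodes that are already decoded; the sketch only shows the mapping the bullet describes.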

We can either add this functionality in Domino, or create our own XMLSerializer implementation that walks an arbitrary DOM.
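
To make the second option concrete, a standalone serializer that walks a DOM could look roughly like the Python sketch below. It is not the Parsoid or Domino code: it uses the standard library's `xml.dom.minidom` as a stand-in DOM, the `serialize` function and the `VOID_ELEMENTS` set are names chosen for this example, and namespaces, comments, doctypes and raw-text elements such as `<script>` are deliberately left out.

```python
from xml.dom import Node
from xml.dom.minidom import Document
from xml.sax.saxutils import escape, quoteattr

# Assumption: the HTML5 void elements, which get a trailing slash.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "param", "source", "track", "wbr",
}


def serialize(node) -> str:
    """Serialize a DOM subtree so both XML and HTML parsers accept it."""
    if node.nodeType == Node.TEXT_NODE:
        # escape() emits only &amp;, &lt; and &gt;; other characters
        # are written out literally.
        return escape(node.data)
    if node.nodeType == Node.ELEMENT_NODE:
        attrs = "".join(
            " {}={}".format(name, quoteattr(value))
            for name, value in node.attributes.items()
        )
        if node.tagName in VOID_ELEMENTS:
            return "<{}{}/>".format(node.tagName, attrs)
        children = "".join(serialize(child) for child in node.childNodes)
        # Non-void elements always get an explicit end tag, even when
        # empty, because <div/> is not a closed element in HTML.
        return "<{0}{1}>{2}</{0}>".format(node.tagName, attrs, children)
    # Other node types (document, comment, ...) are reduced to their
    # children in this sketch, which drops comments entirely.
    return "".join(serialize(child) for child in node.childNodes)


doc = Document()
p = doc.createElement("p")
p.appendChild(doc.createTextNode("a < b"))
p.appendChild(doc.createElement("br"))
img = doc.createElement("img")
img.setAttribute("src", "x.png")
p.appendChild(img)
print(serialize(p))
```

Keeping the walker separate from the DOM implementation means it only relies on node types, tag names, attributes and child lists, so in principle the same approach could work against any DOM that exposes those.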


Version: unspecified
Severity: normal

Details

Reference
bz53968

Event Timeline

bzimport raised the priority of this task to High. (Nov 22 2014, 2:04 AM)
bzimport added a project: Parsoid-DOM.
bzimport set Reference to bz53968.

+1 ... HTML5's normalized parsing is great, but lots of programming environments only have an XML parser handy.

Change 88904 had a related patch set uploaded by GWicke:
Bug 53968: Add XMLSerializer and use it to produce XHTML

https://gerrit.wikimedia.org/r/88904

Change 88904 merged by jenkins-bot:
Bug 53968: Add XMLSerializer and use it to produce XHTML

https://gerrit.wikimedia.org/r/88904