Replace Tidy with a library that doesn't suck
Closed, Duplicate (Public)

Description

I mean, just look at Tidy

Tidy is awful and we need to get rid of it. There's gotta be a less crappy library that will just close unclosed HTML tags without messing with their contents.


Details

Reference
bz54617

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 2:15 AM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz54617.
bzimport added a subscriber: Unknown Object (MLST).

Brion suggested that switching the parser to use Parsoid would make HTMLTidy unnecessary.

I think the main reason we went with Tidy in the first place was to ensure that we had well-formed (X)HTML output, so that XML parsers wouldn't die and browsers wouldn't do exciting things if you had a stray </div> somewhere.

The core parser tries to do some HTML cleanup, but it was never very complete.

Possibilities include:

a) Fix the HTML fixups in the core parser, and make sure non-Tidy output is compatible with the current Tidy output

b) Replace Tidy with another tool that's less annoying

c) Replace the core parser with something that already outputs valid HTML5 (such as Parsoid)

Long-term I like c) but I don't think we're there yet. :)

a) and b) are the things I'd recommend looking at if we really want to kill tidy in the short/medium term.
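To make the cleanup being discussed concrete (closing unclosed tags, dropping a stray </div>, leaving contents alone), here is a minimal sketch in Python. It is not how Tidy, the core parser, or Parsoid work; `TagBalancer`, `balance` and the `VOID` set are hypothetical names made up for this illustration, and a real HTML5 treebuilder handles far more cases (implied end tags, tables, formatting elements).

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (an illustrative subset, not the full list).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class TagBalancer(HTMLParser):
    """Toy balancer: closes unclosed tags and drops stray closing tags."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Re-emit the start tag exactly as it appeared in the input.
        self.out.append(self.get_starttag_text())
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: emit as-is, nothing to balance.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closer, e.g. a lone </div>: drop it
        # Close everything opened after the matching tag, then the tag itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def balance(html):
    """Return html with unclosed tags closed and stray closers removed."""
    parser = TagBalancer()
    parser.feed(html)
    parser.close()
    # Anything still open at end of input gets closed, innermost first.
    while parser.open_tags:
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)
```

For example, `balance("<div><b>bold</div> tail</div>")` closes the `<b>` before the first `</div>` and drops the second, stray `</div>`, without touching the text.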

We hope to be at a point where we can consider using Parsoid output for regular page views by next summer. See https://www.mediawiki.org/wiki/Parsoid/Roadmap.

In Parsoid, an HTML5 treebuilder provides the bulk of the required clean-up. We also approximate the PHP / tidy parser's deviation from the standard cleanup in custom passes to make sure that the semantics of content written against the current setup are preserved.

You can't possibly want to require every MediaWiki installation everywhere to use Parsoid? The node.js dependency is unacceptable in most scenarios.

Rephrasing summary to reflect that we don't intend to get rid of fixing unclosed tags, only of Tidy specifically (we shouldn't kill Tidy without adding something else, so that makes the bug more "atomic").

Parsoid would only be needed for wikitext editing and -templating. HTML-only wikis would basically serve XHTML straight from storage.

(In reply to comment #6)

Parsoid would only be needed for wikitext editing and -templating. HTML-only
wikis would basically serve XHTML straight from storage.

You can't possibly want to require every MediaWiki installation everywhere to switch to editing raw HTML by hand (VE depends on Parsoid…).

(In reply to comment #7)

You can't possibly want to require every MediaWiki installation everywhere to
switch to editing raw HTML by hand (VE depends on Parsoid…).

VE is an HTML editor, so can be used without Parsoid.

(In reply to comment #8)

VE is an HTML editor, so can be used without Parsoid.

Well yeah, okay, this could work. VE, however, has certain software and hardware requirements not all computers meet. And there's the entire issue of "templating" which you dismissed with a single word, which I assume is currently not implemented without wikitext backing it.

VE also currently doesn't work for, say, talk pages (and please don't mention Flow, it will not be ready by next summer) or edit summaries, and there are certain pieces of the interface which show raw source code like diffs (I don't think anybody has implemented rich text diffs yet in MediaWiki, but this is something I'd really like to see).

Using Parsoid for page view is just not workable in short or mid term, no matter how much we would want it.

/offtopic

(In reply to comment #9)

Using Parsoid for page view is just not workable in short or mid term, no
matter how much we would want it.

Which issues do you see apart from rendering quality / compatibility?

(In reply to comment #10)

Which issues do you see apart from rendering quality / compatibility?

Compatibility/availability is the single showstopper issue here. I can't run server-side JavaScript on most free hostings.

(In reply to comment #11)

(In reply to comment #10)

Which issues do you see apart from rendering quality / compatibility?

Compatibility/availability is the single showstopper issue here. I can't run
server-side JavaScript on most free hostings.

So it's back to the policy question of what MediaWiki is intended to be - a great wiki for large- and medium-scale wikis, or a hodge-podge of tools which are limited by ease of download-the-zip-file installation over a proper management tool, rather than by what is best for users?

MediaWiki is intended to be both, if you ask me. I don't see how your question is relevant to the bug, since I am not proposing to make it a hodge-podge.

I run into too many problems because of Tidy. Its main flaw is that it is not compatible with HTML5; it hasn't been updated since 2008(!). Most problems stem from Tidy not allowing any block elements inside inline elements (which is allowed in HTML5): it kicks them out, which results in broken HTML, even though its goal is to prevent exactly that.
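To show the kind of nesting meant here, a small Python sketch that reports block elements found inside inline elements. The `INLINE` and `BLOCK` sets are rough assumptions for illustration only (they are not Tidy's rules nor the HTML5 content model), and `NestingChecker` / `block_in_inline` are hypothetical names.

```python
from html.parser import HTMLParser

# Illustrative subsets only; real tools use much longer, more nuanced lists.
INLINE = {"a", "b", "i", "em", "strong", "span", "small"}
BLOCK = {"div", "p", "ul", "ol", "table", "blockquote", "h1", "h2", "h3"}


class NestingChecker(HTMLParser):
    """Records (inline_ancestor, block_tag) pairs as they are encountered."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.findings = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK:
            # Look for the nearest enclosing inline element, if any.
            for ancestor in reversed(self.stack):
                if ancestor in INLINE:
                    self.findings.append((ancestor, tag))
                    break
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore closers with no match.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def block_in_inline(html):
    """Return a list of (inline_ancestor, block_tag) pairs found in html."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.findings
```

Markup like `<a href="#"><div>card</div></a>` is reported as `("a", "div")`, which is the sort of structure that gets rearranged when a block is moved out of its inline parent.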

Is there no library that has the same functionality and is up to date?

Found a library called HTML Purifier, but that's more of an 'evil code' filter with some 'Tidy-inspired' features. Probably not what we want.

There is also tidy-html5 [1], a fork that aims for full HTML5 support.

[1] https://github.com/w3c/tidy-html5

(In reply to Bartosz Dziewoński from comment #11)

(In reply to comment #10)

Which issues do you see apart from rendering quality / compatibility?

Compatibility/availability is the single showstopper issue here. I can't run
server-side JavaScript on most free hostings.

Nor can you typically run tidy there. Virtual machines are really cheap these days (starting at about $30 / year), so cost is no longer the issue that prevents people from installing better tools for the job. Missing packaging is another point, but that is also being addressed (parsoid is now debianized).

In any case, we are working on being ready to start using Parsoid HTML for normal page views this summer. We might not want to maintain the PHP parser in the longer term, and are thus less likely to spend much effort on replacing tidy right now.

(In reply to Gabriel Wicke from comment #16)

Nor can you typically run tidy there.

Citation needed. http://www.php.net/manual/en/book.tidy.php It's definitely more likely to be accessible than having node and being able to shell out.

Virtual machines are really cheap
these days (starting at about $30 / year), so cost is no longer the issue
that prevents people from installing better tools for the job.

$30 is not within the reach of everyone. There's also the fact that you have to have a credit card to get any reputable paid hosting, and that's also not a given in the whole world.

(In reply to Bartosz Dziewoński from comment #17)

$30 is not within the reach of everyone. There's also the fact that you have
to have a credit card to get any reputable paid hosting, and that's also not
a given in the whole world.

Depending on your use case there are also free options like Wikia and other non-profit options without ads. Free shared hosting is not automatically going to be more reputable than free VM hosting, nor do I see systematic differences in payment methods.

You are free to work on MediaWiki on shared hosting of course. All I'm saying is that there are few remaining reasons for us to

  • spend major resources on shared hosting support, and
  • let it hold back our architectural development at the expense of security, performance and maintainability

At some point I would like to replace tidy with an API-compatible binary which uses the standard HTML5 parser mechanism. It's on my list of 'free time projects'. There are lots of HTML5 parser libraries now.

https://phabricator.wikimedia.org/T94890#1320571 has some timings for HTML5 libraries:

Timings for Obama (1.1M):

  • SAX parse via libxmljs (node) and no-op handlers: 64ms
  • XML DOM parse via libxmljs (node): 16ms
    • XPATH match for ID (ex: dom.find('//*[@id = "mw123"]')) : 15ms
    • XPATH match for class (ex: dom.find("//*[contains(concat(' ', normalize-space(@class), ' '), ' interlanguage-link ')]")) : 34ms
  • HTML5 DOM parse via Mozilla's html5ever: 32ms
  • HTML5 DOM parse via domino (node): 220ms
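For readers unfamiliar with the two XPath lookups timed above, here is a rough Python equivalent on a tiny made-up document. It uses the standard library's ElementTree, which only supports a subset of XPath (no contains()), so the class lookup is done as a whitespace-token check in Python; that check is what the concat/normalize-space expression achieves. The sample markup and the helper names `find_by_id` and `find_by_class` are assumptions for illustration, and nothing here says anything about the timings.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in document; the real test page was far larger.
DOC = """<html><body>
<p id="mw123">First paragraph</p>
<ul>
  <li class="interlanguage-link interwiki-de">Deutsch</li>
  <li class="not-interlanguage-link">Other</li>
</ul>
</body></html>"""

root = ET.fromstring(DOC)


def find_by_id(root, element_id):
    """Same idea as //*[@id = "..."], in ElementTree's XPath subset."""
    return root.find(f".//*[@id='{element_id}']")


def find_by_class(root, class_name):
    """Match class_name as a whole token of the class attribute.

    Splitting on whitespace avoids matching 'not-interlanguage-link'
    when looking for 'interlanguage-link'.
    """
    return [el for el in root.iter()
            if class_name in el.get("class", "").split()]
```

Here `find_by_id(root, "mw123")` returns the `<p>` element, and `find_by_class(root, "interlanguage-link")` returns only the first `<li>`.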

Quiddity set Security to None.

Just updating status: T89331 seems to be the active task now. There are two implementations of "HTML5-based tidy": one in PHP, already committed to core as Balancer.php (part of T114445: [RFC] Balanced templates implementation), and one in Java named Depurate. Both have equivalent semantics; choosing between them in production is a matter of performance, although the PHP implementation is likely to live on in core regardless, as an option for those who need an "all in one" install.