Replace Tidy with a library that doesn't suck
Closed, Duplicate
Public

Assigned To

None

Authored By

	matmarex
	Sep 25 2013, 11:01 PM

Description

I mean, just look at Tidy

Tidy is awful and we need to get rid of it. There's gotta be a less crappy library that will just close unclosed HTML tags without messing with their contents.

See Also:

T55784: [EPIC] Use Parsoid HTML for all page views

Details

Reference: bz54617

Related Objects

Mentioned In: T89331: Replace HTML4 Tidy in MW parser with an equivalent HTML5 based tool
T55784: [EPIC] Use Parsoid HTML for all page views
Mentioned Here: T114445: [RFC] Balanced templates
T55784: [EPIC] Use Parsoid HTML for all page views
T89331: Replace HTML4 Tidy in MW parser with an equivalent HTML5 based tool

Event Timeline

• bzimport raised the priority of this task to Medium. Nov 22 2014, 2:15 AM

• bzimport added a project: MediaWiki-Parser.

• bzimport set Reference to bz54617.

• bzimport added a subscriber: Unknown Object (MLST).

matmarex created this task. Sep 25 2013, 11:01 PM

Brion suggested that switching the parser to use Parsoid would make HTMLTidy unnecessary.

So the main reason we went with Tidy in the first place, I think, was to ensure that we had well-formed (X)HTML output, so that XML parsers wouldn't die and browsers wouldn't do exciting things if you had a stray </div> somewhere.
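
To make the "stray </div>" and "unclosed tag" cases concrete, here is a minimal sketch in Python of a stack-based tag balancer. It is only an illustration: it is not what Tidy does, it is not the HTML5 tree-construction algorithm, and the names TagBalancer and balance are made up for this example. It drops end tags that match nothing, closes whatever is still open, and leaves text content alone.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML5.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class TagBalancer(HTMLParser):
    """Hypothetical helper: closes unclosed tags and drops stray end tags."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []    # output fragments, joined at the end
        self.stack = []  # names of the elements that are currently open

    def handle_starttag(self, tag, attrs):
        # Re-emit the start tag exactly as it appeared in the input.
        self.out.append(self.get_starttag_text())
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag such as a lone </div>: drop it
        # Close any elements still open inside the one being closed.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def balanced(self):
        # Close whatever is still open at the end of the input.
        closers = "".join(f"</{tag}>" for tag in reversed(self.stack))
        return "".join(self.out) + closers


def balance(html):
    parser = TagBalancer()
    parser.feed(html)
    parser.close()
    return parser.balanced()
```

For example, balance("<div><b>bold</div>") closes the <b> before the </div>, and balance("text</div>") simply drops the stray end tag. A real replacement would also need the HTML5 rules this sketch ignores, such as implied end tags for <p> and <li> and the handling of misnested formatting elements.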

The core parser tries to do some HTML cleanup, but it was never very complete.

Possibilities include:

a) Fix the HTML fixups in the core parser, and make sure non-Tidy output is compatible with the current Tidy output

b) Replace Tidy with another tool that's less annoying

c) Replace the core parser with something that already outputs valid HTML5 (such as Parsoid)

Long-term I like c) but I don't think we're there yet. :)

a) and b) are the things I'd recommend looking at if we really want to kill tidy in the short/medium term.

We hope to be at a point where we can consider using Parsoid output for regular page views by next summer. See https://www.mediawiki.org/wiki/Parsoid/Roadmap.

In Parsoid, an HTML5 treebuilder provides the bulk of the required clean-up. We also approximate the PHP / tidy parser's deviation from the standard cleanup in custom passes to make sure that the semantics of content written against the current setup are preserved.

You can't possibly want to require every MediaWiki installation everywhere to use Parsoid? The node.js dependency is unacceptable in most scenarios.

Rephrasing the summary to reflect that we don't intend to stop fixing unclosed tags, only to get rid of Tidy specifically (we shouldn't kill Tidy without adding something else, so that makes the bug more "atomic").

Parsoid would only be needed for wikitext editing and -templating. HTML-only wikis would basically serve XHTML straight from storage.

(In reply to comment #6)

Parsoid would only be needed for wikitext editing and -templating. HTML-only
wikis would basically serve XHTML straight from storage.

You can't possibly want to require every MediaWiki installation everywhere to switch to editing raw HTML by hand (VE depends on Parsoid…).

(In reply to comment #7)

You can't possibly want to require every MediaWiki installation everywhere to
switch to editing raw HTML by hand (VE depends on Parsoid…).

VE is an HTML editor, so can be used without Parsoid.

(In reply to comment #8)

VE is an HTML editor, so can be used without Parsoid.

Well yeah, okay, this could work. VE, however, has certain software and hardware requirements not all computers meet. And there's the entire issue of "templating" which you dismissed with a single word, which I assume is currently not implemented without wikitext backing it.

VE also currently doesn't work for, say, talk pages (and please don't mention Flow, it will not be ready by next summer) or edit summaries, and there are certain pieces of the interface which show raw source code like diffs (I don't think anybody has implemented rich text diffs yet in MediaWiki, but this is something I'd really like to see).

Using Parsoid for page view is just not workable in short or mid term, no matter how much we would want it.

/offtopic

(In reply to comment #9)

Using Parsoid for page view is just not workable in short or mid term, no
matter how much we would want it.

Which issues do you see apart from rendering quality / compatibility?

(In reply to comment #10)

Which issues do you see apart from rendering quality / compatibility?

Compatibility/availability is the single showstopper issue here. I can't run server-side JavaScript on most free hostings.

(In reply to comment #11)

(In reply to comment #10)

Which issues do you see apart from rendering quality / compatibility?

Compatibility/availability is the single showstopper issue here. I can't run
server-side JavaScript on most free hostings.

So it's back to the policy question of what MediaWiki is intended to be - a great wiki for large- and medium-scale wikis, or a hodge-podge of tools which are limited by ease of download-the-zip-file installation over a proper management tool, rather than by what is best for users?

MediaWiki is intended to be both, if you ask me. I don't see how your question is relevant to the bug, since I am not proposing to make it a hodge-podge.

I run into too many problems because of Tidy. Its main flaw is that it is not compatible with HTML5; it hasn't been updated since 2008(!). Most problems stem from Tidy not allowing any block elements inside inline elements (which is allowed in HTML5): it kicks them out, which results in broken HTML, even though its goal is to prevent exactly that.

Is there no library that has the same functionality and is up to date?

Found a library called HTML Purifier, but that's more of an 'evil code' filter with some 'Tidy inspired' features. Probably not what we want.

There is also tidy-html5 [1], a fork that aims for full HTML5 support.

[1] https://github.com/w3c/tidy-html5

(In reply to Bartosz Dziewoński from comment #11)

(In reply to comment #10)

Which issues do you see apart from rendering quality / compatibility?

Compatibility/availability is the single showstopper issue here. I can't run
server-side JavaScript on most free hostings.

Nor can you typically run tidy there. Virtual machines are really cheap these days (starting at about $30 / year), so cost is no longer the issue that prevents people from installing better tools for the job. Missing packaging is another point, but that is also being addressed (parsoid is now debianized).

In any case, we are working on being ready to start using Parsoid HTML for normal page views this summer. We might not want to maintain the PHP parser in the longer term, and are thus less likely to spend much effort on replacing tidy right now.

(In reply to Gabriel Wicke from comment #16)

Nor can you typically run tidy there.

Citation needed. http://www.php.net/manual/en/book.tidy.php It's definitely more likely to be accessible than having node and being able to shell out.

Virtual machines are really cheap
these days (starting at about $30 / year), so cost is no longer the issue
that prevents people from installing better tools for the job.

$30 is not within the reach of everyone. There's also the fact that you have to have a credit card to get any reputable paid hosting, and that's also not a given in the whole world.

(In reply to Bartosz Dziewoński from comment #17)

$30 is not within the reach of everyone. There's also the fact that you have
to have a credit card to get any reputable paid hosting, and that's also not
a given in the whole world.

Depending on your use case there are also free options like Wikia and other non-profit options without ads. Free shared hosting is not automatically going to be more reputable than free VM hosting, nor do I see systematic differences in payment methods.

You are free to work on MediaWiki on shared hosting of course. All I'm saying is that there are few remaining reasons for us to

- spend major resources on shared hosting support, and
- let it hold back our architectural development at the expense of security, performance and maintainability.

At some point I would like to replace tidy with an API-compatible binary which uses the standard HTML5 parser mechanism. It's on my list of 'free time projects'. There are lots of HTML5 parser libraries now.

Jdforrester-WMF mentioned this in T55784: [EPIC] Use Parsoid HTML for all page views. Dec 7 2014, 7:42 PM

Progress! T89331: Replace HTML4 Tidy in MW parser with an equivalent HTML5 based tool

tidy-html5 seems active
https://github.com/htacg/tidy-html5/commits/master

https://phabricator.wikimedia.org/T94890#1320571 has some timings for HTML5 libraries:

Timings for Obama (1.1M):

SAX parse via libxmljs (node) and no-op handlers: 64ms
XML DOM parse via libxmljs (node): 16ms
- XPATH match for ID (ex: dom.find('//*[@id = "mw123"]')) : 15ms
- XPATH match for class (ex: dom.find("//*[contains(concat(' ', normalize-space(@class), ' '), ' interlanguage-link ')]"): 34ms
HTML5 DOM parse via Mozilla's html5ever: 32ms
- full round-trip with serialization: 60ms
- implemented in Rust, which means that it can be wrapped in a binary PHP or node module (C ABI)
HTML5 DOM parse via domino (node): 220ms
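
For readers puzzled by the long class-matching XPath in the timings above: it pads the whitespace-normalized class attribute with spaces so that only whole class tokens match, not substrings. A rough Python equivalent is sketched below, using the standard library's xml.etree.ElementTree rather than libxmljs. The sample markup and the has_class helper are made up for illustration, and this says nothing about the timings themselves.

```python
import xml.etree.ElementTree as ET

# Made-up sample markup; the ids and classes only mirror the XPath examples.
SAMPLE = """<ul>
  <li id="mw123" class="interlanguage-link interwiki-de">Deutsch</li>
  <li id="mw124" class="not-interlanguage-link">other</li>
  <li id="mw125" class="  interwiki-fr   interlanguage-link ">Francais</li>
</ul>"""


def has_class(element, name):
    # Same idea as contains(concat(' ', normalize-space(@class), ' '), ' name '):
    # compare whole whitespace-separated tokens, not substrings.
    return name in element.get("class", "").split()


root = ET.fromstring(SAMPLE)

# ID lookup: ElementTree's limited XPath subset supports attribute predicates.
by_id = root.find(".//*[@id='mw123']")

# Class lookup done in Python, because ElementTree does not implement
# the XPath functions contains(), concat() or normalize-space().
links = [el for el in root.iter() if has_class(el, "interlanguage-link")]

print(by_id.text, [el.get("id") for el in links])
```

Note that "not-interlanguage-link" is skipped even though it contains the target string, which is exactly what the space padding in the XPath version is for.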

Restricted Application added a subscriber: Aklapper. Jul 31 2015, 10:17 PM

Ricordisamoa subscribed. Aug 8 2015, 11:10 PM

ssastry moved this task from Backlog to In Progress on the MediaWiki-Parser board. Dec 17 2015, 5:55 PM

Danny_B added a project: Tidy. May 28 2016, 10:31 AM

Danny_B removed a subscriber: • wikibugs-l-list.

Danny_B removed a parent task: T4542: [DO NOT USE] HTML Tidy issues (tracking) [superseded by the #Tidy tag]. May 28 2016, 10:50 AM

Krinkle unsubscribed. Jun 2 2016, 8:21 PM

Quiddity updated the task description. Nov 4 2016, 7:08 PM

Quiddity set Security to None.

Just updating status: T89331 seems to be the active task now. There are two implementations of "HTML5-based tidy" -- one in PHP, already committed to core as Balancer.php (part of T114445: [RFC] Balanced templates implementation), and one in Java named Depurate. Both have equivalent semantics; choosing between them in production is a matter of performance, although the PHP implementation is likely to live on in core regardless as an option for those who need an "all in one" install.

Krinkle removed a subtask: T89331: Replace HTML4 Tidy in MW parser with an equivalent HTML5 based tool. Jul 19 2017, 8:34 PM

Krinkle closed this task as a duplicate of T89331: Replace HTML4 Tidy in MW parser with an equivalent HTML5 based tool.
