
Wikimedia static HTML dumps broken
Closed, Declined · Public

Description

Since the developer team has some exciting news (a donation for future use [1] and newly ordered boxes [2]), I'm finally comfortable adding this request: please make static HTML dump files for non-Wikipedia projects.

The biggest wikis of those projects have useful content, sometimes more useful than some small Wikipedias that do have static dumps. Moreover, an HTML dump is one of the resources for spreading the word about some Wikimedia projects :)

Best regards,
[[:m:User:555]]

[1] http://lists.wikimedia.org/pipermail/foundation-l/2008-July/044905.html

[2] http://lists.wikimedia.org/pipermail/wikitech-l/2008-August/038869.html


Version: unspecified
Severity: normal
URL: http://dumps.wikimedia.org/other/static_html_dumps/

Details

Reference
bz15017

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden.

Do note that static HTML dumps are no longer running (the last run was in 2008), but I'm moving this bug to the Datasets product nevertheless.

Assigning back to Nobody. Ariel isn't involved in the static HTML dumps.

Changing the bug summary from "Static HTML dumps for non-Wikipedia projects" to "Wikimedia static HTML dumps broken". This is more accurate of the current status.

Related mailing list thread: http://lists.wikimedia.org/pipermail/wikitech-l/2011-December/056752.html.

What's the status of this bug? Can the dumps please be updated? What's needed to make that happen? Is an RT ticket needed?

Removing the "easy" keyword (-easy).

Not something a typical user can work on (at least not unless there's a list of known issues with the code that need fixing before this is restarted).

It would be useful if someone identified the bugs which actually block this request, among https://bugzilla.wikimedia.org/buglist.cgi?query_format=advanced&component=DumpHTML&resolution=---&product=MediaWiki%20extensions

"Currently, the extension is not really usable without fixing/tweaking the Mediawiki code."
http://www.kiwix.org/index.php/Mediawiki_DumpHTML_extension_improvement#2_-_Revamping_and_fixing_bugs

Restoring the "shell" keyword. Until a shell user tries to re-generate these dumps, it'll be impossible to know what the issues are (if any).

(In reply to comment #10)

Restoring the "shell" keyword. Until a shell user tries to re-generate these
dumps, it'll be impossible to know what the issues are (if any).

Seriously? Kelson is the de facto maintainer of dumpHTML as he runs it for Kiwix, and he says (like everyone else) that it's broken, see comment 9.
Also, this bug is 5+ years old. Nowadays, it should probably be repurposed to ask for ZIM dumps, which make much more sense; we already have many, but it would be nice if they were produced regularly on WMF servers without the laborious steps currently necessary for Kelson.

We will rewrite/fix the DumpHTML extension in the coming months. We have a project granted by Wikimedia France:
http://www.kiwix.org/index.php/Mediawiki_DumpHTML_extension_improvement

We are ready to start manual HTML dumps of Parsoid HTML, using RESTBase and https://github.com/gwicke/htmldumper. The main thing we need to get started is a host with ~1.5-2 TB of disk space. CPU usage would come mainly from LZMA compression; the dump itself is just HTTP requests and I/O. It is also incremental by default: if we distribute a .tar.xz file, clients can run the same dumper against RESTBase to incrementally update their dump to the latest state.

The first runs will likely be a manual affair, but we are happy to then automate things so that we can provide a new dump at regular intervals.
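For illustration, a minimal sketch of that incremental approach, assuming the public REST API endpoint /api/rest_v1/page/html/{title}, a directory-per-article layout with one file per revision, and that the response ETag carries the revision number; the actual htmldumper tool may work differently:

```
# Minimal sketch of an incremental Parsoid HTML dump against the public REST
# API (an assumption; htmldumper itself talks to RESTBase and may differ).
# Layout: one directory per article, one file per revision.
import os
import requests

REST_BASE = "https://en.wikipedia.org/api/rest_v1"  # assumed public endpoint
DUMP_DIR = "htmldump/enwiki"

def dump_page(title):
    """Fetch the current Parsoid HTML for a title and store it under a file
    named after its revision; skip the write if that revision is already on
    disk (the incremental part)."""
    resp = requests.get(f"{REST_BASE}/page/html/{title}")
    resp.raise_for_status()
    # RESTBase-style responses carry an ETag like "<revision>/<render-id>";
    # treating the first component as the revision is an assumption here.
    rev = resp.headers.get("ETag", "unknown").strip('W/"').split("/")[0]
    page_dir = os.path.join(DUMP_DIR, title.replace("/", "%2F"))
    out_path = os.path.join(page_dir, f"{rev}.html")
    if os.path.exists(out_path):
        return "unchanged"
    os.makedirs(page_dir, exist_ok=True)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(resp.text)
    return "updated"

for title in ["Chuck_Berry", "Wikipedia"]:
    print(title, dump_page(title))
```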

@ArielGlenn, could we use any of the existing snapshot hosts for this? Should we ask for new hardware?

@GWicke the snapshot hosts are pretty booked up. We also don't have available space on the datasets host for the output (we should probably order more storage).

@ArielGlenn, okay. Creating a sub-ticket for hardware.

Actually, @ArielGlenn, would you prefer to expand the storage space on dataset hosts over serving the data straight from the dump hosts? The latter might involve less data movement, as we use the previous dumps to speed up the next incremental dump.

We are ready to start manual HTML dumps using Parsoid HTML, using restbase and https://github.com/gwicke/htmldumper.

Fantastic that the WMF is working on this at last! But why not reuse existing tools, i.e. mwoffliner? https://sourceforge.net/p/kiwix/other/ci/master/tree/mwoffliner/mwoffliner.js

@Nemo

Yes, what Gabriel talks about is no more than a tarball of mwoffliner's Parsoid local cache (if mwoffliner uses one).
I also think we should find a way to do that together.

We should expand the data storage regardless, and I would recommend not serving from the host where the dumps are created in any case.

Yes, what Gabriel talks about is no more than a tarball of mwoffliner's Parsoid local cache (if mwoffliner uses one).

Yup. Right now it's basically a directory per article, with a file named after the current revision inside of it. We could also consider some hashing scheme to avoid having millions of directories inside one directory.
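One possible hashing scheme, purely illustrative and not necessarily what the dumper ended up doing, would spread the per-article directories out with hash prefixes, much like the upload directory layout:

```
# Illustrative hashing scheme for spreading per-article directories out,
# similar in spirit to MediaWiki's /2/20/ upload layout (not a description
# of what htmldumper actually does).
import hashlib
import os

def article_path(base, title, rev):
    h = hashlib.md5(title.encode("utf-8")).hexdigest()
    # e.g. htmldump/enwiki/<x>/<xy>/Chuck_Berry/12345.html
    return os.path.join(base, h[0], h[0:2],
                        title.replace("/", "%2F"), f"{rev}.html")

print(article_path("htmldump/enwiki", "Chuck_Berry", 12345))
```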

I also think we should find a way to do that together.

Yes, definitely. Once we have up-to-date local caches of all articles per wiki we can run additional conversions using them.

Change 206849 had a related patch set uploaded (by Nemo bis):
first draft python wrapper for html dumps

https://gerrit.wikimedia.org/r/206849

martin.monperrus subscribed.

To read Wikipedia offline on mobile e-readers with specific OSes (such as Sony DPTS1), the static HTML dump would be the only possible option.

To read Wikipedia offline on mobile e-readers with specific OSes (such as Sony DPTS1), the static HTML dump would be the only possible option.

Are you saying it's impossible to make a cross-platform application like Kiwix work on them?

To read Wikipedia offline on mobile e-readers with specific OSes (such as Sony DPTS1), the static HTML dump would be the only possible option.

We have ported it to Bookeen e-ink readers. I'm not sure about the DPTS1's OS constraints... but I would not rule it out by default.

Are you saying it's impossible to make a cross-platform application like Kiwix work on them?

Not possible without a major hacking effort.

I'm saying that using static HTML pages seems to be a zero-effort option on virtually all platforms.

I'm saying that using static HTML pages seems to be a zero-effort option on virtually all platforms.

Only if we assume users of those platforms have unlimited disk space, proper filesystems, indexing and research capabilities, standard HTML clients, and more.

The HTML dumps were just that, dumps; they are not meant for end-users, mostly for backup/mirroring or other content reuse and for research/analysis.

I'm saying that using static HTML pages seems to be a zero-effort option on virtually all platforms.

I agree with Martin. I also want plain HTML dumps in a format that has plenty of tool support across platforms. SQLite (or XML) is that. ZIM is not even comparable.

Just plain HTML dumps would be so much better than any cooked-up format. Plain HTML/XML gives so much flexibility and is so easy to parse, with parsers available in almost any language or even on the command line.

Please, please don't go for ZIM or anything like that.
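To illustrate the point about parser availability, here is a tiny sketch that pulls the title and links out of a dumped page using only Python's standard library (the file name is hypothetical):

```
# Tiny illustration of how little tooling plain HTML needs: extract the
# <title> and all links from a dumped page with the standard library only.
from html.parser import HTMLParser

class TitleAndLinks(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = TitleAndLinks()
with open("Chuck_Berry.html", encoding="utf-8") as f:  # hypothetical dump file
    parser.feed(f.read())
print(parser.title, len(parser.links), "links")
```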

This ticket is superseded by T133547, since the page content stored by RESTBase is in HTML format. Those dumps should be in production relatively soon.

This ticket is superseded by T133547, since the page content stored by RESTBase is in HTML format.

Except that there is no plan to provide a bare HTML solution, where one would extract the archive and open HTML files in a browser: T93396#1136904. So the use case for those dumps is clearly different from what used to be here. I guess that's for developers who want to build some software based on parsed wiki content, or for researchers who want to query the HTML? It was never clarified.

The use case for the HTML dumps would be similar to the use case for the current XML dumps: to provide Wikimedia content in a universal and easily accessible format (XML, JSON, SQLite). Wikimedia already provides a regular dumping service for the XML dumps. I think it's safe to say that most users of the XML dumps would want HTML dumps as well. Does it matter whether or not the use case is clarified to be development or research or something else? The demand still exists, and has been demonstrated in this thread by three commenters.

At any rate, I'm confused by https://lists.wikimedia.org/pipermail/offline-l/2017-March/001396.html and its cited Phabricator tickets. Have these HTML dumps been put on hold indefinitely? If not, will these HTML dumps reach production anytime soon? I'd love to have a regular HTML dumping service just like the XML one.

+1 for HTML dumps. I work with Wiktionary XML dumps, and getting the data out of them is really tricky. A big portion of the content is generated via Scribunto and is therefore not extractable from the XML alone.

I started to add some microformat-style markup into the generated HTML (see T138709). Having an HTML dump available would make this parsing step even simpler (and faster).

The most recent status update about the dev-friendly HTML dumps is T133547#2944779, I think; my message on the traditional format (a directory of HTML files) was triggered by the use case https://lists.wikimedia.org/pipermail/mediawiki-l/2017-March/046409.html

@Nemo_bis Thanks for the clarification. I agree that there really isn't a clear use case for a directory of HTML files, particularly since the Wikimedia API allows users to get the HTML for any page. Full HTML dumps are much more useful, as there is no real bulk facility to get all this data (i.e. all the HTML for every page on English Wikipedia).

@jberkel In case it's at all useful to you, XOWA does generate HTML from dump XML using Lua / Scribunto. It uses its own non-MediaWiki parser, but the results are pretty accurate. If you're at all interested, please feel free to contact me separately.

Uh? A directory of files would of course be provided as a compressed archive, just like the 7z files at https://dumps.wikimedia.org/other/static_html_dumps/

And there is a very clear use case, i.e. making a static website. I already linked the use case. If you're ok with the dev-friendly formats, you can follow the respective tasks.

@Nemo_bis: Millions of files in a single directory tend to get unwieldy, so this wouldn't be very usable as the default distribution format. When researching format options I actually benchmarked a directory of files against sqlite, and it was *a lot* slower, at least on ext4. That said, if you still want that format it is fairly straightforward to extract files from an archive format or sqlite database with a small script. We should provide a set of such scripts for extraction to a file, old-style XML dump format, and possibly other formats.
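As an example of such an extraction script, a short sketch along these lines could unpack per-page HTML files from an SQLite dump; the table and column names here are assumptions, since the actual schema isn't fixed in this discussion:

```
# Sketch: extract per-page HTML files from a hypothetical SQLite dump.
# The "pages" table and its (title, html) columns are assumptions; adjust
# to whatever schema the real dumps end up using.
import os
import sqlite3

def extract(db_path, out_dir):
    conn = sqlite3.connect(db_path)
    os.makedirs(out_dir, exist_ok=True)
    for title, html in conn.execute("SELECT title, html FROM pages"):
        safe = title.replace("/", "%2F")
        with open(os.path.join(out_dir, f"{safe}.html"), "w",
                  encoding="utf-8") as f:
            f.write(html)
    conn.close()

extract("enwiki-html.sqlite3", "enwiki-html")  # hypothetical file names
```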

The main thing that is missing is actually offering HTML dumps, but I know @ArielGlenn is working on making that (finally) happen.

Edit: See T93396#1188309 for the exact performance numbers, and that task in general for the format option discussion.

The main thing that is missing is actually offering HTML dumps, but I know @ArielGlenn is working on making that (finally) happen.

Excellent! Thanks a lot @ArielGlenn! Should we re-open this issue then?

@Nemo_bis Fair enough: I misunderstood your comment. It seems this task is about creating static .html files, such as those in the compressed archives at https://dumps.wikimedia.org/other/static_html_dumps/. I was more focused on a full HTML dump with an XML / SQLite / JSON format. In other words, something comparable to the current xml datadumps at https://dumps.wikimedia.org/backup-index.html, except in an HTML format.

At any rate, is T133547 now the active issue to track these HTML SQLite dumps for Wikimedia websites? Among other things, it seems to be the only open issue in the lot.

The HTML dumps were just that, dumps; they are not meant for end-users, mostly for backup/mirroring or other content reuse and for research/analysis.

Actually, I developed the DumpHTML extension for WiderNet eGranary. They were setting up non-internet-connected LANs in schools in sub-Saharan Africa and were looking for content to put on the fileservers. I found that inspiring, and decided to spend some time making it happen. As such, the HTML dumps were intended for direct end-user navigation, as you can see from the JavaScript "go" box I developed. I think Kiwix is doing an excellent job in this space today.

Of course, at the time (~2006) storing the complete dump on a phone was unthinkable. The clients were thin, but the fileserver was more than capable of handling large numbers of files and serving them over HTTP.

As such, the HTML dumps were intended for direct end-user navigation

Thanks for responding to that point: we definitely need some clarity on terminology. "Direct navigation" is what this task has in mind: you get a file, uncompress it and start navigating HTML pages as on any website. This is something that anyone can do, but that end-users usually enjoyed via some intermediary (e.g. someone making a web mirror, or local mirror as the LAN you describe). A large part of that is now best done via Kiwix.

Lately this task has been declined in favour of what I called a dev-friendly HTML dump, i.e. a dump which contains only HTML but which no end user can use directly through a simple series of clicks. The dev-friendly HTML dumps have their uses, which however don't satisfy what the original reporters of this bug had in mind. I would appreciate suggestions on how to refer to the "usual" HTML dumps in a way that avoids any confusion, since I believe they still have a purpose. I tried "directory of HTML files", "static HTML files", "static website" and so on, but with little success.

Providing a facility to turn a MediaWiki website into a static website (i.e. a website which doesn't require a database, a PHP interpreter or anything) is definitely less useful for Wikimedia wikis than it used to be before Kiwix (and mwoffliner with Parsoid), but would still be quite useful for non-Wikimedia wikis. Several sysadmins would be happy to uninstall MediaWiki while not breaking the web, as I know from my WikiTeam work.

In short: fixing dumpHTML is still useful. Restoring the usual 7z files with HTML inside is still desirable.

"Direct navigation" is what this task has in mind: you get a file, uncompress it and start
navigating HTML pages as on any website.

I confirm.

Let me recall that this task does not say that Kiwix does not support this usage. Kiwix is indeed great. This task is about the usages where Kiwix is not an appropriate option: for instance, on platforms with no Kiwix or database support.

So yes, the HTML dumps can be considered useful for both end-users and devs.

@Nemo_bis: Millions of files in a single directory tend to get unwieldy, so this wouldn't be very usable as the default distribution format.

Is there some requirement to use a single directory? You could just... not do that, right? We solved this problem for file uploads by using hash partials for the directory and subdirectory names (i.e., the /2/20/ in https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Chuck_Berry_1957.jpg/95px-Chuck_Berry_1957.jpg).

The main thing that is missing is actually offering HTML dumps, but I know @ArielGlenn is working on making that (finally) happen.

This is T133547: set up automated HTML (restbase) dumps on francium?

It's not clear to me whether resolving T133547 would solve @martin.monperrus' use-case(s).

Regarding https://www.mediawiki.org/wiki/Extension:DumpHTML, it seems like that extension should have a Phabricator tag/project in this installation, if it doesn't already, where we can track issues with that code. Tracking issues with DumpHTML seems only tangentially related to this task at this point (see directly below).

Regarding this task specifically, it seems like the Wikimedia Foundation (or Wikimedia) has shifted away from wanting to use DumpHTML, so declining this task ("Wikimedia static HTML dumps broken") seems reasonable enough because this task, as I understand it, was about getting DumpHTML to be able to run successfully on Wikimedia wikis.

While it's not a dump, per se, it's trivial to access the large, existing Varnish HTML cache layer to gather an HTML dump if anyone is interested. For most Wikimedia wikis, I imagine it would take less than a day to get a full dump of the rendered HTML of every page. You could even host this collection of HTML files on Wikimedia Labs or elsewhere as a service. Setting up HTML dumps like this does not require any involvement from the Wikimedia Foundation or Wikimedia.
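A rough sketch of that approach, assuming one walks a list of page titles (e.g. from the published all-titles dump) and fetches each page's rendered HTML over plain HTTP so that most requests are answered from the caches; the title file and output layout are illustrative:

```
# Sketch: gather rendered HTML for a list of pages with plain HTTP requests,
# which for popular pages will typically be answered from the edge caches.
# The title list and output layout here are illustrative assumptions.
import os
import time
import requests

WIKI = "https://en.wikipedia.org/wiki/"
OUT = "cache-dump"

def read_titles(path):
    """Read one title per line, e.g. from a decompressed all-titles dump."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.strip()

def crawl(titles_file, delay=0.1):
    os.makedirs(OUT, exist_ok=True)
    session = requests.Session()
    session.headers["User-Agent"] = "html-dump-sketch/0.1 (hypothetical contact)"
    for title in read_titles(titles_file):
        resp = session.get(WIKI + title)
        if resp.ok:
            out = os.path.join(OUT, title.replace("/", "%2F") + ".html")
            with open(out, "w", encoding="utf-8") as f:
                f.write(resp.text)
        time.sleep(delay)  # be polite; a real run needs rate limiting/retries

crawl("enwiki-all-titles-in-ns0.txt")  # hypothetical title list
```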

We solved this problem for file uploads by using hash partials for the directory and subdirectory names (i.e., the /2/20/ in https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Chuck_Berry_1957.jpg/95px-Chuck_Berry_1957.jpg).

DumpHTML does this already.

This is T133547: set up automated HTML (restbase) dumps on francium (https://phabricator.wikimedia.org/T133547)?

It's not clear to me whether resolving T133547 (https://phabricator.wikimedia.org/T133547) would solve @martin.monperrus' use-case(s).

As far as I understand, T133547 is technically related but does not address the problem discussed here.

This task (title and content) is the appropriate one, and it can be put in a more positive way:

"revive static HTML dumps"

Do we keep it as "Closed, Declined"?


Regarding https://www.mediawiki.org/wiki/Extension:DumpHTML, it seems like that extension should have a Phabricator tag/project in this installation, if it doesn't already, where we can track issues with that code. Tracking issues with DumpHTML seems only tangentially related to this task at this point (see directly below).

MediaWiki-extensions-DumpHTML