
Wikimedia static HTML dumps broken
Closed, Declined · Public

Description

Since the developer team has some exciting news (a donation for future use [1] and newly ordered boxes [2]), I'm finally comfortable adding this request: please make static HTML dump files for non-Wikipedia projects.

The biggest wikis of those projects have useful content, sometimes more useful than some small Wikipedias that do have static dumps. Moreover, an HTML dump is one of the resources for spreading the word about some Wikimedia projects :)

Best regards,
[[:m:User:555]]

[1] http://lists.wikimedia.org/pipermail/foundation-l/2008-July/044905.html

[2] http://lists.wikimedia.org/pipermail/wikitech-l/2008-August/038869.html


Version: unspecified
Severity: normal
URL: http://dumps.wikimedia.org/other/static_html_dumps/

Details

Reference
bz15017

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden.

Do note that static HTML dumps are no longer running (the last run was in 2008), but I'm moving this bug to the Datasets product nevertheless.

Assigning back to Nobody. Ariel isn't involved in the static HTML dumps.

Changing the bug summary from "Static HTML dumps for non-Wikipedia projects" to "Wikimedia static HTML dumps broken". This is more accurate of the current status.

Related mailing list thread: http://lists.wikimedia.org/pipermail/wikitech-l/2011-December/056752.html.

What's the status of this bug? Can the dumps please be updated? What's needed to make that happen? Is an RT ticket needed?

Removing the "easy" keyword (-easy).

Not something a typical user can work on (at least not unless there's a list of known issues with the code that need fixing before this is restarted).

It would be useful if someone identified the bugs which actually block this request, among https://bugzilla.wikimedia.org/buglist.cgi?query_format=advanced&component=DumpHTML&resolution=---&product=MediaWiki%20extensions

"Currently, the extension is not really usable without fixing/tweaking the Mediawiki code."
http://www.kiwix.org/index.php/Mediawiki_DumpHTML_extension_improvement#2_-_Revamping_and_fixing_bugs

Restoring the "shell" keyword. Until a shell user tries to re-generate these dumps, it'll be impossible to know what the issues are (if any).

(In reply to comment #10)

Restoring the "shell" keyword. Until a shell user tries to re-generate these
dumps, it'll be impossible to know what the issues are (if any).

Seriously? Kelson is the de facto maintainer of dumpHTML as he runs it for Kiwix, and he says (like everyone else) that it's broken, see comment 9.
Also, this bug is 5+ years old. Nowadays, it should probably be repurposed to ask for ZIM dumps, which make much more sense; we already have many, but it would be nice if they were produced regularly on WMF servers without the laborious steps currently necessary for Kelson.

We will rewrite/fix the DumpHTML extension in the coming months. We have a project granted by Wikimedia France:
http://www.kiwix.org/index.php/Mediawiki_DumpHTML_extension_improvement

We are ready to start manual HTML dumps of Parsoid HTML, using RESTBase and https://github.com/gwicke/htmldumper. The main thing we need to get started is a host with ~1.5-2 TB of disk space. CPU usage would come mainly from LZMA compression; the dump itself is just HTTP requests and I/O. It is also incremental by default: if we distribute a .tar.xz file, clients can run the same dumper against RESTBase to incrementally update their dump to the latest state.

The first runs will likely be a manual affair, but we are happy to then automate things so that we can provide a new dump at regular intervals.
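For illustration, a minimal sketch of that incremental approach, assuming the public REST API endpoint /api/rest_v1/page/html/{title}, a directory-per-article layout with one file per revision, and that the response ETag carries the revision number; the actual htmldumper tool may work differently:

```
# Minimal sketch of an incremental Parsoid HTML dump against the public REST
# API (an assumption; htmldumper itself talks to RESTBase and may differ).
# Layout: one directory per article, one file per revision.
import os
import requests

REST_BASE = "https://en.wikipedia.org/api/rest_v1"  # assumed public endpoint
DUMP_DIR = "htmldump/enwiki"

def dump_page(title):
    """Fetch the current Parsoid HTML for a title and store it under a file
    named after its revision; skip the write if that revision is already on
    disk (the incremental part)."""
    resp = requests.get(f"{REST_BASE}/page/html/{title}")
    resp.raise_for_status()
    # RESTBase-style responses carry an ETag like "<revision>/<render-id>";
    # treating the first component as the revision is an assumption here.
    rev = resp.headers.get("ETag", "unknown").strip('W/"').split("/")[0]
    page_dir = os.path.join(DUMP_DIR, title.replace("/", "%2F"))
    out_path = os.path.join(page_dir, f"{rev}.html")
    if os.path.exists(out_path):
        return "unchanged"
    os.makedirs(page_dir, exist_ok=True)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(resp.text)
    return "updated"

for title in ["Chuck_Berry", "Wikipedia"]:
    print(title, dump_page(title))
```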

@ArielGlenn, could we use any of the existing snapshot hosts for this? Should we ask for new hardware?

@GWicke the snapshot hosts are pretty booked up. We also don't have available space on the datasets host for the output (we should probably order more storage).

@ArielGlenn, okay. Creating a sub-ticket for hardware.

Actually, @ArielGlenn, would you prefer to expand the storage space on dataset hosts over serving the data straight from the dump hosts? The latter might involve less data movement, as we use the previous dumps to speed up the next incremental dump.

We are ready to start manual HTML dumps using Parsoid HTML, using restbase and https://github.com/gwicke/htmldumper.

Fantastic that the WMF is working on this at last! But why not reuse existing tools, i.e. mwoffliner? https://sourceforge.net/p/kiwix/other/ci/master/tree/mwoffliner/mwoffliner.js

@Nemo

Yes, what Gabriel talks about is no more than a tarball of mwoffliner's Parsoid local cache (if mwoffliner uses one).
I also think we should find a way to do that together.

We should expand the data storage regardless, and I would recommend not serving from the host where the dumps are created in any case.

Yes, what Gabriel talks about is no more than a tarball of mwoffliner's Parsoid local cache (if mwoffliner uses one).

Yup. Right now it's basically a directory per article, with a file named after the current revision inside of it. We could also consider some hashing scheme to avoid having millions of directories inside one directory.
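One possible hashing scheme, purely illustrative and not necessarily what the dumper ended up doing, would spread the per-article directories out with hash prefixes, much like the upload directory layout:

```
# Illustrative hashing scheme for spreading per-article directories out,
# similar in spirit to MediaWiki's /2/20/ upload layout (not a description
# of what htmldumper actually does).
import hashlib
import os

def article_path(base, title, rev):
    h = hashlib.md5(title.encode("utf-8")).hexdigest()
    # e.g. htmldump/enwiki/<x>/<xy>/Chuck_Berry/12345.html
    return os.path.join(base, h[0], h[0:2],
                        title.replace("/", "%2F"), f"{rev}.html")

print(article_path("htmldump/enwiki", "Chuck_Berry", 12345))
```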

I also think we should find a way to do that together.

Yes, definitely. Once we have up-to-date local caches of all articles per wiki we can run additional conversions using them.

Change 206849 had a related patch set uploaded (by Nemo bis):
first draft python wrapper for html dumps

https://gerrit.wikimedia.org/r/206849

martin.monperrus subscribed.

To read Wikipedia offline on mobile e-readers with specific OSes (such as Sony DPTS1), the static HTML dump would be the only possible option.

To read Wikipedia offline on mobile e-readers with specific OSes (such as Sony DPTS1), the static HTML dump would be the only possible option.

Are you saying it's impossible to make a cross-platform application like Kiwix work on them?

To read Wikipedia offline on mobile e-readers with specific OSes (such as Sony DPTS1), the static HTML dump would be the only possible option.

We have ported it to Bookeen e-ink readers. I'm not sure about the DPTS1's OS constraints... but I would not rule it out by default.

Are you saying it's impossible to make a cross-platform application like Kiwix work on them?

Not possible without a major hacking effort.

I'm saying that using static HTML pages seems to be a zero-effort option on virtually all platforms.

I'm saying that using static HTML pages seems to be a zero-effort option on virtually all platforms.

Only if we assume users of those platforms have unlimited disk space, proper filesystems, indexing and research capabilities, standard HTML clients, and more.

The HTML dumps were just that, dumps; they are not meant for end-users, mostly for backup/mirroring or other content reuse and for research/analysis.

I'm saying that using static HTML pages seems to be a zero-effort option on virtually all platforms.

I agree with Martin. I also want plain HTML dumps in a format that has plenty of tool support across platforms. SQLite (or XML) is that. ZIM is not even comparable.

Just plain HTML dumps would be so much better than any cooked-up format. Plain HTML/XML gives so much flexibility and is so easy to parse, with parsers available in almost any language or even on the command line.

Please, please don't go for ZIM or anything like that.
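To illustrate the point about parser availability, here is a tiny sketch that pulls the title and links out of a dumped page using only Python's standard library (the file name is hypothetical):

```
# Tiny illustration of how little tooling plain HTML needs: extract the
# <title> and all links from a dumped page with the standard library only.
from html.parser import HTMLParser

class TitleAndLinks(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = TitleAndLinks()
with open("Chuck_Berry.html", encoding="utf-8") as f:  # hypothetical dump file
    parser.feed(f.read())
print(parser.title, len(parser.links), "links")
```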

This ticket is superseded by T133547, since the page content stored by RESTBase is in HTML format. Those dumps should be in production relatively soon.

This ticket is superseded by T133547, since the page content stored by RESTBase is in HTML format.

Except that there is no plan to provide a bare HTML solution, where one would extract the archive and open HTML files in a browser: T93396#1136904. So the use case for those dumps is clearly different from what used to be here. I guess that's for developers who want to build some software based on parsed wiki content, or for researchers who want to query the HTML? It was never clarified.

The use case for the HTML dumps would be similar to the use case for the current XML dumps: to provide Wikimedia content in a universal and easily accessible format (XML, JSON, SQLite). Wikimedia already provides a regular dumping service for the XML dumps. I think it's safe to say that most users of the XML dumps would want HTML dumps as well. Does it matter whether or not the use case is clarified to be development or research or something else? The demand still exists, and has been demonstrated in this thread by three commenters.

At any rate, I'm confused by https://lists.wikimedia.org/pipermail/offline-l/2017-March/001396.html and its cited Phabricator tickets. Have these HTML dumps been put on hold indefinitely? If not, will these HTML dumps reach production anytime soon? I'd love to have a regular HTML dumping service just like the XML one.

+1 for HTML dumps. I work with Wiktionary XML dumps, and getting the data out of them is really tricky. A big portion of the content is generated via Scribunto and is therefore not extractable from the XML alone.

I started to add some microformat-style markup into the generated HTML (see T138709). Having an HTML dump available would make this parsing step even simpler (and faster).

The most recent status update about the dev-friendly HTML dumps is T133547#2944779, I think; my message on the traditional format (a directory of HTML files) was triggered by the use case https://lists.wikimedia.org/pipermail/mediawiki-l/2017-March/046409.html

@Nemo_bis Thanks for the clarification. I agree that there really isn't a clear use case for a directory of HTML files, particularly since the Wikimedia API allows users to get the HTML for any page. Full HTML dumps are much more useful, as there is no real bulk facility to get all this data (i.e. all the HTML for every page on English Wikipedia).

@jberkel In case it's at all useful to you, XOWA does generate HTML from dump XML using Lua / Scribunto. It uses its own non-MediaWiki parser, but the results are pretty accurate. If you're at all interested, please feel free to contact me separately.

Uh? A directory of files would of course be provided as a compressed archive, just like the 7z files at https://dumps.wikimedia.org/other/static_html_dumps/

And there is a very clear use case, i.e. making a static website. I already linked the use case. If you're ok with the dev-friendly formats, you can follow the respective tasks.

@Nemo_bis: Millions of files in a single directory tend to get unwieldy, so this wouldn't be very usable as the default distribution format. When researching format options I actually benchmarked a directory of files against sqlite, and it was *a lot* slower, at least on ext4. That said, if you still want that format it is fairly straightforward to extract files from an archive format or sqlite database with a small script. We should provide a set of such scripts for extraction to a file, old-style XML dump format, and possibly other formats.
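As an example of such an extraction script, a short sketch along these lines could unpack per-page HTML files from an SQLite dump; the table and column names here are assumptions, since the actual schema isn't fixed in this discussion:

```
# Sketch: extract per-page HTML files from a hypothetical SQLite dump.
# The "pages" table and its (title, html) columns are assumptions; adjust
# to whatever schema the real dumps end up using.
import os
import sqlite3

def extract(db_path, out_dir):
    conn = sqlite3.connect(db_path)
    os.makedirs(out_dir, exist_ok=True)
    for title, html in conn.execute("SELECT title, html FROM pages"):
        safe = title.replace("/", "%2F")
        with open(os.path.join(out_dir, f"{safe}.html"), "w",
                  encoding="utf-8") as f:
            f.write(html)
    conn.close()

extract("enwiki-html.sqlite3", "enwiki-html")  # hypothetical file names
```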

The main thing that is missing is actually offering HTML dumps, but I know @ArielGlenn is working on making that (finally) happen.

Edit: See T93396#1188309 for the exact performance numbers, and that task in general for the format option discussion.

The main thing that is missing is actually offering HTML dumps, but I know @ArielGlenn is working on making that (finally) happen.

Excellent! Thanks a lot @ArielGlenn! Should we re-open this issue then?

@Nemo_bis Fair enough: I misunderstood your comment. It seems this task is about creating static .html files, such as those in the compressed archives at https://dumps.wikimedia.org/other/static_html_dumps/. I was more focused on a full HTML dump with an XML / SQLite / JSON format. In other words, something comparable to the current xml datadumps at https://dumps.wikimedia.org/backup-index.html, except in an HTML format.

At any rate, is T133547 now the active issue to track these HTML SQLite dumps for Wikimedia websites? Among other things, it seems to be the only open issue in the lot.

The HTML dumps were just that, dumps; they are not meant for end-users, mostly for backup/mirroring or other content reuse and for research/analysis.

Actually, I developed the DumpHTML extension for WiderNet eGranary. They were setting up non-internet-connected LANs in schools in sub-Saharan Africa and were looking for content to put on the fileservers. I found that inspiring, and decided to spend some time making it happen. As such, the HTML dumps were intended for direct end-user navigation, as you can see from the JavaScript "go" box I developed. I think Kiwix is doing an excellent job in this space today.

Of course, at the time (~2006) storing the complete dump on a phone was unthinkable. The clients were thin, but the fileserver was more than capable of handling large numbers of files and serving them over HTTP.

As such, the HTML dumps were intended for direct end-user navigation

Thanks for responding to that point: we definitely need some clarity on terminology. "Direct navigation" is what this task has in mind: you get a file, uncompress it and start navigating HTML pages as on any website. This is something that anyone can do, but that end-users usually enjoyed via some intermediary (e.g. someone making a web mirror, or local mirror as the LAN you describe). A large part of that is now best done via Kiwix.

Lately this task has been declined in favour of what I called a dev-friendly HTML dump, i.e. a dump which contains only HTML but which no end user can use directly through a simple series of clicks. The dev-friendly HTML dumps have their uses, which however don't satisfy what the original reporters of this bug had in mind. I would appreciate suggestions on how to refer to the "usual" HTML dumps in a way that avoids any confusion, since I believe they still have a purpose. I tried "directory of HTML files", "static HTML files", "static website" and so on, but with little success.

Providing a facility to turn a MediaWiki website into a static website (i.e. a website which doesn't require a database, a PHP interpreter or anything) is definitely less useful for Wikimedia wikis than it used to be before Kiwix (and mwoffliner with Parsoid), but would still be quite useful for non-Wikimedia wikis. Several sysadmins would be happy to uninstall MediaWiki while not breaking the web, as I know from my WikiTeam work.

In short: fixing dumpHTML is still useful. Restoring the usual 7z files with HTML inside is still desirable.

"Direct navigation" is what this task has in mind: you get a file, uncompress it and start
navigating HTML pages as on any website.

I confirm.

Let me recall that this task does not say that Kiwix does not support this usage. Kiwix is indeed great. This task is about the usages where Kiwix is not an appropriate option: for instance, on platforms with no Kiwix or database support.

So yes, the HTML dumps can be considered useful for both end-users and devs.

@Nemo_bis: Millions of files in a single directory tend to get unwieldy, so this wouldn't be very usable as the default distribution format.

Is there some requirement to use a single directory? You could just... not do that, right? We solved this problem for file uploads by using hash partials for the directory and subdirectory names (i.e., the /2/20/ in https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Chuck_Berry_1957.jpg/95px-Chuck_Berry_1957.jpg).

The main thing that is missing is actually offering HTML dumps, but I know @ArielGlenn is working on making that (finally) happen.

This is T133547: set up automated HTML (restbase) dumps on francium?

It's not clear to me whether resolving T133547 would solve @martin.monperrus' use-case(s).

Regarding https://www.mediawiki.org/wiki/Extension:DumpHTML, it seems like that extension should have a Phabricator tag/project in this installation, if it doesn't already, where we can track issues with that code. Tracking issues with DumpHTML seems only tangentially related to this task at this point (see directly below).

Regarding this task specifically, it seems like the Wikimedia Foundation (or Wikimedia) has shifted away from wanting to use DumpHTML, so declining this task ("Wikimedia static HTML dumps broken") seems reasonable enough because this task, as I understand it, was about getting DumpHTML to be able to run successfully on Wikimedia wikis.

While it's not a dump, per se, it's trivial to access the large, existing Varnish HTML cache layer to gather an HTML dump if anyone is interested. For most Wikimedia wikis, I imagine it would take less than a day to get a full dump of the rendered HTML of every page. You could even host this collection of HTML files on Wikimedia Labs or elsewhere as a service. Setting up HTML dumps like this does not require any involvement from the Wikimedia Foundation or Wikimedia.
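A rough sketch of that approach, assuming one walks a list of page titles (e.g. from the published all-titles dump) and fetches each page's rendered HTML over plain HTTP so that most requests are answered from the caches; the title file and output layout are illustrative:

```
# Sketch: gather rendered HTML for a list of pages with plain HTTP requests,
# which for popular pages will typically be answered from the edge caches.
# The title list and output layout here are illustrative assumptions.
import os
import time
import requests

WIKI = "https://en.wikipedia.org/wiki/"
OUT = "cache-dump"

def read_titles(path):
    """Read one title per line, e.g. from a decompressed all-titles dump."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.strip()

def crawl(titles_file, delay=0.1):
    os.makedirs(OUT, exist_ok=True)
    session = requests.Session()
    session.headers["User-Agent"] = "html-dump-sketch/0.1 (hypothetical contact)"
    for title in read_titles(titles_file):
        resp = session.get(WIKI + title)
        if resp.ok:
            out = os.path.join(OUT, title.replace("/", "%2F") + ".html")
            with open(out, "w", encoding="utf-8") as f:
                f.write(resp.text)
        time.sleep(delay)  # be polite; a real run needs rate limiting/retries

crawl("enwiki-all-titles-in-ns0.txt")  # hypothetical title list
```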

We solved this problem for file uploads by using hash partials for the directory and subdirectory names (i.e., the /2/20/ in https://upload.wikimedia.org/wikipedia/commons/thumb/2/20/Chuck_Berry_1957.jpg/95px-Chuck_Berry_1957.jpg).

DumpHTML does this already.

This is T133547: set up automated HTML (restbase) dumps on francium (https://phabricator.wikimedia.org/T133547)?

It's not clear to me whether resolving T133547 (https://phabricator.wikimedia.org/T133547) would solve @martin.monperrus' use-case(s).

As far as I understand, T133547 is technically related but does not address the problem discussed here.

This task (title and content) is the appropriate one, and it can be put in a more positive way:

"revive static HTML dumps"

Do we keep it as "Closed, Declined"?


Regarding https://www.mediawiki.org/wiki/Extension:DumpHTML, it seems like that extension should have a Phabricator tag/project in this installation, if it doesn't already, where we can track issues with that code. Tracking issues with DumpHTML seems only tangentially related to this task at this point (see directly below).

MediaWiki-extensions-DumpHTML