
Set up generation of JSON dumps for wikidata.org
Closed, Resolved (Public)

Description

We would like to make the contents of wikidata.org available as a dump using our canonical JSON format. The maintenance script for doing this is

extensions/Wikibase/repo/maintenance/dumpJson.php

This will send a JSON serialization of all data entities to standard output, so I suppose that would best be piped through bz2.
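
For example, the dump could be streamed straight into bzip2 (a sketch; the mwscript invocation mirrors the one given further down, and the output file name is just illustrative):

# Sketch: stream the JSON serialization to stdout and compress it on the fly.
/usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dumpJson.php --wiki wikidatawiki | bzip2 > wikidata.json.bz2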

This should work as-is, but there are several things that we should look out for or try out:

  • I don't know how long it will take to make a complete dump. I expect that it'll be roughly the same as making an XML dump of the current revisions.
  • I don't know how much RAM is required. Currently, all the IDs of the entities to output will be loaded into memory (by virtue of how the MySQL client library works) - that's a few dozen million rows. As a guess, 1GB should be enough.
  • We may have to make the script more resilient to sporadic failures, especially since a failure would currently mean restarting the dump.
  • Perhaps sharding would be useful: the script supports --sharding-factor and --shard to control how many shards there should be, and which shard the script should process. Combining the output files is not as seamless as it could be, though (it involves chopping off lines at the beginning and the end of files; see the sketch below).
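
As a rough sketch of that post-processing (assuming each shard file is a complete JSON array with the opening bracket on its first line and the closing bracket on its last line; separators between the shards may need adjusting as well):

# Hypothetical merge of shard output files into one array; file names are placeholders.
( echo '['
  for f in wikidata-shard0.json wikidata-shard1.json wikidata-shard2.json; do
    sed '1d;$d' "$f"   # drop each shard's own opening and closing bracket lines
  done
  echo ']'
) > wikidata.json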

Version: unspecified
Severity: enhancement
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=57015

Details

Reference
bz54369

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 2:04 AM
bzimport set Reference to bz54369.

Addendum: use the --output command line option to specify an output file instead of using stdout. This enables progress information and error reports to be written to stdout.

Steps are:

  • try it manually (e.g. screen session) on terbium with test.wikidata
  • try it manually with wikidatawiki
  • figure out where the output will go (somewhere on dumps.wikimedia.org)
  • set up a cron job to have it run periodically and automatically (a sketch follows this list).
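
A rough idea of what such a cron job could look like (the schedule, output path, and log destination are placeholders, not decisions made here):

# Hypothetical crontab entry: run the dump every Monday at 03:00.
0 3 * * 1 /usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dumpJson.php --wiki wikidatawiki --output /data/dumps/wikidata.json > /var/log/wikidata-dump.log 2>&1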

If someone could give me the command (with args) to run this on test.wikidata, I'll do that on terbium.

/usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dumpJson.php --wiki wikidatawiki --output wikidata.json

Optionally, the sharding parameters can be used to allow the script to go faster:

e.g.

--shard 2 --sharding-factor 3

Note that the script doesn't fork itself for sharding. With --sharding-factor 3, you'll need 3 cron jobs (possibly on different boxes) with --shard 0, --shard 1, and --shard 2, respectively.
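
For example, the three jobs would run something like this (a sketch; the output file names are placeholders):

/usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dumpJson.php --wiki wikidatawiki --sharding-factor 3 --shard 0 --output wikidata-shard0.json
/usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dumpJson.php --wiki wikidatawiki --sharding-factor 3 --shard 1 --output wikidata-shard1.json
/usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dumpJson.php --wiki wikidatawiki --sharding-factor 3 --shard 2 --output wikidata-shard2.json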

But for now, we should try without sharding; collecting the output for all shards into a single file would need some post-processing anyway.

The script needs to be able to pipe to bzip2 while sending errors elsewhere.

Three options for compression & error reporting:

  1. Don't specify --output - then it'll write to stdout, and you can bzip it. Progress and error reporting is silenced, though.
  2. Use PHP's bzip2 stream wrapper: --output compress.bzip2://wikidata.json.bz2
  3. Make dumpJson.php always write errors to stderr; then it's no longer important whether you use --output or not.

Option #2 or #3 could work. I would try #2.
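
Spelled out, option #2 would look something like this (assuming PHP on the cluster has the bz2 extension built in; the file name is illustrative):

/usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dumpJson.php --wiki wikidatawiki --output compress.bzip2://wikidata.json.bz2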

Silencing progress and error reporting is not an acceptable option, in my opinion.

While I think option #2 (using the stream wrapper for compression) would work, it gives no control over compression parameters.

I have filed bug 57015 for option #3; I think it would be nice to have that. But please go ahead and try and set up the dump script already, using the stream wrapper. There's no reason to wait for the logging options.

--output compress.bzip2://wikidata.json.bz2 is not doing it for me; I'm getting:

Warning: fopen(compress.bzip2://wikidata.json.bz2): failed to open stream: operation failed in /usr/local/apache/common-local/php-1.23wmf3/extensions/Wikibase/repo/maintenance/dumpJson.php on line 102
[fd9eae8e] [no req] Exception from line 105 of /usr/local/apache/common-local/php-1.23wmf3/extensions/Wikibase/repo/maintenance/dumpJson.php: Failed to open compress.bzip2://wikidata.json.bz2!

@Ariel: maybe you just don't have write permission there? Or PHP doesn't have bzip2 support built in? Anyway...
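
A quick way to check whether the CLI PHP build has the bzip2 wrapper registered (a diagnostic suggestion, not part of the original exchange):

php -r 'var_dump(in_array("compress.bzip2", stream_get_wrappers()));'   # true if the bz2 stream wrapper is available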

I have made a patch introducing a --log option to control where log messages go, see I561a003.

I have also found and fixed a bug that caused invalid JSON in case an entity couldn't be loaded/dumped, see Ief7664d6.

I guess we have to wait for these to get deployed. Or just backport them; these patches are nicely local.

I was writing into my home directory on terbium, so surely I had permissions. Anyway, once the --log option is in, this is a moot point.
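
Once --log is available, the invocation might look like this (a sketch: it assumes --log takes a file path, as proposed in I561a003, and both paths are placeholders):

/usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/dumpJson.php --wiki wikidatawiki \
    --output compress.bzip2://wikidata.json.bz2 --log /var/log/wikidata-dump.log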

Just an update: after 22 hours of running against wikidata, we are at 202400 entities dumped. So sharding is going to be necessary; please start looking into what it would take to have one cron job for this and whatever post-processing might be needed as well. Alternatively, are there speedups possible in the script?

The dump concluded at Sat Dec 7 13:57:32 UTC 2013 with 221400 entities. File size (bz2 compressed): 98 MB.

The dump *concluded* with 221400 entities dumped? That's... wrong. We have more than 10 million entities on wikidata.org. Any idea how to best investigate this?

Also, what you report seems *extremely* slow. It took about 10 seconds for each entity (20k additional entities between the 5th and the 7th)? Wow...

There are no obvious points for speedup, but I could do some profiling. One thing that could be done of course is to not load the data from the database, but instead process an XML dump. Would that be preferable to you?

Hi Daniel, how quickly does it run in your tests? Do you have the full dataset available via the XML dump?

Creating a dump for the 349 items I have in the DB on my laptop takes about 2 seconds. These are not very large, but then, most items on wikidata are not large either (while a few are very large).

> time php repo/maintenance/dumpJson.php > /dev/null

Processed 100 entities.
Processed 200 entities.
Processed 300 entities.
Processed 349 entities.
 
real	0m2.385s
user	0m1.996s
sys	0m0.088s

All data is available in the XML dumps, but we'd need two passes (for the first pass, a dump of the property namespace would be sufficient). I don't currently have a dump locally.

The script would need quite a bit of refactoring to work based on XML dumps; I'd like to avoid that if we are not sure this is necessary / desirable.

I don't think we have a good way to test with a large data set ourselves at the moment. Importing XML dumps does not really work with Wikibase (this is an annoying issue, but not easy to fix).

Addendum to my comment above: I suspect one large factor is loading the JSON from the external store. Is there a way to optimize that? We are only using the latest revision, so grouping revisions wouldn't help...

Still, going from 100+ items per second to 10 seconds per item is surprising, to say the least.

Processed 221200 entities.
Processed 221300 entities.
Processed 221400 entities.
Sat Dec 7 13:57:32 UTC 2013

That's the end of the output from the job (I still have the screen session on terbium).

I'm writing bz2 compressed output. You're writing to /dev/null. That's going to be a big difference.

Here's the start of the last item in the dumped output:

{"id":"Q235312","type":"item","descriptions":...

221447 lines in the uncompressed file, total size of 1.1 GB uncompressed.

After I tested this myself on terbium, I found out that the PHP script is constantly leaking memory... I think this is because Wikibase is "smart" and statically caches all entities ever requested.

Another thing we noticed is that it's apparently not getting all entity IDs from the query; it would probably be wise to batch the query that fetches the entity IDs.