We would like to make the contents of wikidata.org available as a dump using our canonical JSON format. The maintenance script for doing this is:
extensions/Wikibase/repo/maintenance/dumpJson.php
This will send a JSON serialization of all data entities to standard output, so the output should probably be piped through bzip2.
This should work as-is, but there are several things that we should look out for or try out:
- I don't know how long it will take to make a complete dump. I expect that it'll be roughly the same as making an XML dump of the current revisions.
- I don't know how much RAM is required. Currently, all the IDs of the entities to output will be loaded into memory (by virtue of how the MySQL client library works) - that's a few tens of millions of rows. As a guess, 1 GB should be enough.
- We may have to make the script more resilient to sporadic failures, especially since a failure would currently mean restarting the dump.
- Perhaps sharding would be useful: the script supports --sharding-factor and --shard to control how many shards there should be, and which shard the script should process. Combining the output files is not as seamless as it could be, though (it involves chopping off lines at the beginning and the end of files).
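
To illustrate the combining step for sharded output, here is a rough sketch in Python. It is not part of dumpJson.php. It assumes (this should be checked against real output) that each shard file is a JSON array with "[" alone on the first line, "]" alone on the last line, and one entity per line in between, separated by trailing commas. The function name combine_shards is made up for this example.

```python
import io
import json
from typing import Iterable, TextIO


def combine_shards(shards: Iterable[Iterable[str]], out: TextIO) -> int:
    """Merge shard dumps into one JSON array; return the number of entities written.

    Assumed (hypothetical) shard layout: '[' on the first line, ']' on the
    last line, one entity per line in between, separated by trailing commas.
    """
    count = 0
    out.write("[\n")
    for shard in shards:
        for raw in shard:
            line = raw.strip()
            # Chop off the bracket lines that open and close each shard file.
            if line in ("", "[", "]"):
                continue
            # Drop the separator comma; it is re-added between entities below.
            if line.endswith(","):
                line = line[:-1]
            if count:
                out.write(",\n")
            out.write(line)
            count += 1
    out.write("\n]\n")
    return count


if __name__ == "__main__":
    # Two tiny in-memory "shard files" standing in for real dump output.
    shard0 = io.StringIO('[\n{"id":"Q1"},\n{"id":"Q3"}\n]\n')
    shard1 = io.StringIO('[\n{"id":"Q2"}\n]\n')
    merged = io.StringIO()
    n = combine_shards([shard0, shard1], merged)
    # The merged text should parse as a single JSON array.
    print(n, [entity["id"] for entity in json.loads(merged.getvalue())])
```

Since the function only iterates over lines, it should also work with files opened in text mode (for example via bz2.open(path, "rt") for compressed shards), and it never holds more than one line in memory at a time.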
Version: unspecified
Severity: enhancement
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=57015