
Wikidata JSON dump: better compression than gzip
Closed, Resolved (Public)

Description

I converted 20140721.json.gz to 20140721.json.xz and 20140721.json.bz2: the gz file is 2.9 GB, while the other two are 2.0 GB each. The space saved seems worth the effort.

For decompression, which is what matters, xz took 4 min vs. 2 min for gz. All these formats are supported natively by tar -af etc.; in recent versions, xz is parallel. I'm quoting from memory because I killed the screen by mistake, but LZMA/xz seems to be the best choice.
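The comparison above can be repeated on a small scale with a short script. This is only a sketch using Python's standard gzip, bz2 and lzma modules: the SAMPLE data is a made-up stand-in for a slice of the dump, and the compare helper is a hypothetical name, so the sizes and timings it prints will not match the figures quoted for the real 20140721 file.

```python
import bz2
import gzip
import lzma
import time

# Assumption: synthetic JSON-like lines standing in for a slice of the dump.
SAMPLE = b"".join(
    b'{"id":"Q%d","type":"item","labels":{"en":{"language":"en","value":"item %d"}}},\n'
    % (i, i)
    for i in range(20000)
)

# One (compress, decompress) pair per format discussed in this task.
CODECS = {
    "gz": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}


def compare(data):
    """Return {format: (compressed_size_bytes, decompress_seconds)}."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        packed = compress(data)
        start = time.perf_counter()
        restored = decompress(packed)
        elapsed = time.perf_counter() - start
        # Verify the round trip so a timing is never reported for bad output.
        if restored != data:
            raise ValueError(f"{name} round trip failed")
        results[name] = (len(packed), elapsed)
    return results


if __name__ == "__main__":
    for name, (size, seconds) in compare(SAMPLE).items():
        print(f"{name}: {size} bytes, decompressed in {seconds:.3f} s")
```

Results on synthetic data only indicate the trend; a fair test would run the same loop over the actual dump file with the command-line tools.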


Version: unspecified
Severity: normal

Details

Reference
bz68793

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 3:31 AM
bzimport set Reference to bz68793.

(In reply to Nemo from comment #0)

in recent versions, xz is parallel

Source: http://sourceforge.net/p/lzmautils/discussion/708858/thread/d37155d1/#d8af (currently Ubuntu has liblzma 5.1.0alpha, Fedora 20 has 5.1.2alpha).

This could also be implemented by offering several formats, as in the case of daily dumps. In this case, the URLs of the files should first be made more standard to help people find these files: Bug 70385 and Bug 68792.
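As a rough illustration of offering several formats, the sketch below reads an uncompressed stream once and writes gz, bz2 and xz copies side by side. It is not taken from the actual dump scripts: write_all_formats, base_path and chunk_size are hypothetical names, and only Python's standard library is assumed.

```python
import bz2
import gzip
import io
import lzma
import os
import tempfile
from contextlib import ExitStack

# File extension -> opener from the standard library.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def write_all_formats(source, base_path, chunk_size=1 << 20):
    """Read `source` once and write base_path + .gz, .bz2 and .xz.

    Returns the list of paths written.
    """
    paths = [base_path + ext for ext in OPENERS]
    with ExitStack() as stack:
        # ExitStack closes every output even if a write fails part-way.
        outputs = [
            stack.enter_context(opener(base_path + ext, "wb"))
            for ext, opener in OPENERS.items()
        ]
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            for out in outputs:
                out.write(chunk)
    return paths


if __name__ == "__main__":
    # Assumption: a tiny in-memory payload instead of the real dump.
    payload = b'[\n{"id":"Q1"},\n{"id":"Q2"}\n]\n' * 1000
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_all_formats(io.BytesIO(payload), os.path.join(tmp, "dump.json")):
            print(os.path.basename(path), os.path.getsize(path), "bytes")
```

Reading the source only once avoids decompressing or regenerating the dump separately for each format, at the cost of running all compressors in the same process.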

Nemo_bis lowered the priority of this task from Medium to Low. Apr 9 2015, 7:20 AM
Nemo_bis set Security to None.

@hoo, do you want to have a go at this? It would require some small changes to your script(s) I think but that's all.

Bzip2 dumps are now being generated as per T115222: [Story] Compress JSON data dumps in Bzip2. Do we want to provide even more formats for download?

No, I think we're done here.