
The bzip2 python stuff is really ugly. Maybe parts should be redone in C.
Closed, Resolved (Public)

Description

There are a couple of files of awful low-level bzip2 stuff in Python that we need in order to seek around and find block boundaries in ginormous files. The Python code is ewwww gross. We would likely be better off hacking up the bzip2 library and using that instead of the current Python interface to the standard bzip2 library. The question is how much it is worth investing more time in that. Low priority for now.
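For anyone wondering what "find block boundaries" involves, here is a minimal sketch, not the code this task is about. It relies on the bzip2 format fact that every compressed block begins with the 48-bit marker 0x314159265359 and that blocks are bit-aligned rather than byte-aligned, so the scan has to slide one bit at a time. The function name `find_block_offsets` is made up for illustration.

```python
import bz2

BLOCK_MAGIC = 0x314159265359   # 48-bit marker at the start of every bzip2 block
MASK48 = (1 << 48) - 1


def find_block_offsets(data):
    """Return the bit offsets (from the start of data) of each block marker."""
    offsets = []
    window = 0       # the most recent 48 bits seen
    bits_seen = 0
    for byte in data:
        for shift in range(7, -1, -1):          # most significant bit first
            window = ((window << 1) | ((byte >> shift) & 1)) & MASK48
            bits_seen += 1
            if bits_seen >= 48 and window == BLOCK_MAGIC:
                offsets.append(bits_seen - 48)  # offset where the marker begins
    return offsets


if __name__ == "__main__":
    sample = bz2.compress(b"hello world")
    # The 4-byte "BZh9" stream header comes first, then the first block.
    print(find_block_offsets(sample))
```

A bit-by-bit loop like this in pure Python is slow on large inputs, which is roughly the argument for doing this part in C. The marker could in principle also occur by chance inside compressed data, so a hit is a candidate boundary rather than a guarantee.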


Version: unspecified
Severity: enhancement

Details

Reference
bz27126

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 11:23 PM)
bzimport set Reference to bz27126.

Is this about dbzip2 or something else? There are several other projects that do pretty much what dbzip2 did, and there should be some that are better maintained and better-performing these days...

Ah no, specifically I mean my Python bzip2 stuff, which I will shortly be using to do things like find the last pageid in a truncated bzip2 history file by seeking to near the end of the file and grabbing it. It works, OK... but it is gross.
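A rough sketch of the seek-near-the-end idea, under stated assumptions and not the actual code in question: read a tail chunk of the file, locate the last two block markers in it, and wrap the bits between them as a standalone one-block bzip2 stream that the standard `bz2` module can decompress. This mirrors what `bzip2recover` does. All function names are hypothetical, and the "BZh9" header is used because level 9 is the largest block size and so accepts blocks written at any level.

```python
import bz2

BLOCK_MAGIC = 0x314159265359   # start-of-block marker (48 bits)
END_MAGIC = 0x177245385090     # end-of-stream marker (48 bits)


def find_magic_bits(data, magic=BLOCK_MAGIC):
    """Return every bit offset in data where the 48-bit magic occurs."""
    if not data:
        return []
    # Render the chunk as a string of '0'/'1' so str.find can search per bit.
    bits = format(int.from_bytes(data, "big"), "0{}b".format(len(data) * 8))
    pattern = format(magic, "048b")
    found, pos = [], bits.find(pattern)
    while pos != -1:
        found.append(pos)
        pos = bits.find(pattern, pos + 1)
    return found


def rebuild_block(data, start_bit, end_bit):
    """Wrap the block at bits [start_bit, end_bit) as a complete bzip2 stream."""
    nbits = end_bit - start_bit
    if nbits < 80:
        raise ValueError("too short to hold a block marker and CRC")
    whole = int.from_bytes(data, "big")
    block = (whole >> (len(data) * 8 - end_bit)) & ((1 << nbits) - 1)
    crc = (block >> (nbits - 80)) & 0xFFFFFFFF   # 32 bits right after the marker
    # Trailer: end marker plus stream CRC, which equals the block CRC
    # when the stream holds exactly one block.
    stream = (block << 80) | (END_MAGIC << 32) | crc
    stream_bits = nbits + 80
    pad = -stream_bits % 8                        # zero-pad to a byte boundary
    body = (stream << pad).to_bytes((stream_bits + pad) // 8, "big")
    return b"BZh9" + body


def last_complete_block(tail):
    """Rebuild the last block in tail that is followed by another block marker."""
    marks = find_magic_bits(tail)
    if len(marks) < 2:
        raise ValueError("need at least two block markers in the tail")
    return rebuild_block(tail, marks[-2], marks[-1])
```

In use, `tail` would be the last few megabytes of the truncated file (obtained with `seek` and `read`), and `bz2.decompress(last_complete_block(tail))` gives the text of the final intact block, which can then be searched for the last page id. For a file that is not truncated, the final block ends at the end-of-stream marker instead, so the second boundary would have to come from `find_magic_bits(tail, END_MAGIC)`.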

However! Please clue me in if there are parallel bzip2 projects further along than dbzip2; we might be looking at them for other reasons.

Parallel bzip2 (http://compression.ca/pbzip2/) might be interesting to speed up the compression process.

Well, there is a whole project lying around for this; please see:

http://www.mediawiki.org/wiki/Dbzip2

I've been looking for suckers^Wvolunteers to poke at it...

Done. It is not perfect, but it is a huge improvement over the Python stuff I had.