
Add title index to backup dumps
Closed, ResolvedPublic

Description

There are several readers available for MediaWiki xml.bz2 dumps; some can read the native format directly, while others transform the data.

All suffer from the lack of an index into this data, which is a major barrier to development and to adoption by users.

The simplest remedy would be to register a dump filter which creates a text file mapping article title -> byte offset. If this is done during the backup process, there is almost no resource overhead.
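For illustration only, here is a minimal sketch of how a reader might consume such an index, assuming a hypothetical tab-separated "offset<TAB>title" line format (the exact format is left open here):

    def load_index(path):
        """Load a title -> byte-offset map from a tab-separated index file
        whose lines look like '1024<TAB>Main Page' (hypothetical format)."""
        index = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                offset, title = line.rstrip("\n").split("\t", 1)
                index[title] = int(offset)
        return index

    # e.g. load_index("pages-articles-index.txt").get("Main Page") -> 1024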

I can write a patch if other developers agree this would be a worthwhile pursuit.


Version: unspecified
Severity: enhancement

Details

Reference
bz27618

Event Timeline

bzimport raised the priority of this task to High. Nov 21 2014, 11:27 PM
bzimport set Reference to bz27618.

(In reply to comment #0)

> The simplest remedy would be to register a dump filter which creates a text file mapping article title -> byte offset. If this is done during the backup process, there is almost no resource overhead.
>
> I can write a patch if other developers agree this would be a worthwhile pursuit.

I'm interested. CCing Ariel for input and assigning to you. Let's have a patch!

How will this work for runs that do parts in parallel? I still don't know whether those pieces should be recombined later, but at present we are running on the assumption that they should be. Not a big issue; it just means you'll need to write a little script to recalculate the byte offsets for the combined dump when that phase runs, keeping track of the bit alignment so the page start byte in later pieces comes out right (sketched below).

This would be handy for a number of things actually, so I'd like to see it happen.
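A minimal sketch of that recombination script, under two large assumptions: each part ships its own "offset<TAB>title" index, and the combined dump is produced by byte-for-byte concatenation of the parts, so an offset in part N just shifts by the total size of the preceding parts. The bzip2 bit-alignment issue mentioned above is not handled here; file names and the index format are hypothetical.

    import os

    def merge_indexes(parts, merged_index_path):
        """parts: [(dump_part_path, index_part_path), ...] in concatenation order.
        Writes a combined index whose offsets point into the concatenated dump."""
        base = 0
        with open(merged_index_path, "w", encoding="utf-8") as out:
            for dump_path, index_path in parts:
                with open(index_path, encoding="utf-8") as f:
                    for line in f:
                        offset, title = line.rstrip("\n").split("\t", 1)
                        out.write(f"{int(offset) + base}\t{title}\n")
                base += os.path.getsize(dump_path)  # shift for the next part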

Interesting--
Also, the byte offsets are of course into the compressed data (ftell(STDOUT)), and the boundaries between bz2 chunks also become very relevant.
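One way around both points is to only ever record offsets that fall on a bz2 stream boundary, i.e. flush and restart the compressor every N pages. A sketch in Python, standing in for whatever the dump code itself would do; `pages` is a placeholder list of (title, xml) pairs and the chunk size is arbitrary:

    import bz2

    PAGES_PER_STREAM = 100  # arbitrary trade-off: smaller = bigger index, faster seeks

    def write_multistream(pages, dump_path, index_path):
        """Write pages as concatenated, self-contained bz2 streams and record the
        byte offset of each stream boundary for every page it contains."""
        with open(dump_path, "wb") as dump, \
                open(index_path, "w", encoding="utf-8") as idx:
            for i in range(0, len(pages), PAGES_PER_STREAM):
                chunk = pages[i:i + PAGES_PER_STREAM]
                boundary = dump.tell()          # this is the ftell() the index stores
                for title, _xml in chunk:
                    idx.write(f"{boundary}\t{title}\n")
                payload = "".join(xml for _title, xml in chunk).encode("utf-8")
                dump.write(bz2.compress(payload))   # one complete bz2 stream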

Thanks, I'll have a patch for review this week!

Created attachment 8310
ROUGH

Not much to show yet, but in case someone wants to lend a hand...
My intention is that:

  • each backup job records the arguments with which it was invoked
  • an index entry is recorded for each page, giving its offset into the compressed data being generated

Problems:

  1. there is no convention for saving to a second file stream (the index file)
  2. the PHP bz2 library does not expose the libbz2.so "tell" function, nor could that function work without flushing buffers. Perhaps the recorded offset could instead be expressed as a bz2 chunk plus an uncompressed offset within that chunk.
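A reader-side sketch of the chunk-plus-uncompressed-offset idea from problem 2, assuming the dump is a series of self-contained bz2 streams and the index stores (chunk_start, uncompressed_offset) pairs; all names here are hypothetical:

    import bz2

    def read_at(dump_path, chunk_start, uncompressed_offset, length):
        """Return `length` uncompressed bytes located `uncompressed_offset` bytes
        into the bz2 stream that begins at byte `chunk_start` of the dump."""
        decomp = bz2.BZ2Decompressor()
        out = b""
        with open(dump_path, "rb") as f:
            f.seek(chunk_start)                  # stream boundaries are seekable
            while not decomp.eof and len(out) < uncompressed_offset + length:
                block = f.read(64 * 1024)
                if not block:
                    break
                out += decomp.decompress(block)
        return out[uncompressed_offset:uncompressed_offset + length]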

Attached:

*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*

sumanah wrote:

Adding the need-review keyword because my impression is that Adam wanted other developers to check his approach and give feedback. Thanks for the patch, Adam!

I like this idea and I think two things need to be added to this patch:

  1. Currently only the title is written to the index file; the entry should also include the namespace, or use the page_id instead of the title.
  2. As Ariel mentioned, we are generating the dumps in multiple parts, so the index should also keep track of which file the article can be found in.
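Something like the following record would cover both points; the field names and tab-separated layout are only illustrative:

    from typing import NamedTuple

    class IndexEntry(NamedTuple):
        part_file: str   # which dump part the page ended up in
        offset: int      # byte offset within that part
        page_id: int
        namespace: int
        title: str

    def parse_entry(line: str) -> IndexEntry:
        """Parse 'part<TAB>offset<TAB>page_id<TAB>namespace<TAB>title'."""
        part, offset, page_id, ns, title = line.rstrip("\n").split("\t", 4)
        return IndexEntry(part, int(offset), int(page_id), int(ns), title)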

Best,

Diederik

Out of curiosity, what do the various bz2 offline readers need: a byte offset, a byte plus bit offset, or a bzip2 boundary and offset?

I expect the offline readers don't really use namespace or page ids for anything, so adding the full page title (i.e. namespace:title) should suffice. If we're talking only about things in the main article space then it doesn't matter at all (but what about images?)...

Keisial wrote:

I used bzip2 boundary + title hash.
If your index is 315 MB then, even dropping the ability to perform random search, you will hardly be efficient on a consumer PC with maybe just 512 MB of RAM.
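A rough sketch of that scheme: keep only a 64-bit hash of each title next to its bzip2 boundary offset, sorted so lookups are a binary search over roughly 16 bytes per page instead of a dict of full title strings. The hash choice and sizes are illustrative, not what Keisial's reader actually does:

    import bisect
    import hashlib

    def title_hash(title: str) -> int:
        """Stable 64-bit hash of a page title (any good hash would do)."""
        return int.from_bytes(hashlib.sha1(title.encode("utf-8")).digest()[:8], "big")

    class HashIndex:
        def __init__(self, pairs):              # pairs: [(hash, bz2_boundary), ...]
            pairs.sort()
            self.hashes = [h for h, _ in pairs]
            self.offsets = [o for _, o in pairs]

        def lookup(self, title: str):
            """Return the bzip2 boundary offset for `title`, or None."""
            h = title_hash(title)
            i = bisect.bisect_left(self.hashes, h)
            if i < len(self.hashes) and self.hashes[i] == h:
                return self.offsets[i]
            return None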

sumanah wrote:

Adam, do you now have enough code review to revise your patch against current MediaWiki trunk? Thank you!

I like Ariel's solution in r107870 and r107839; are there plans to enable the multistream buildindex job on all dumps?
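For reference, the published pages-articles-multistream dumps pair the data file with an index of "offset:page_id:title" lines, where the offset is the byte position of the bz2 stream containing that page. Assuming that format, extracting a single page looks roughly like this (a sketch, not the code from r107870/r107839):

    import bz2

    def extract_page(multistream_path, stream_offset, title):
        """Decompress the single bz2 stream starting at `stream_offset` and
        return the <page>...</page> element whose <title> matches, if any."""
        decomp = bz2.BZ2Decompressor()
        chunks = []
        with open(multistream_path, "rb") as f:
            f.seek(stream_offset)
            while not decomp.eof:
                block = f.read(256 * 1024)
                if not block:
                    break
                chunks.append(decomp.decompress(block))
        xml = b"".join(chunks).decode("utf-8")
        needle = f"<title>{title}</title>"
        for fragment in xml.split("</page>"):
            if needle in fragment:
                start = fragment.find("<page")
                if start != -1:
                    return fragment[start:] + "</page>"
        return None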

Yes, but it's buggy. I need to get a bit of other crap off my plate and fix it first; then after a couple of stable runs I'll shove it out the door to the other projects.

This was enabled on all wikis quite some time back, so closing :-)