
Dumps should be incremental
Open, Medium, Public, Feature

Description

Author: tedks

Description:
I'm a user who wants to have a complete dump of English Wikipedia (or any other large wiki-project).

I could download a complete dump every time one was made, but that would mean spending a lot of bandwidth re-downloading mostly the same set of files, which would be expensive for both me and the Wikimedia Foundation.

The easiest solution that I (in my total ignorance) can see is having a base archive, and then releasing diff archives that contain only the changed/added files (like a duplicity backup). These incremental archives would be a small fraction of the full dump in terms of space and bandwidth, and would make keeping a dump current much easier.


Version: unspecified
Severity: enhancement

Details

Reference
bz28956

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:30 PM
bzimport set Reference to bz28956.

Re-assigning to Ariel as she is the one responsible for backups. Ariel, thoughts?

Yes, I actually have a number of thoughts on this, but first we have to deal with the "political" side of the issue: revisions get deleted or oversighted for a reason, and if we only produce incremental dumps on a regular basis, those revisions don't get removed from what's produced. At least, they wouldn't with the existing system. We should talk about the consequences of that. We might be dealing with copyrighted material which has since been removed, or with information that identifies a user; those are the two big cases in my mind.

The way the dumps are supposed to work is that eventually we stop making the old copies public; space is reused, and so downloaders pick up the new files. Of course, if someone wanted to keep a copy of the old files they could, but in practice that doesn't happen for the en dumps, as we saw several months ago when we had the server outage.

On the one hand it's sort of like security through obscurity; we're relying on good will and inconvenience more than anything else to make the system work. OTOH it's maybe better than ignoring the issue. Thoughts?

We would still need to generate fulls of course on a regular basis, and I am guessing that we would need to provide a script that would merge incrementals with fulls, since page moves and deletion information would need to be included in the incrementals.
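To make that merge step concrete, here is a toy sketch in Python of what applying an incremental to a full copy would involve; the data layout is hypothetical and this is not any existing dump tool. New and changed revisions are added, while deletions and page moves carried in the incremental are applied to the full set.

def merge_incremental(full, incremental):
    """Apply one incremental to a full dump held in memory.

    full:        {page_title: {rev_id: text}}            (hypothetical layout)
    incremental: {"moves": [(old_title, new_title)],
                  "deleted_pages": [titles],
                  "deleted_revisions": {title: [rev_ids]},
                  "revisions": {title: {rev_id: text}}}
    """
    # Page moves: carry the existing revisions over to the new title.
    for old_title, new_title in incremental.get("moves", []):
        if old_title in full:
            full[new_title] = full.pop(old_title)

    # Deletions: drop whole pages or individual (e.g. oversighted) revisions.
    for title in incremental.get("deleted_pages", []):
        full.pop(title, None)
    for title, rev_ids in incremental.get("deleted_revisions", {}).items():
        for rev_id in rev_ids:
            full.get(title, {}).pop(rev_id, None)

    # Adds/changes: new pages and new revisions of existing pages.
    for title, revs in incremental.get("revisions", {}).items():
        full.setdefault(title, {}).update(revs)
    return full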

(In reply to comment #2)

On the one hand it's sort of like security through obscurity; we're relying on
good will and inconvenience more than anything else to make the system work.
OTOH it's maybe better than ignoring the issue. Thoughts?

I see your concerns. Where should this be discussed?

We would still need to generate fulls of course on a regular basis,

My thought, probably naive, is that the diffs between these fulls may be sufficient for Ted's original request. Probably not strictly source-code diffs (though now I am curious), but maybe a log of changes could be created that would be more compact than the full dump yet still allow a person to bring the last image they had up to date with the latest.

Hrmm... I smell a project if no one else has tried this yet.

Since November 2011, incremental dumps (also known as add/change dumps) are available at http://dumps.wikimedia.org/other/incr/.

However, this feature is still marked as experimental, so there is no guarantee that it works fully.

Marking this bug as resolved.

Actually the "incrementals" are adds/changes dumps, as documented at http://wikitech.wikimedia.org/view/Dumps/Adds-changes_dumps

A bit more needs to be done before I would consider them equivalent to incrementals, so reopening.

Ariel has proposed this feature request as a Google Summer of Code Project at
http://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#XML_dumps

We have accepted it and a shorter version is now listed at https://www.mediawiki.org/wiki/Summer_of_Code_2013#Incremental_data_dumps

Pasting here Ariel's recommendation for an implementation, just in case:

This could be achieved by designing the right output format for the XML files containing text for all revisions. It would need: a smart choice for compressing multiple items together, an index into the compressed blocks, a way to remove content quickly (possibly leaving zeroed blocks behind), and a way to re-use empty blocks. To use the new archive format, we would need tools to convert to bz2 or 7z (so users can keep all their existing scripts for the dumps), a format for storing isolated sets of changes (so dump users can download just these sets), and a script to apply those changes to the above format (so users can run the script against the change set and their full dump to update their copy). It would likely need to take as input an XML file of new pages and new revisions for old pages, as well as a list of pages and/or revisions that have been deleted in the meantime. This would entail no changes to MediaWiki core; all of the work would be done by a separate set of tools.
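For illustration only, a toy sketch of those ideas (not the actual format): the core is independently compressed blocks plus an index, so individual items can be located, zeroed out, their slots re-used, and a change set applied in place.

import bz2

class BlockArchive:
    """Toy model of the proposed format; real dumps would pack many
    revisions per compressed block and keep the index on disk."""

    def __init__(self):
        self.blocks = []   # compressed payloads, or None for a zeroed slot
        self.index = {}    # rev_id -> block slot
        self.free = []     # zeroed slots available for re-use

    def add(self, rev_id, text):
        data = bz2.compress(text.encode("utf-8"))
        slot = self.free.pop() if self.free else len(self.blocks)
        if slot == len(self.blocks):
            self.blocks.append(data)
        else:
            self.blocks[slot] = data
        self.index[rev_id] = slot

    def get(self, rev_id):
        return bz2.decompress(self.blocks[self.index[rev_id]]).decode("utf-8")

    def remove(self, rev_id):
        # Remove content quickly, leaving a zeroed block behind for re-use.
        slot = self.index.pop(rev_id)
        self.blocks[slot] = None
        self.free.append(slot)

    def apply_changes(self, changes):
        # changes: {"deleted": [rev_ids], "added": {rev_id: text}}
        for rev_id in changes.get("deleted", []):
            if rev_id in self.index:
                self.remove(rev_id)
        for rev_id, text in changes.get("added", {}).items():
            self.add(rev_id, text)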

A brain dump of my thoughts is here: http://www.mediawiki.org/wiki/User:ArielGlenn/Dumps_new_format_%28deltas,_changesets%29 (not meant to be binding in any way; some of it is likely to be pure crapola too).

Just a note to say that user Wywin has applied to GSoC with a proposal related to this report. Good luck!

https://www.mediawiki.org/wiki/User:Wywin

So I had (not knowing about Wywin's) also started working on a proposal for this project too... [[mw:User:Legoktm/a]] (still in drafting).

Competition is good! Having more students after a single project idea might look like having a lower chance of being accepted. However, it also shows a high level of interest in the idea itself and therefore helps us promote it among the other proposed ideas.

All this to say: keep working on a great proposal and good luck!

PS: this is one of the reasons why we encourage candidates to share their plans in the community channels as soon as possible:
https://www.mediawiki.org/wiki/Mentorship_programs/Application_template

This project is starting to get crowded: I have also added my own proposal: [[mw:User:Svick/Incremental dumps]].

Just a note to say that Svick has submitted his proposal officially.

Legoktm, you are encouraged to apply via the GSoC tool asap.

Ariel, please sign up as a possible mentor for all these proposals in the GSoC tool. We are recommending two co-mentors per project, based on previous experience. All the better if a second co-mentor joins.

I found Svick's proposal via wikitech-l and highlighted on the talk page a hope that this would support remote incrementals via the API. See the comment at

https://www.mediawiki.org/wiki/User_talk:Svick/Incremental_dumps#Consider_remote_backups_as_well_26965

I'm sharing it here so others can see and comment as well. Allowing remote differentials would let services (like WikiApiary, Archiveteam and others) provide a robust backup service to large numbers of wikis.

Just a note to say that Shao Hong has submitted a GSoC proposal related with this report: https://www.mediawiki.org/wiki/User:Shaohong

(In reply to comment #10)

So I had (not knowing about Wywin's) also started working on a proposal for
this project too... [[mw:User:Legoktm/a]] (still in drafting).

Just a note to confirm that Legoktm has submitted a GSoC proposal related to this report: https://www.mediawiki.org/wiki/User:Legoktm/GSoC_2013

And yes, we have received Jeremy's proposal as well.

Good luck to all candidates!

questpc wrote:

Why not just select a revision.rev_timestamp range to dump only revisions created during some time interval (say, a day or a week)?
It should not be too hard to implement CLI options for maintenance/dumpBackup.php to select timestamp ranges. Then such dumps could be made by Wikimedia.
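For what it's worth, the selection being described boils down to a query like the one in this sketch. MediaWiki stores rev_timestamp as a 14-character YYYYMMDDHHMMSS string; the helper itself is hypothetical, not an existing dumpBackup.php option.

from datetime import datetime, timedelta

def daily_revision_query(day):
    """Return the revision selection for one calendar day (illustration only)."""
    start = day.strftime("%Y%m%d") + "000000"
    end = (day + timedelta(days=1)).strftime("%Y%m%d") + "000000"
    return ("SELECT rev_id, rev_page, rev_timestamp FROM revision "
            "WHERE rev_timestamp >= '{0}' AND rev_timestamp < '{1}' "
            "ORDER BY rev_id".format(start, end))

print(daily_revision_query(datetime(2013, 5, 4)))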

Just a note to say that Sanja Pavlovic has submitted a GSoC proposal related to this report: https://www.mediawiki.org/wiki/User:Sanja_pavlovic/GSOC/OPW_application

Good luck!

questpc wrote:

Even full dumps could be written into separate files per day or per week, so the dump operation would be full and incremental at the same time. It would produce a multi-file dump; however, maintenance/importDump.php could also be modified to import from such multi-file dumps.

The biggest problem is the slowness of the XML dumps, so SQL dumps should also be created in the same way.

(In reply to comment #20)

The biggest problem is the slowness of the XML dumps, so SQL dumps should also be
created in the same way.

If I understand you correctly, you're suggesting that the text revisions be dumped using e.g. mysqldump in order to make them faster. While the production of the XML dumps for WMF projects is very slow for large projects, using mysqldump isn't feasible, for a few reasons:

  • Text revisions live in external storage clusters, in separate databases and tables. Older revisions might live in a different cluster than newer ones. For any given revision, the way to find out where the text content is stored is to check the pointer in the wiki's text table (see the sketch after this list).
  • Some text revisions are hidden from public view (deleted or oversighted) and should not be included in the dumps.
  • The dumps carry all of the metadata that should accompany the text of each page, for bot operators, researchers and importers alike. This is a convenience measure more than anything else, but a very popular one. Of course, if there were some other proposal for packaging the metadata in the glorious new dump format to come, this issue could be addressed.
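A rough sketch of the indirection in the first point: the field names match the MediaWiki text table schema, but the lookup itself is heavily simplified and the cluster address is a made-up example.

def resolve_text_address(old_flags, old_text):
    """Decide where the content of a revision actually lives (illustration only)."""
    flags = old_flags.split(",")
    if "external" in flags:
        # old_text is not the content but a pointer such as "DB://cluster24/12345"
        # into one of the external storage clusters.
        store, rest = old_text.split("://", 1)
        cluster, blob_id = rest.split("/", 1)
        return {"store": store, "cluster": cluster, "blob_id": blob_id}
    # Otherwise the (possibly compressed) text is stored locally in old_text.
    return {"store": "local", "text": old_text}

print(resolve_text_address("external,utf-8", "DB://cluster24/12345"))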

questpc wrote:

Sure, Wikimedia installations are very special, so SQL dumps are out of the question for them.

But actually my idea is quite different. I suggest splitting XML dumps into daily files automatically during the backup/import process:
wpen_2013-05-04.xml
wpen_2013-05-05.xml
wpen_2013-05-06.xml

Of course, once a full day has passed and its dump file already exists, such a daily dump should not be re-created, just quickly skipped (see the sketch after the file listings below).

In case there are too many XML files, one could either use a nested directory tree:
wpen/2013/05/04.xml
wpen/2013/05/05.xml
wpen/2013/05/06.xml

or perform weekly dumps (numbered by week of the year):
wpen_2013-01.xml
wpen_2013-02.xml
...
wpen_2013-52.xml
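A minimal sketch of the skip logic described above, using the nested-directory layout; the paths and the helper function are hypothetical.

import os
from datetime import date, timedelta

def missing_daily_dumps(base_dir, wiki, first_day, last_day):
    """Yield (day, path) only for days whose dump file does not exist yet,
    so completed days are skipped instead of being re-dumped."""
    day = first_day
    while day <= last_day:
        path = os.path.join(base_dir, wiki, day.strftime("%Y/%m/%d.xml"))
        if not os.path.exists(path):
            yield day, path
        day += timedelta(days=1)

for day, path in missing_daily_dumps("/dumps", "wpen", date(2013, 5, 4), date(2013, 5, 6)):
    print("would dump", day, "to", path)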

(In reply to comment #22)

Are you aware of the adds/changes dumps, which are basically a daily dump of new revisions (without, however, any notification about deletions etc.)?

questpc wrote:

No, I didn't know one could perform such dumps. Which options of maintenance/dumpBackup.php are used?

pass 1: php -q dumpBackup.php --wiki=somewikiorother --stub --quiet --force-normal --output=gzip:somestubname.xml.gz --revrange --revstart somerevnum --revend otherrevnum

pass 2: php -q dumpTextPass.php --wiki=somewikiorother --stub=gzip:somestubname.xml.gz --force-normal --quiet --spawn=php --output=bzip2:somerevisioncontentname.xml.bz2

The trick is to have the starting and ending revision ids for your range. Do note again that this does not address deleted/hidden revisions etc.
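To give an idea of how those boundaries might be tracked between runs, here is a small sketch; the state file and the function are hypothetical, not part of the adds/changes scripts.

import json
import os

STATE_FILE = "maxrevid.json"  # hypothetical: last revision id covered, per wiki

def next_revision_range(wiki, current_max_revid):
    """Return (revstart, revend) for the next adds/changes run and record
    current_max_revid so the following run continues from there."""
    state = {}
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            state = json.load(f)
    revstart = state.get(wiki, 0) + 1
    state[wiki] = current_max_revid
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)
    return revstart, current_max_revid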

questpc wrote:

It's not fully automated. I propose a solution where the revision range is determined automatically by the script itself, dumps are automatically split into daily/weekly files, and importDump.php can also import such files.

If I had more time and were not in such an extreme shortage of money I'd write such a patch, but unfortunately I can't. I am not well enough to become a Wikimedia developer, while MediaWiki freelancing here in Russia turned out to be a financial disaster (there are not many MediaWiki-related jobs here). So currently I have turned to coding for other frameworks and do not know whether I will return. But I still read Wikitech.

I have now started working on this; for more information and updates, see [[mw:User:Svick/Incremental dumps]].

GSoC is over and the code is mostly done (repo operations/dumps/incremental, branch gsoc). There are some remaining bugs (bug 64633) and TODOs (https://www.mediawiki.org/wiki/User:Svick/Incremental_dumps/TODO). After that is done, the code should be ready for production.

Typo in last comment: the correct link for remaining bugs is bug 54633.

If you have open tasks or bugs left, one possibility is to list them at https://www.mediawiki.org/wiki/Google_Code-In and volunteer yourself as mentor.

We have heard from Google and from free software projects participating in Code-in that students participating in this program have done great work finishing and polishing GSoC projects, many times mentored by the former GSoC student. The key is to be able to split the pending work into small tasks.

More information is on the wiki page. If you have questions, you can ask there or contact me directly.

Moving this to the Dumps-Rewrite project; no further work on this will be done on the current dumps.

Aklapper added a subscriber: ArielGlenn.

@ArielGlenn: Hi, I'm resetting the task assignee due to inactivity. Please feel free to reclaim this task if you plan to work on this - it would be welcome! Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for more information - thanks!

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 12:24 PM
Aklapper removed a subscriber: Tfinc.