
Provide dumps using bittorrent
Open, Medium, Public, Feature

Description

Even without citing stats, these huge files demand multisourcing: either over HTTP using mirrors or, even better, using BitTorrent. I hear this would dramatically reduce the bandwidth demand on the primary servers.

BitTorrent is particularly nice because files can be selectively downloaded from within the bundle. You could provide a single torrent containing all outputs from a particular wiki snapshot date.
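For context on why selective download works: a multi-file torrent's metadata lists every file in the bundle separately, so a client can mark individual entries as wanted or skipped. A minimal sketch of the relevant metainfo structure, in Python for illustration (the field names follow BEP 3; the wiki, date, and file names are hypothetical):

```python
# Sketch of the "info" dictionary in a multi-file .torrent (per BEP 3).
# A client can fetch only the pieces covering the files it wants.
# The wiki/date/file names here are hypothetical examples.
info = {
    "name": "enwiki-20110101",          # top-level directory for the bundle
    "piece length": 2 ** 22,            # bytes per piece (4 MiB here)
    "pieces": b"...",                   # concatenated 20-byte SHA-1 piece hashes
    "files": [
        {"length": 123456789, "path": ["enwiki-20110101-pages-articles.xml.bz2"]},
        {"length": 23456789,  "path": ["enwiki-20110101-stub-meta-history.xml.gz"]},
        # ... one entry per dump output file
    ],
}
```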


Version: unspecified
Severity: enhancement

Details

Reference
bz27653

Related Objects

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:30 PM
bzimport set Reference to bz27653.

http://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps

As for the BitTorrent part, that would be somewhat feasible, having the tracker on WMF, but seeding from WMF might be more of an issue.

This is not an area I know much about, but what is the objection to seeding? I imagine you will get the maximum benefit by using an open tracker which is already tied into search services. And if your mirrors agree to use this protocol, they would provide a natural pool of seeders, even before they have finished replicating.

One major downside of the torrent idea is that it would be inefficient to offer incomplete dumps: a .torrent embeds a hash of every piece of the data, so it would have to be regenerated each time the data grows. Unless there is a workaround, it would only make sense to wait until the dump is completed, by which point the data has aged...
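Concretely, a torrent's identity (its info hash) is the SHA-1 of the bencoded info dictionary, which in turn embeds the hash of every piece of the payload, so appended data produces a brand-new torrent. A minimal sketch, assuming the third-party bencode.py package:

```python
import hashlib
import bencodepy  # third-party: pip install bencode.py

def info_hash(torrent_path: str) -> str:
    """The torrent's identity: SHA-1 over the bencoded 'info' dict."""
    meta = bencodepy.decode(open(torrent_path, "rb").read())
    return hashlib.sha1(bencodepy.encode(meta[b"info"])).hexdigest()

# Because info["pieces"] covers every byte of the payload, a dump that
# grows after the .torrent is made yields a different info hash, i.e.
# a different torrent that has to be re-announced and re-fetched.
```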

Once the dump is available, there is nothing preventing someone in the community (or several someones) from setting up a torrent of these files, and I encourage folks to do so (as has been done a number of times in the past).

Waiting until the dump is completed before adding it to a torrent is a good idea in all cases; only then are we sure that the files are intact and worth your while to download.

Folks that have talked with us about setting up a mirror site have expressed a preference for rsync, and that works best for us for distributing a subset of the dumps for mirroring.

Per Ariel's comment I am closing this bug. Either set up your own torrent or ask for rsync access.

Legoktm subscribed.

I'm re-opening this task because I think providing torrents is something that should be reasonably easy to integrate, and will provide enough benefits to users and the dumps infrastructure to be worth the amount of work.

I've been running https://tools.wmflabs.org/dump-torrents/ for a little over a year now; it creates torrent files for all the dumps mirrored to Toolforge. It has started to run into performance problems with the NFS setup, though, so I've paused it for now - that's the main instigator in suggesting a move to production.

With web seeds, downloaders will automatically download from multiple mirrors, distributing the bandwidth required from any single host and getting faster speeds. In addition, torrents have built-in integrity checking (sketched below). I doubt many people will reseed torrents faster than any of the web seeds, but if it happens, it would be a nice bonus.
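For illustration, that built-in integrity checking amounts to hashing each piece of the payload with SHA-1 and comparing against the digests embedded in the .torrent. A minimal sketch for a single-file torrent, again assuming the third-party bencode.py package (the file names in the usage note are hypothetical):

```python
import hashlib
import bencodepy  # third-party: pip install bencode.py

def verify(torrent_path: str, data_path: str) -> bool:
    """Check a downloaded file against the piece hashes in its .torrent."""
    meta = bencodepy.decode(open(torrent_path, "rb").read())
    info = meta[b"info"]
    piece_len = info[b"piece length"]
    hashes = info[b"pieces"]  # concatenated 20-byte SHA-1 digests

    with open(data_path, "rb") as f:
        index = 0
        while True:
            piece = f.read(piece_len)  # the final piece may be shorter
            if not piece:
                break
            expected = hashes[index * 20:(index + 1) * 20]
            if hashlib.sha1(piece).digest() != expected:
                return False  # corrupted or truncated piece
            index += 1
    return index * 20 == len(hashes)  # all pieces accounted for

# Hypothetical usage:
# verify("enwiki-20110101-pages-articles.xml.bz2.torrent",
#        "enwiki-20110101-pages-articles.xml.bz2")
```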

It's also super trivial (read: no major CPU resources needed) to add new web seeds/trackers to existing torrents once they've been created (source). Here is the mktorrent configuration I've been using, which seems to work reasonably well for most files, though we might want to consider adjusting the piece length based on the size of the dump.
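As a rough illustration of the knobs involved, here is a minimal sketch of an mktorrent invocation that scales the piece length with dump size so the piece count stays manageable; the tracker and web-seed URLs and the 1500-piece target are assumptions, not the configuration referenced above:

```python
import math
import os
import subprocess

def piece_length_exponent(size_bytes: int) -> int:
    """Pick n so pieces are 2**n bytes, aiming for roughly 1500 pieces,
    clamped to a sensible range (256 KiB .. 16 MiB)."""
    target = max(size_bytes // 1500, 1)
    return min(max(math.ceil(math.log2(target)), 18), 24)

def make_torrent(dump_path: str, out_path: str) -> None:
    # The tracker and web-seed URLs below are hypothetical placeholders.
    subprocess.run(
        [
            "mktorrent",
            "-l", str(piece_length_exponent(os.path.getsize(dump_path))),
            "-a", "udp://tracker.example.org:6969/announce",
            "-w", "https://dumps.wikimedia.org/",        # web seed (BEP 19)
            "-w", "https://mirror.example.net/dumps/",   # a second mirror
            "-o", out_path,
            dump_path,
        ],
        check=True,
    )
```

The `-l` exponent is the only parameter that really needs to vary per file; everything else can stay fixed across a dump run.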

I'm happy to work on integrating my existing code into the dumps infrastructure given a few pointers.

tl;dr: the win from torrents is automatically using multiple web seeds and distribution of load, not necessarily P2P.

This shouldn't run on the snapshot (dumps-generating) hosts; if it were to run anywhere it would run on the web server. Looping in @Bstorm, who is now the point person for the labstore boxes (which handle web service).

Can I hear a little about the performance problems you have been running into?

This shouldn't run on the snapshot (dumps-generating) hosts; if it were to run anywhere it would run on the web server.

Hmm, why wouldn't those hosts be the right place to call mktorrent? It can be CPU intensive, so I don't think running it on a web server is a good idea.
(I have very little understanding of how the actual dumps generating process works fwiw)

Can I hear a little about the performance problems you have been running into?

I know that it was overloading NFS, @Bstorm is the one who has more details on the specifics :)

Yep, it shot the NFS server up to a load average of 20 all on its own. It didn't pin the CPU of the NFS server itself; the load was all network- and IO-related.

This shouldn't run on the snapshot (dumps-generating) hosts; if it were to run anywhere it would run on the web server.

Hmm, why wouldn't those hosts be the right place to call mktorrent? It can be CPU intensive, so I don't think running it on a web server is a good idea.
(I have very little understanding of how the actual dumps generating process works fwiw)

Those hosts write/read dumps over NFS, so running mktorrent there would defeat the purpose of moving the torrent generation. The dumpsdata boxes (which provide the NFS filesystems) are also lighter-weight than the labstore boxes.

Aklapper added subscribers: ArielGlenn, Aklapper.

@ArielGlenn: Hi, I'm resetting the task assignee due to inactivity. Please feel free to reclaim this task if you plan to work on this - it would be welcome! Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for more information - thanks!

Aklapper changed the subtype of this task from "Task" to "Feature Request".Feb 4 2022, 12:24 PM
Aklapper removed a subscriber: Tfinc.