
Image tarball dumps on your.org are not being generated
Open, Medium, Public

Description

Image tarball dumps were being generated at http://ftpmirror.your.org/ on a roughly monthly basis from April 2012 to December 2012; the last complete set is from December 2012 [1]. A January 2013 post [2] reported a hardware issue at your.org. A July 2013 post [3] indicated that the hardware issue had been resolved, but that further progress required a new setup due to the recent Wikimedia datacenter move.

[1] http://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/20121201/
[2] http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-January/000665.html
[3] http://lists.wikimedia.org/pipermail/xmldatadumps-l/2013-July/000861.html


Version: unspecified
Severity: normal

Details

Reference
bz51001

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 2:01 AM
bzimport set Reference to bz51001.

Ariel: Do you plan to take a look at this?

Ah yes, sorry for not responding; I'm working on scripts, independent of the specific media backend, to handle the rsync.

(In reply to Ariel T. Glenn from comment #2)

I'm working on scripts that are independent of the specific media backend

Any news? :)

Nemo_bis changed the task status from Open to Stalled. Apr 9 2015, 7:16 AM
Nemo_bis set Security to None.

Is this still blocked on the lack of an rsync daemon for Your.Org to use?

Aklapper renamed this task from "Image tarball dumps are not being generated" to "Image tarball dumps on your.org are not being generated". Nov 18 2016, 4:03 PM

Is this specifically about the tarballs or is http://ftpmirror.your.org/pub/wikimedia/images/ similarly affected? Given our tendency to lose image files (see T153565) it's pretty scary if there is no external backup for the files uploaded in the last 3 years.

These files are in the Swift filesystem so there are multiple copies of each file that is uploaded. There are no external copies of media uploaded since the move from a flat filesystem to Swift, afaik.

Swift copies are good for hardware errors but when there is a bug in the application code, all the copies get deleted (or, more likely, renamed to something that's hard to find).

I don't know if we'll bring back the tarballs, but I do have a stealth project to get the rsyncable directory structure updated again. Expect it to take a while. The script (in progress) lives off-site, since it would never be run on WMF servers: https://github.com/apergos/mw_media_sync

This would only sync media actually in use on the projects, and it will be slow to catch up once it's written and running.
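For context, the "rsyncable directory structure" mentioned here is the usual MediaWiki hashed upload layout, where the first one and two hex characters of the md5 of the stored filename pick the directory levels. Below is a minimal sketch, in Python, of mapping a stored filename to its public original URL under that assumption; the function name and the wrapper are illustrative and not taken from mw_media_sync.

```python
import hashlib
import urllib.parse

def original_url(filename, wiki="commons"):
    """Map a stored media filename (spaces already converted to underscores)
    to its public original URL, assuming the standard MediaWiki hashed
    layout: <md5[0]>/<md5[0:2]>/<filename>."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return (f"https://upload.wikimedia.org/wikipedia/{wiki}/"
            f"{digest[0]}/{digest[:2]}/{urllib.parse.quote(filename)}")

# Prints something like https://upload.wikimedia.org/wikipedia/commons/<x>/<xy>/Example.jpg
print(original_url("Example.jpg"))
```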

ArielGlenn changed the task status from Stalled to Open. Jun 26 2019, 1:49 PM
ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board.

Some notes on the architecture of the media sync scripts mentioned above:

  • The plan is that these would only ever run on some primary mirror, so that other mirrors and media end users could grab from there.
  • No additional local copy of the media would be kept on Wikimedia servers. Swift already has enough copies.
  • Only original (unscaled) versions of the media would be provided, at least at first.
  • Only media in use on a Wikimedia project would be provided, so the bulk of what is uploaded to Commons would not be synced. Plans should be made for a public mirror of all of Commons, but that's beyond the scope of these scripts.
  • Making lists of which files to delete locally and which to download, on a per-project basis, is something that can easily be done at the public mirror end, since we publish periodic lists of the images locally uploaded to each project and of the images in use on each project but housed on Commons.
  • It's likely much better to request these files from our caches than directly from the Swift backend, since there's the hope that some portion will already be cached. In particular, once we are caught up with the backlog (something that can be done slowly over time), new requests will be for newly uploaded files, which might well still be cached; I should check with Traffic about that.
  • The default wait time between retrievals is 5 seconds (configurable), with the plan to run this serially only, as a single instance (a rough sketch of this loop follows below). I should check with the Traffic folks about that too: is this overly cautious? Not cautious enough? The wait time is tuneable, and we could run in parallel by starting separate instances per wiki if desired. In past discussions one process and a short wait has seemed acceptable, but it's best to check in again now that this idea is being revived.
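A rough sketch of the per-project sync pass described in the notes above, under some assumptions: the published per-project lists have been preprocessed into a plain text file of stored filenames (one per line), originals are fetched through the upload.wikimedia.org cache frontend using the hashed layout from the earlier sketch, and files no longer in use are only reported rather than deleted. The `sync_project` name and the flat destination layout are illustrative; this is not the actual mw_media_sync implementation.

```python
import hashlib
import time
import urllib.parse
import urllib.request
from pathlib import Path

def original_url(name, wiki="commons"):
    # Same hashed layout as in the earlier sketch: md5[0]/md5[0:2]/name.
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return (f"https://upload.wikimedia.org/wikipedia/{wiki}/"
            f"{digest[0]}/{digest[:2]}/{urllib.parse.quote(name)}")

def sync_project(title_list, mirror_root, wiki="commons", wait=5.0):
    """One serial, throttled pass for a single project.

    `title_list` is assumed to be a plain text file with one stored
    media filename per line. Missing originals are fetched through the
    cache frontend, sleeping `wait` seconds between retrievals; files
    that are no longer in use are only reported, not deleted."""
    root = Path(mirror_root)
    root.mkdir(parents=True, exist_ok=True)
    wanted = set(Path(title_list).read_text().splitlines())
    have = {p.name for p in root.iterdir() if p.is_file()}

    for name in sorted(wanted - have):
        try:
            urllib.request.urlretrieve(original_url(name, wiki), str(root / name))
        except OSError as exc:
            print(f"failed to fetch {name}: {exc}")  # log and retry on a later pass
        time.sleep(wait)  # default 5 seconds between retrievals

    for name in sorted(have - wanted):
        print(f"candidate for local deletion: {name}")
```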

There is ongoing discussion about setting up offline backups of media originals at T262669. While public dumps are not a priority of that project (backups are), it would be silly not to consider the possibility of also generating them with a similar workflow. @ArielGlenn was invited to the discussion there, and other MediaWiki stakeholders will be asked as well (media, network, backups, operations, dumps).

If media tarballs become a real possibility, I may ask for broader input on what would be good exporting formats for reuse, but we are not yet there.

I wanted to update here that progress on media resiliency is ongoing, although no concrete promise can be made yet; this is a long-term project.

Whatever happened to media backups? Was an implementation decided on or even completed?

@ArielGlenn I just saw your question today; hopefully you saw the gradual updates at T262668 already :-D. Sadly, we had delays, mostly in the vendors providing the hardware needed for the backups (it is a lot of disks across 4 hosts!).

Commonswiki originals are already 75% backed up (it takes weeks to do the first full backup) on an offline MinIO cluster in eqiad (codfw next), and I expect to finish in a few days. I have yet to publish the final design and implementation of the backups.

As I think you knew (but commenting so others get the context): because of the specific needs expressed by the various stakeholders, and the nature of MediaWiki, we had to drop the idea of the backups also serving as public exports (solving both problems/tickets at the same time). But certainly what I learned and will document could be directly applied to the design of the exports, or the backups could even be the source of the exports! Open for discussion as soon as I finish all the initial full backups.

That's great to hear and I look forward to a future discussion!

Aklapper added subscribers: ArielGlenn, Aklapper.

@ArielGlenn: Hi, I'm resetting the task assignee due to inactivity. Please feel free to reclaim this task if you plan to work on this - it would be welcome! Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for more information - thanks!

I think all media files should be made available through IPFS. Then it would be easy to host a copy of the files, or to contribute to hosting part of a copy. You could pin the files you are interested in, and it would work like a torrent, except that it is dynamic: new files can be added as they are uploaded, and removed files can be unpinned by Wikimedia and either be hosted by others or eventually drop out of the IPFS network.

It could probably be arranged so that Wikimedia does not have to host the files twice, with IPFS serving the same files otherwise used for the web/API. This is something the people behind IPFS are thinking about as well, so it could align: https://filecoin.io/store/#foundation

I think this could help with the fact that it is hard to make a static dump of all media files at their current size. Making this more distributed and fluid could help.
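For what it's worth, here is a minimal sketch of what that participation model could look like using the stock IPFS command-line tools (`ipfs add`, `ipfs pin add`) driven from Python. The helper names and the idea of a published CID list are assumptions for illustration only; nothing here reflects an agreed design.

```python
import subprocess

def publish(path):
    """Add a local file to the IPFS node and return its CID (what
    Wikimedia, or any mirror, would publish in a CID list)."""
    out = subprocess.run(["ipfs", "add", "--quiet", path],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def pin(cid):
    """Pin a published CID on this node, contributing to hosting it."""
    subprocess.run(["ipfs", "pin", "add", cid], check=True)

# A volunteer mirror could walk a published CID list and pin just the
# subset of files it is interested in helping to host.
```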