
300GB of files on a hard disk in need of a URI or WMF direct upload
Closed, DeclinedPublic

Description

The project to upload 100,000 high-quality medical images from the Wellcome Library is reaching a critical point. I have been offered a hard disk with the high-resolution images, as there are security and bandwidth issues with hosting these on the Library's servers for mass upload. I believe these would be around 300GB in total.

I can host these from home, but my upload would probably have to be limited to 50 or 80GB maximum per month, and even then my broadband provider might object. I am unsure if we want to white-list home-built temporary FTP servers, but that would be needed for this solution to work.

Is there a Wikimedia solution for uploading the files? If the WMF covered postage, I could post the hard disk to operations in the USA to direct upload (or expose as a virtual drive so that the GWT can upload these to Commons).

Suggestions welcome. No particular rush as I can get on with some initial uploads; however, doing the majority of the upload in advance of Wikimania in August would be handy so that this can be used as a case study.


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=48205
https://rt.wikimedia.org/Ticket/Display.html?id=8007

Details

Reference
bz67477

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:31 AM
bzimport set Reference to bz67477.
bzimport added a subscriber: Unknown Object (MLST).

That's a very good question, as mailing the hard drive would probably be the most efficient option. I'll bring it up on the Ops & Multimedia mailing lists.

By the way, what format and size are the images? (averages and extremes).

My solution was https://bugzilla.wikimedia.org/show_bug.cgi?id=48205. Postage did cost a fair bit but this had been anticipated (and was paid for) by the GLAM institution in question.

Wellcome test image

Test image while discussing upload options. The source for a lower resolution version is at http://wellcomeimages.org/indexplus/image/L0034718.html, including the CC-BY license.

Attached:

L0034718.jpg (2×2 px, 3 MB)

Size and format:

Added test JPEG image above. I think this is a good example of a small file at 3 MB; I would speculate that images may be 5x to 10x larger than this. It will depend on the sub-library/digitization project. I'll be able to work it out once we have the hard disk released (they are arranging to make it at the moment). If you open the JPEG you can see it has good quality EXIF data giving it context, though I would use the Wellcome online catalog to create the image pages as well as linking to the main catalog pages in addition to the image library (two separate things).

There are no plans to upload other formats, so we should not have thumbnail creation problems for this batch upload project.

I have yet to consider how to add useful Commons categories, this will be something for me to experiment with, probably along similar lines to my Department of Defense uploads... a topic to discuss on the COM:BATCH project page.

(In reply to Gilles Dubuc from comment #1)

That's a very good question, as mailing the hard drive would probably be the
most efficient option. I'll bring it up on the Ops & Multimedia mailing
lists.

It's been done before, and can be done again.

As long as ops are aware, so it can be correctly received at a data centre (presumably EQIAD)

I suspect WMF can arrange carriage too

Fæ, what country are you in?

Sam et al;

I'm in London (UK) as are the Wellcome group I'm dealing with. A contact and address would be timely now, can someone decide how that will work?

The Wellcome expect to have the disk ready on the 14th, so if it could get to Ops and uploaded a week later, I might have a chance to get the uploads underway or even complete before Wikimania. I'm presuming that the files can be uploaded and available as a virtual disk that the GWToolset can treat as a set of URLs to generate the Commons image pages.

The files will be as the example attached previously. I also have a test set of 700 locally so that I can get ready for the actual run.

Category for the 730 example files is at:
http://commons.wikimedia.org/wiki/Category:Wellcome_images_%28test_set%29

Note that my cropping of the credit bar from images also loses the EXIF data; this will not be true of the 100,000 to be uploaded, as these should not have the bar.

(In reply to Fæ from comment #8)

The Wellcome expect to have the disk ready on the 14th, so if it could get
to Ops and uploaded a week later, I might have a chance to get the uploads
underway or even complete before Wikimania. I'm presuming that the files can
be uploaded and available as a virtual disk that the GWToolset can treat as
a set of URLs to generate the Commons image pages.

I would presume such an upload would use importImages.php (which would require every file on the disk to be accompanied by a .txt file containing the wikitext for what should go on the image page).

It really would be better done independently of the image pages being generated. This is just an upload of several numbered directories full of files.

I do not know the contents of the disk in advance, and if this is to be a GWToolset upload (which is what we really want), then the idea is to host the files somewhere on a WMF server so that I can generate an XML file (using their listing) which includes URLs to the internal files and supplies the metadata from the Wellcome catalog, correlated to template parameters.
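As a rough illustration of the XML step described above, the following sketch builds one record per file from a directory listing and a catalog lookup. The element names (`records`, `record`, `title`, `description`, `url`), the record keys, and the base URL are hypothetical placeholders chosen for this example, not a layout confirmed in this thread; whatever structure is used would still need to be mapped to template parameters in the GWToolset itself.

```python
import xml.etree.ElementTree as ET


def build_metadata_xml(records, base_url):
    """Build a metadata XML string with one <record> per file.

    `records` is a list of dicts with the hypothetical keys
    "title", "description" and "path" (path relative to base_url).
    """
    root = ET.Element("records")
    for rec in records:
        item = ET.SubElement(root, "record")
        ET.SubElement(item, "title").text = rec["title"]
        ET.SubElement(item, "description").text = rec["description"]
        # URL the batch tool would fetch the hosted file from (assumed layout).
        ET.SubElement(item, "url").text = f"{base_url.rstrip('/')}/{rec['path']}"
    # ElementTree escapes characters such as & and < in the text for us.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = [{
        "title": "L0034718",
        "description": "Example description taken from the catalog",
        "path": "L0034001-L0035000/L0034718.jpg",
    }]
    print(build_metadata_xml(sample, "https://example.org/wellcome"))
```

Generating the XML from the server's own file listing keeps the URLs and the catalog metadata in one place, which is the point of hosting the files first and creating the image pages afterwards.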

If there is some reason that the files cannot be uploaded to a (temporary) folder on a server without creating all the Commons pages at the same time, then I could generate a special text file (or text files for every image, which I presume is what would be needed for 'importImages.php'), but this means I need to receive the Wellcome disk first at home and hold on to it for (probably) a couple of weeks (as it requires a bit more non-standard fiddling about), before posting the reworked disk to the WMF myself. Less ideal, as I doubt that I could have much of the batch upload done in advance of Wikimania.

I have received the disk from the Wellcome Trust. It's a pocket-sized USB drive. There are just over 50,000 high-resolution medical history image files (I'm not sure how they chose this set), which are sorted into sub-folders based on ranges of image numbers. There appears to be 144 GB to be uploaded, so smaller than originally estimated.

Could I have a single point of contact to move this forward? It would be possible to get these done before Wikimania, but I need to confirm who to post the disk to in the USA (and whether the WMF will pay for postage) and whether the image page texts need to be included, or whether my proposal to load the files and have the GWToolset then create image pages for them can be worked out.

(In reply to Fæ from comment #12)

whether the image page texts need to be included, or whether my proposal to
load the files and have the GWToolset then create image pages for them can
be worked out.

have you made a local copy of the 144GB that you can refer to in order to make descriptions? (after mailing the disk)

my initial thought when I heard you wanted to do this with gwtoolset was that we could mount the disk and copy all the content to a tool labs tool and then gwtoolset could get it.

But I guess bug 68264 makes that impossible.

you need to either find a developer to work with you on figuring out gwtoolset+local HDD (good luck) or IMO your best bets to get this done in time are:

  • send disk and arrange for upload with pregenerated txt files (no gwtoolset) or
  • upload directly to someplace that can allow use of gwtoolset. e.g. s3 (my back of napkin says 7 days * 2 mbit is ~147GB. is it impossible to find a connection with 5 or 10 mbit upstream?)
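The back-of-napkin figure above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1 Mbit = 1,000,000 bits, 1 GB = 1,000,000,000 bytes) and a link that runs at its full rate for the whole period with no protocol overhead, and the function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def transferable_gb(mbit_per_s, days):
    """Decimal gigabytes that fit through a link running flat out for `days`."""
    bits = mbit_per_s * 1_000_000 * days * SECONDS_PER_DAY
    return bits / 8 / 1_000_000_000


def days_needed(size_gb, mbit_per_s):
    """Days required to push `size_gb` decimal gigabytes at `mbit_per_s`."""
    bits = size_gb * 8 * 1_000_000_000
    return bits / (mbit_per_s * 1_000_000) / SECONDS_PER_DAY


if __name__ == "__main__":
    # Compare the 2, 5 and 10 Mbit/s upstream rates mentioned in the comment.
    for rate in (2, 5, 10):
        print(f"{rate:>2} Mbit/s: {transferable_gb(rate, 7):6.1f} GB in 7 days, "
              f"144 GB takes {days_needed(144, rate):.1f} days")
```

Under these assumptions a 2 Mbit/s upstream moves roughly 150 GB in a week, which is in the same ballpark as the estimate in the comment; real-world throughput would be lower.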

(In reply to jeremyb from comment #13)

you need to either find a developer to work with you on figuring out
gwtoolset+local HDD (good luck) or IMO your best bets to get this done in
time are:

  • send disk and arrange for upload with pregenerated txt files (no
gwtoolset) or
  • upload directly to someplace that can allow use of gwtoolset. e.g. s3 (my
back of napkin says 7 days * 2 mbit is ~147GB. is impossible to find a
connection with 5 or 10 mbit upstream?)

GWToolset has rate limiting in it, so that would slow things down.

It would probably be much more efficient and faster to do option (a).

Option a is "send disk and arrange for upload with pregenerated txt files"

To whom can such disks be sent?

For reference and comparison, a non-technical request for help in doing mega-uploads was made at https://meta.wikimedia.org/wiki/Grants:IdeaLab/Commons_and_backlog_buster_speed_upgrades

I do not know what options are available for people who have lots of media to upload and difficulty uploading it.

(In reply to Lane Rasberry from comment #15)

Option a is "send disk and arrange for upload with pregenerated txt files"

To whom can such disks be sent?

Either direct to datacenter (if there's a suitable carrier and is arranged with ops in advance) or else proxied through WMF office.

Is that the option you want? Does Fae have his own copy of the 144GB?

Following up with Jeremy by email, but confirming here:

Yes, I will take a copy of the files and keep them in the UK, as well as posting the USB pocket disk (via Royal Mail).

The Wellcome are investigating why there were 50,000 files rather than 100,000; there may be a need to upload a second tranche of another 50,000.

If we can "unlock" the need for the text files before uploading that would be handy, though as postage takes 7 days, that waiting time might be used to create the image pages (or the GWToolset XML equivalent) and share it over the internet rather than delaying postage.

( Commons categorization will be a challenge, though I'm considering this being a post-upload bot task to refine crude categories with better ones, rather than attempting to solve all the details up front. )

shipment is RT 8007. Fae is generating file description pages now and those can be uploaded to tool labs to be used with the images on the HDD.

Who wants to take the upload side of things?

I can help out with uploading them once the disk is hooked up.

The disk has been posted. It can be tracked at https://www.royalmail.com/track-your-item with number LY706436566GB.

Over 19,000 of the 50,000+ images have text files next to them in the directories (all under "Wellcome images"). I will finish the rest of the text files and will probably load that as a zip file to this thread, unless someone suggests another way of doing it.

(In reply to Fæ from comment #20)

The disk has been posted. It can be tracked at
https://www.royalmail.com/track-your-item with number LY706436566GB.

Over 19,000 of the 50,000+ images have text files next to them in the
directories (all under "Wellcome images"). I will finish the rest of the
text files and will probably load that as a zip file to this thread, unless
someone suggests another way of doing it.

Note, there is a 10 megabyte limit to bugzilla uploads. If the zip of the file descriptions ends up being too big, I'm sure we can stick it on tool labs somewhere.
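Whether the zipped descriptions fit under that limit can be checked locally before trying to attach them. The sketch below uses only the Python standard library; the function name is hypothetical, and treating "10 megabyte" as 10 × 1024 × 1024 bytes is an assumption, since the exact threshold is not stated here.

```python
import zipfile
from pathlib import Path

# Assumption: the 10 megabyte attachment limit counted in binary megabytes.
LIMIT_BYTES = 10 * 1024 * 1024


def zip_descriptions(src_dir, zip_path):
    """Zip every .txt file under src_dir and return the archive size in bytes."""
    src_dir = Path(src_dir)
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for txt in sorted(src_dir.rglob("*.txt")):
            # Store paths relative to src_dir so the sub-folder layout is kept.
            zf.write(txt, txt.relative_to(src_dir).as_posix())
    return zip_path.stat().st_size
```

Comparing the returned size with `LIMIT_BYTES` shows whether the archive can be attached directly or needs to be hosted somewhere else.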

(In reply to Bawolff (Brian Wolff) from comment #21)

(In reply to Fæ from comment #20)

...

Note, there is a 10 megabyte limit to bugzilla uploads. If the zip of the
file descriptions ends up being too big, I'm sure we can stick it on tool
labs somewhere.

I am now 80% done in generating the text files, it will probably be ready by Friday or Saturday. In a non-compressed directory these look like they will hit around 200MB, perhaps more. I'm happy to use a free filesharing service or do something with tool labs (if someone can email me with a pointer on how best to do that).

Based on Royal Mail's estimated delivery times, the disk should be with the WMF early next week.

(In reply to Fæ from comment #22)

I am now 80% done in generating the text files, it will probably be ready by
Friday or Saturday. In a non-compressed directory these look like they will
hit around 200MB, perhaps more. I'm happy to use a free filesharing service
or do something with tool labs (if someone can email me with a pointer on
how best to do that).

Actually if you're comfortable using Google Drive or Dropbox (or anything that I can access with wget), those would be preferred to tool labs. Production can't access anything on labs, so I'd have to download the file locally and then upload it to prod. Not a huge deal, but if it's convenient for you to use not-labs, that would be appreciated :)

Updates -

  • I have been in discussion with the Wellcome Trust and they are creating another hard disk including the remaining images. The total upload we are planning should be the 100,000 images.
  • I had a blunder today moving files (good grief!) and now have to recreate the remaining image text pages. The disk posted to the WMF has text files for nearly 20,000 of the 50,000 images on the disk. Considering the time left before Wikimania, this will have to do for the first tranche and any announcement at Wikimania.
  • According to the postage tracking, the disk has arrived. So I'm hoping that the first tranche upload can be complete before the London hackathon (starts Wednesday).

I believe the disk with the first tranche of 50,000 images has been sitting with Sam for a fortnight. Any update? The Wellcome are keen to get on with this project.

It should be noted that due to an absence of any process on this way of uploading large collections of images, this has taken so long that it would have been almost as effective if I had stuck to uploading the collection through my home broadband.

Rather than continuing to disappoint the Wellcome (I've had a couple of chase-up contacts now), I am proceeding with upload via my home broadband with about 10% completed over the last week. It should take me around 5 or 6 weeks this slow way, however the delay in attempting to upload by disk has now been over 2 months and we have still not succeeded despite 2 disks being posted and a third waiting at home (2 USB disks purchased by the Wellcome and the third by the WMF).

Based on this experience, I think we have to advise GLAMs and others who may be interested in disk based donations, rather than using the GWToolset or similar, that there is no process for ensuring this happens.

I am disappointed that there is not a process in place for the Wikimedia Foundation staff to provide support for accepting large GLAM uploads. This donation is one of the most generous to be given in the history of the Wikimedia movement and it is unfortunate that we do not have infrastructure to promise more than sketchy and unpredictable responses to valuable potential partners.

I do not find any fault with anyone because we have not identified anyone at the Wikimedia Foundation whose responsibility it is to provide community support for media donations, but there is a problem here in that extremely valuable donations surfaced by Wikimedia community members are not getting respect or acknowledgment in the expected channels for assistance.

Again, no one is to blame for this but this does put a burden on the community to make a demand for change and acknowledgement. I wish that community relationships were not so poorly regarded, and again I emphasize - this is a systemic communication problem, and not the fault of any one or any group.

GLAMs deal with chapters (and other local organisations). It's their role to provide this kind of service IMO, not really that of the WMF. Offering this is not rocket science (easy to have a server in a datacenter for example). In Switzerland WMCH does it and it works well. I have already proposed to Fae to help and I'm sure WMCH would be happy to again consider helping if necessary and possible.

@Kelson We are all harmed when we fail to perform basic functions in the view of our most valuable partners. Perhaps it is not the role of the WMF to provide these services but by design of our community it is always the WMF role to look bad in the eyes of the public when anything in any context goes wrong. To the extent that it is possible, I would like for the WMF and the larger Wikimedia community to seem functional and I regret when there are problematic relationships with critical partners.

If WMCH would like to take an uploading role for content on physical media then I can imagine that would be appreciated. It is always a hassle to manage large media uploads, and completely impossible to do in developing countries with slow Internet. It would be awesome if some chapter would post a mailing address and say, "Send us anything. Here are our formatting guidelines."

In this case, Fae can speak for himself about whether help is necessary, but this issue will arise again and it would be nice to have best practices in place.

(In reply to Lane Rasberry from comment #27)

I am disappointed that there is not a process in place for the Wikimedia
Foundation staff to provide support for accepting large GLAM uploads. This
donation is one of the most generous to be given in the history of the
Wikimedia movement and it is unfortunate that we do not have infrastructure
to promise more than sketchy and unpredictable responses to valuable
potential partners.

I do not find any fault with anyone because we have not identified anyone at
the Wikimedia Foundation whose responsibility it is to provide community
support for media donations, but there is a problem here in that extremely
valuable donations surfaced by Wikimedia community members are not getting
respect or acknowledgment in the expected channels for assistance.

Again, no one is to blame for this but this does put a burden on the
community to make a demand for change and acknowledgement. I wish that
community relationships were not so poorly regarded, and again I emphasize -
this is a systemic communication problem, and not the fault of any one or
any group.

I'd be interested in hearing what happened here from the WMF point of view. There were rumours floating around that the disks were sent incorrectly and never made it to the WMF, at least the first time around (is that true?). In the past this sort of thing has been handled quite easily, so what happened this time?

(In reply to Kelson [Emmanuel Engelhart] from comment #28)

GLAMs deal with chapters (and other local organisations). It's their role to
provide this kind of service IMO, not really that of the WMF. Offering
this is not rocket science (easy to have a server in a datacenter for
example). In Switzerland WMCH does it and it works well. I have already
proposed to Fae to help and I'm sure WMCH would be happy to again consider
helping if necessary and possible.

As far as I know, neither WMCH nor any other chapter has access to the Wikimedia datacenters to perform a server-side upload [1], which was what Fæ was looking for here.

[1] https://commons.wikimedia.org/wiki/Help:Server-side_upload

Trying to summarize the discussion in the RT ticket, "Sam should have a second disc, the first was lost in the post." and afterwards the RT ticket was closed as "The rest can be handled in bugzilla", both written on August 26.

Regarding the request for updates, I'd assume that Sam might know?

(In reply to Bawolff (Brian Wolff) from comment #30)
...

I'd be interested in hearing what happened here from the WMF point of view.
There were rumours floating around that the disks were sent incorrectly and
never made it to the WMF, at least the first time around (is that true?).
In the past this sort of thing has been handled quite easily, so what
happened this time?

Executive summary: Disk sent correctly to USA, disk received correctly, disk "lost" by WMF goods-in supplier.

The Wellcome USB disk was sent and received correctly as legally verified by Royal Mail (with a perfectly correct address as advised by the WMF, I have a photograph of the package, and with postage paid for by WikiprojectMed). The WMF appears to have a long term issue with its goods-in management contractor, as if a delivery is not through one of their preferred "friendly" U.S. delivery services, they manage to always lose the package. In the U.K. this would be considered racketeering and I'm amazed that the WMF has put up with this behaviour from their supplier for a period of years.

I do not know who the Royal Mail use as a supplier for deliveries in California. If the WMF insists that all deliveries must happen through named delivery agents due to a "closed shop" policy from its supplier, I suggest that it pays for the delivery rather than expecting GLAM institutions to collaborate in activities that may fail to meet UK charity law and be more expensive than standard international Royal Mail from the UK.

A second disk WMF provided was sent within the UK (postage paid by me). Due to the length of time passing with no update or responses, I have gone ahead and started a slow direct upload instead. At the time of writing I'm around 20% complete, so, as mentioned earlier, this is a much more efficient and effective process than hoping the WMF can help out: had I done this in the first place, I would have completed the upload at least a calendar month ago.

Executive summary: Disk sent correctly to USA, disk received correctly, disk
"lost" by WMF goods-in supplier.

The Wellcome USB disk was sent and received correctly as legally verified by
Royal Mail (with a perfectly correct address as advised by the WMF, I have a
photograph of the package, and with postage paid for by WikiprojectMed). The
WMF appears to have a long term issue with its goods-in management
contractor, as if a delivery is not through one of their preferred
"friendly" U.S. delivery services, they manage to always lose the package.
In the U.K. this would be considered racketeering and I'm amazed that the
WMF has put up with this behaviour from their supplier for a period of years.

I do not know who the Royal Mail use as a supplier for deliveries in
California. If the WMF insists that all deliveries must happen through named
delivery agents due to a "closed shop" policy from its supplier, I suggest
that it pays for the delivery rather than expecting GLAM institutions
to collaborate in activities that may fail to meet UK charity law and be
more expensive than standard international Royal Mail from the UK.

I'm under the impression it's not the contractor who has the problem, but that it's actually illegal under United States law for USPS to deliver to them (the USA is a weird place...).

In any case, you were told in several places not to send via USPS or the disk wouldn't get there. That part is on you.

A second disk WMF provided was sent within the UK (postage paid by me). Due
to the length of time passing with no update or responses, I have gone ahead
and started a slow direct upload instead. At the time of writing I'm around
20% complete, so, as mentioned earlier, this is a much more efficient and
effective process than hoping the WMF can help out: had I done this in the
first place, I would have completed the upload at least a calendar month ago.

So what happened here? Can we confirm whether the intermediary even received it?

(In reply to Fæ from comment #33)

I do not know who the Royal Mail use as a supplier for deliveries in
California,

Why do you mention California? The shipment was addressed to eqiad (Virginia). (or else we have bigger problems than I realized)

Executive summary: Disk sent correctly to USA, disk received correctly, disk
"lost" by WMF goods-in supplier.

The Wellcome USB disk was sent and received correctly as legally verified by
Royal Mail (with a perfectly correct address as advised by the WMF, I have a
photograph of the package, and with postage paid for by WikiprojectMed). The
WMF appears to have a long term issue with its goods-in management
contractor, as if a delivery is not through one of their preferred
"friendly" U.S. delivery services, they manage to always lose the package.
In the U.K. this would be considered racketeering and I'm amazed that the
WMF has put up with this behaviour from their supplier for a period of years.

I'm not sure what you mean by contractor here. The WMF does not own the building that the shipment was addressed to. They rent server cages/rackspace/etc. in that building directly from the people that manage the receipt of shipments. WMF (and I guess many other customers of that datacenter too) is unable to staff the delivery point at all hours when packages are accepted so the datacenter accepts them on behalf of tenants if and only if they are delivered to the right place and with a tracking number that has been registered with the receiving staff in advance.

if the WMF insists that all deliveries must happen through named
delivery agents due to a "closed shop" policy from its supplier,

I believe at least 4 different carriers have successfully delivered to eqiad. I think "racketeering" (as you call it) is probably not the issue in this case.

I suggest
that it pays for the delivery rather than expecting GLAM institutions
to collaborate in activities that may fail to meet UK charity law and be
more expensive than standard international Royal Mail from the UK.

I don't think there's anything wrong with the way this shipment was handled on WMF end.

You were clearly told by multiple people (across 3 separate mails) which carriers we already knew worked (fedex/ups/dhl) and yet you insisted that Royal Mail would somehow end up being delivered by DHL. I thought it was more likely that Royal Mail would become USPS. Maybe we'll never find out exactly what happened.

I believe the problem with USPS is that they deliver to the wrong place. (Maybe the building has multiple entrances and all other carriers have learned where to go but USPS is unwilling to adapt? not sure)

Domestic letters in the US are returned to their sender free of charge if the address is wrong or delivery is refused. I guess the same happens with domestic packages? but I have no idea about international shipments.

(In reply to Bawolff (Brian Wolff) from comment #34)

I'm under the impression it's not the contractor who has the problem, but
that it's actually illegal under United States law for USPS to deliver to
them (the USA is a weird place...).

huh, maybe that explains them delivering to the wrong place. I may go read that law someday...

In any case, you were told in several places not to send via USPS or the
disk wouldn't get there. That part is on you.

right, see above

re process for future:

(can we move this part to a mailing list? [[mail:glam]]?)

I think it's reasonable to expect batch uploaders to expend some of their own time/effort/bandwidth on an upload (including making sure that the format is right, with pairs of images and .txt files (complete file description pages in mediawiki syntax) with matching names). And in any case things do get lost in the mail in the normal course of running a mail service. There should always be a backup plan if you intend to coordinate timing with a scheduled event (e.g. upload a subset or find a place with more bandwidth to do the upload from).
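The pairing requirement described above can be checked before a disk is posted. The sketch below walks a directory tree and lists images that have no description file beside them. It assumes "matching names" means the same base name with a .txt extension (for example L0034718.jpg next to L0034718.txt) and that the batch is JPEG only; both are assumptions for this example, and the function name is hypothetical.

```python
from pathlib import Path

# Assumption: the batch contains JPEG files only.
IMAGE_SUFFIXES = {".jpg", ".jpeg"}


def find_unpaired_images(root):
    """Return images under `root` that have no sibling .txt description file."""
    missing = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            # Assumed convention: same base name, .txt extension.
            if not path.with_suffix(".txt").is_file():
                missing.append(path)
    return missing
```

An empty result means every image has a description file; a non-empty one gives the exact files still to be written, which is easier to act on than a bare count.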

Keep in mind that shipping an HDD or other media (CD, USB stick, etc.) requires a non-trivial amount of time and effort from datacenter staff and then developers (shell users).

The gains from mailing vs. upload via internet will vary from location to location (some places have faster links than others) and maybe the option to have chapters handle some of these should be explored (comment 28). That would allow for some shipments to be domestic instead of international (or at least within EU instead of intercontinental). (would that make shipping more reliable? or at least cheaper?)

I think that bug 48205 went pretty well (comments 11 through 18) and that it could be replicated for future uploads.