
Operational issues for very large TIFFs
Closed, ResolvedPublic

Description

When a bot or user visits a wiki's Special:NewFiles page, and some other pages like this page, missing thumbnails are created on the fly. This can potentially flood the server(s) with thumbnail creation jobs, which slow down the wiki or potentially bring down its ability to serve web pages. GWToolset has the potential to create this situation when it uploads several large media files at once; @see http://lists.wikimedia.org/pipermail/glamtools/2014-May/000135.html.

During the Zürich hackathon I spoke with Aaron Schulz, Faidon Liambotis, and Brion Vibber about approaches to dealing with this issue. In summary, the idea Aaron came up with is to create initial thumbnails when the original media file is downloaded to the wiki. This should block the appearance of the title on the new files page and anywhere else until the thumbnails and the title creation/edit have completed. Aaron thought, and Faidon and I agree, that further throttling of GWToolset will not help resolve the issue.

I am currently looking into implementing this approach and will use this bug to track activity on it.


Version: unspecified
Severity: critical
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=49118

Details

Reference
bz65217

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 3:21 AM
bzimport set Reference to bz65217.
bzimport added a subscriber: Unknown Object (MLST).

My initial thought on how to approach this was to utilise methods within thumb.php, but those methods are not accessible to jobs run in the job queue.

Another approach, discussed with Gilles and Gergő on IRC, involves uploading the media file to an upload stash, creating thumbnails based on that media file, and then creating the title for the media file. This requires re-architecting the way the job queue jobs currently run, which I don't have time to work on at the moment; I will try to get to it when time permits.

The consensus on the ops list was that https://gerrit.wikimedia.org/r/#/c/132112/ is not enough to safely resume uploads, and bug 52045 probably would not help much. The current plan is to

  • extract a large thumbnail from the file, and use that thumbnail to create smaller thumbnails (possibly in a chain, i.e. use some of those smaller thumbnails to create even smaller thumbnails); see the sketch after this list
  • make this thumbnail generation happen immediately after upload
  • limit the number of expensive thumbnail generations that can happen in parallel
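
A minimal sketch of the chained approach, assuming the Imagick PHP extension and made-up paths and sizes (an illustration of the idea only, not the actual MediaWiki code):

<?php
// Read the huge original once, write one large intermediate "bucket"
// thumbnail, then derive all the smaller sizes from that intermediate.
function makeChainedThumbs( string $sourcePath, string $destDir ): void {
    $large = new Imagick( $sourcePath );          // the only expensive decode
    $large->scaleImage( 4096, 0 );                // 0 = keep aspect ratio
    $bucketPath = $destDir . '/bucket-4096px.jpg';
    $large->writeImage( $bucketPath );
    $large->clear();

    // Every further size is rendered from the 4096px bucket instead of
    // re-fetching and re-decoding the multi-hundred-MB original each time.
    foreach ( [ 1280, 800, 320, 120 ] as $width ) {
        $thumb = new Imagick( $bucketPath );
        $thumb->scaleImage( $width, 0 );
        $thumb->writeImage( $destDir . "/thumb-{$width}px.jpg" );
        $thumb->clear();
    }
}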

I recently realized that we still download the source file, even if it's above $wgMaxImageArea (e.g. https://commons.wikimedia.org/wiki/File:Map_of_New-York_Bay_and_Harbor_and_the_environs_-_founded_upon_a_trigonometrical_survey_under_the_direction_of_F._R._Hassler,_superintendent_of_the_Survey_of_the_Coast_of_the_United_States;_NYPL1696369.tiff is a 540 MB file, which takes 37 seconds just to get to the error message that says we aren't even going to attempt to thumbnail the file). I've submitted https://gerrit.wikimedia.org/r/135101 to fix this.
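
For illustration, the core of that fix is just checking the recorded dimensions against $wgMaxImageArea before the original is fetched at all. A rough, hypothetical sketch (the names below are not MediaWiki's actual internals):

<?php
// Decide from metadata alone whether a render should even be attempted,
// so a too-large file is rejected without downloading its 500+ MB original.
function canAttemptThumbnail( array $fileInfo, int $maxImageArea ): bool {
    // width/height come from the image table / stored metadata,
    // so no bytes of the original file need to be transferred.
    return ( $fileInfo['width'] * $fileInfo['height'] ) <= $maxImageArea;
}

// Example: a 16000 x 16000 px map (256 Mpx) against an illustrative 50 Mpx cap.
$file = [ 'width' => 16000, 'height' => 16000 ];
var_dump( canAttemptThumbnail( $file, 50000000 ) );  // bool(false): skip the download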

I've missed much of the events that unfolded around this situation. Looking back in the mailing list archives, I'm not even clear whether the problem is Swift being overloaded, or the time taken to actually thumbnail the image (or both, or something else). One of the earlier emails says:

We just had a brief imagescaler outage today at approx. 11:20 UTC that
was investigated and NYPL maps were found to be the cause of the outage.
Besides the complete outage of imagescaling, Swift's (4Gbps) bandwidth
was saturated again, which would cause slowdowns and timeouts in file
serving as well.

So possibly (correct me if I'm off base here) it's just the Swift network connection being overloaded, which in turn causes the image scalers to wait longer before the original image asset is delivered to them, causing them to be overloaded as well. If so, the fact that we are fetching the original >100 MB source file, only to not even try to scale it, and doing so repeatedly until 4 attempts at a specific file width trigger attempt-failures that stop it for an hour on that particular size only, may be a very significant contributor to the situation.

The attempt-failures thing only increments the cache key after the attempt has failed. Given it was taking ~38 seconds just to download the file to the image scaler (in the case I tried), a lot of people could try to render that file in that time before the key is incremented (still limited by the pool counter though). Maybe that key should be incremented at the beginning of the request. Sure, in certain situations a couple of people might get an error for the couple of seconds it takes a good file to render, but that would only last a couple of seconds and would much more quickly limit the damage a stampede of people requesting a hard-to-render file could do.
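
As a sketch of that suggestion, using APCu as a stand-in for whatever cache backs the real attempt-failures counter (key names and threshold are made up for illustration):

<?php
// Bump the per-size failure key *before* starting the render, so concurrent
// requests arriving while a slow render is still downloading the original
// already see the counter rising; roll it back only on success.
function tryRenderThumbnail( string $key, callable $render, int $maxAttempts = 4 ): bool {
    apcu_add( "thumbfail:$key", 0, 3600 );   // create counter with a 1h TTL if missing
    $attempts = apcu_inc( "thumbfail:$key" );

    if ( $attempts > $maxAttempts ) {
        return false;                        // too many recent attempts: fail fast
    }
    if ( $render() ) {
        apcu_dec( "thumbfail:$key" );        // success: give the attempt back
        return true;
    }
    return false;                            // failure: the increment stands
}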

I was reading over the thread on multimedia - I'm not entirely sure the Special:NewFiles theory makes sense; I think it's more likely someone viewed a category of the TIFF uploads from GWToolset or something like that.

So we have this graph for April 21, with a peak from about 2:55 to 3:20 UTC: http://lists.wikimedia.org/pipermail/multimedia/attachments/20140420/35015082/attachment-0001.png

However, when you look at the uploads from around that time, the peak in large TIFF uploads does not correspond with the peak in the graph:

MariaDB [commonswiki_p]> select substring( img_timestamp, 9, 3) "time", count(*) "# images", round(MAX(img_width*img_height/1000000)) "max Mpx", round( avg(img_width*img_height/1000000)) "avg mpx", round(avg (img_size/(1024*1024))) "avg MB", round(sum(img_size/(1024*1024))) "total mb", round( max( img_size/(1024*1024))) "max mb" from image where img_timestamp > '20140421010000' and img_timestamp < '20140421050000' and img_minor_mime = 'tiff' and img_user_text = 'Fæ' group by substring( img_timestamp, 1, 11);
+------+----------+---------+---------+--------+----------+--------+
| time | # images | max Mpx | avg mpx | avg MB | total mb | max mb |
+------+----------+---------+---------+--------+----------+--------+
| 010  |       40 |      60 |      42 |    121 |     4822 |    172 |
| 011  |       40 |      39 |      39 |    110 |     4409 |    112 |
| 012  |       19 |      60 |      42 |    120 |     2280 |    172 |
| 013  |       37 |      60 |      60 |    171 |     6328 |    173 |
| 014  |       17 |      60 |      60 |    172 |     2916 |    173 |
| 015  |       20 |      60 |      60 |    171 |     3427 |    173 |
| 020  |       35 |      60 |      60 |    171 |     5986 |    173 |
| 021  |       15 |      60 |      60 |    170 |     2555 |    172 |
| 022  |       26 |      60 |      60 |    172 |     4463 |    173 |
| 023  |       18 |      60 |      60 |    171 |     3079 |    173 |
| 030  |        6 |      60 |      59 |    170 |     1018 |    173 |
| 032  |        5 |      60 |      60 |    171 |      857 |    173 |
| 033  |        2 |      60 |      60 |    172 |      343 |    173 |
+------+----------+---------+---------+--------+----------+--------+
13 rows in set (0.01 sec)

That is, between 2:50-3:20 there was a total of 6 TIFF files uploaded by Fæ with GWToolset (out of 141 total uploads in that time period, 4.2%), compared to, say, 1:00-1:30, which didn't have a spike but had 99 TIFF files uploaded by Fæ (out of 373 total, 27%). If it was caused by viewing Special:NewFiles, I would expect the spike to come when the 99 TIFFs were uploaded instead of when the 6 TIFFs were uploaded.

Which leads me to suspect the issue was not people viewing Special:NewFiles a lot, but maybe people viewing something else that had a lot of uncached thumbnail hits associated with it. Maybe the category for the batch upload, which would have up to 200 images on it (probably many of them over $wgMaxImageArea, so triggering what I mentioned in comment 4, and the rest simply never viewed before), was viewed by several people at the same time. [[Commons:Category:NYPL maps (over 50 megapixels)]] was linked in the VP at the time (although it had been for about a day); maybe somebody just hit reload on that page repeatedly for some unknown reason and that overloaded things. Or something.

With all that said, I guess even if it wasn't Special:NewFiles, it probably doesn't change much, as it's still related to on-demand thumbnailing.

(In reply to Bawolff (Brian Wolff) from comment #5)

I was reading over the thread on multimedia - I'm not entirely sure the
Special:Newfiles theory makes sense, I think its more likely someone maybe
viewed a category of the tiff uploads from gwtoolset or something like that.

<snip>

Which leads me to suspect the issue was not with people viewing
Special:NewFiles a lot, but maybe viewing something else that had a lot of
uncached thumbnail hits associated. Maybe the category for the batch upload,
which would have up to 200 images on it, probably a lot over the
$wgMaxImageArea so triggering what I mentioned in comment 4 - and the rest
might simply have not been viewed before, was viewed by several someones at
the same time. [[Commons:Category:NYPL maps (over 50 megapixels)]] was
linked in the VP at the time (although it had been for about a day), maybe
somebody just hit reload on that page repetitively for some unknown reason
and that overloaded things. Or something.

With all that said, I guess even if it wasn't Special:Newfiles, it probably
doesn't change much as its still related to on-demand thumbnailing.

You could be on to something. For example, all of the thumbnails in [[commons:Category:Sanborn maps of Staten Island]] are broken when you go to view an image in full resolution. It doesn't have to be someone hitting reload repeatedly; the call for the thumb regenerates on its own once it fails. For example:

https://upload.wikimedia.org/wikipedia/commons/thumb/9/9d/Staten_Island%2C_Plate_No._12_%28Map_bounded_by_Boyd%2C_Brooks%2C_Mc_Keon%2C_Varian_Cedar%29_NYPL1957089.tiff/lossy-page1-3000px-Staten_Island%2C_Plate_No._12_%28Map_bounded_by_Boyd%2C_Brooks%2C_Mc_Keon%2C_Varian_Cedar%29_NYPL1957089.tiff.jpg

I can open that up in a background browser tab and it just keeps hitting the server over and over for thumbnail requests.

(In reply to Keegan Peterzell from comment #6)

https://upload.wikimedia.org/wikipedia/commons/thumb/9/9d/
Staten_Island%2C_Plate_No.
_12_%28Map_bounded_by_Boyd%2C_Brooks%2C_Mc_Keon%2C_Varian_Cedar%29_NYPL195708
9.tiff/lossy-page1-3000px-Staten_Island%2C_Plate_No.
_12_%28Map_bounded_by_Boyd%2C_Brooks%2C_Mc_Keon%2C_Varian_Cedar%29_NYPL195708
9.tiff.jpg

I can open that up in a background browser tab and it just keeps hitting the
server over and over for thumbnail requests.

I should clarify: My browser (Chrome 34.0.1847.137 m) is giving different behaviors when I open up images from that gallery. One image failed upon its own refresh call six times before halting and returning the proper error message (There have been too many recent failed attempts (4 or more) to render this thumbnail. Please try again later.) Another image reloaded only twice before halting with no error message. Yet another image just keeps reloading without the error message.

(In reply to Keegan Peterzell from comment #7)

I should clarify: My browswer (Chrome 34.0.1847.137 m) is giving different
behaviors when I open up images from that gallery. One image failed upon its
own refresh call six times before halting and returning the proper error
message (There have been too many recent failed attempts (4 or more) to
render this thumbnail. Please try again later.) Another image reloaded only
twice before halting with no error message. Yet another image just keep
reloading without the error message.

And by without the error message, I mean that the server is leaving the field blank.

Error generating thumbnail

Error creating thumbnail:

(In reply to Keegan Peterzell from comment #8)

(In reply to Keegan Peterzell from comment #7)

I should clarify: My browswer (Chrome 34.0.1847.137 m) is giving different
behaviors when I open up images from that gallery. One image failed upon its
own refresh call six times before halting and returning the proper error
message (There have been too many recent failed attempts (4 or more) to
render this thumbnail. Please try again later.) Another image reloaded only
twice before halting with no error message. Yet another image just keep
reloading without the error message.

And by without the error message, I mean that the server is leaving the
field blank.

Error generating thumbnail

Error creating thumbnail:

Well, the blank error message is consistent with an out-of-memory error for a TIFF file (since the process gets killed and doesn't output anything to stdout; other formats return the exit code, but TIFF doesn't). However, your web browser is not supposed to be loading the page over and over again by itself. My copy of Chrome doesn't do that.
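
A small illustration of why the error text can end up empty (the command and wording below are hypothetical, not MediaWiki's actual handler code): a process killed by the OOM killer writes nothing to stdout/stderr, so unless the caller also reports the exit code there is nothing to display.

<?php
// Run a (hypothetical) render command and surface the exit code when the
// process dies without producing any output of its own.
$cmd = 'convert huge.tiff[0] -resize 3000 out.jpg';
exec( $cmd . ' 2>&1', $output, $exitCode );

if ( $exitCode !== 0 ) {
    $detail = trim( implode( "\n", $output ) );
    if ( $detail === '' ) {
        // e.g. SIGKILL from the kernel OOM killer: no output, only a status code.
        $detail = "renderer exited with status $exitCode and produced no output";
    }
    echo "Error creating thumbnail: $detail\n";
}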


Furthermore, looking at the IRC logs - http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20140421.txt - the servers had issues right up until 14:50 UTC on April 21, which is long after Fæ's uploads stopped and dropped off Special:NewFiles/Special:ListFiles. Similarly, for the outage at 11:20 UTC on May 11 - http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-operations/20140511.txt - [[commons:file:Bronx,_V._12,_Double_Page_Plate_No._273_%28Map_bounded_by_Whiting_Ave.,_Ewen_Ave.,_Warren_Ave.,_Hudson_River%29_NYPL2001533.tiff]] is mentioned, which is one of the images uploaded back on April 21, so it was definitely not on Special:NewFiles. (Also, that file is over $wgMaxImageArea, so Gerrit change 135101 would have stopped that particular file from causing a problem. Of course, the IRC log doesn't make clear whether that was the main file causing problems or just one example of many files being requested at the time.)

(In reply to Tisza Gergő from comment #3)

The consensus on the ops list was that
https://gerrit.wikimedia.org/r/#/c/132112/ is not enough to safely resume
uploads, and bug 52045 probably would not help much. The current plan is to

  • extract a large thumbnail from the file, and use that thumbnail to create

smaller thumbnails (possibly in a chain, i.e. use some of those smaller
thumbnails to create even smaller thumbnails)

I sort of did this for TIFF as part of the work to make VIPS work on TIFFs; see Gerrit change 135289.

Sorry for the slow response, I got unCCd from this bug somehow.

(In reply to dan from comment #11)

with these gerrit patches merged, and deployed onto production, is it time
for fae to re-try one of his large tiff uploads?

The changes you mention don't really help:

These only help with thumbnails which completely fail to render, and even for those they have limited effect (as Bawolff pointed out above - the rendering would still take up time and memory until the failure threshold is hit).

Also, the first was merged long ago, and the second right after the first outage, so they did not stop the second one.

These don't really do anything without the two pending ones you mention. (Sorry to be so sluggish on this - we were distracted by troubles with the MediaViewer rollout on enwiki. Also, Gilles is on vacation next week, so unless someone else is willing to review them, not much will happen. I hope to get them merged the following week.)

Bawolff's $wgMaxImageArea patch might help somewhat:

https://gerrit.wikimedia.org/r/#/c/135101/

Not sure if the files involved in the second outage were that large, though.

The multi-step scaling patches might also help, once they get merged:
https://gerrit.wikimedia.org/r/#/c/135289/
https://gerrit.wikimedia.org/r/#/c/135008/
(the second one is only for JPEGs at the moment though)

(In reply to Bawolff (Brian Wolff) from comment #4)

The attempt-failures thing only increments the cache key after the attempt
failed. Given it was taking ~ 38 seconds just to download the file to the
image scalar (in the case I tried), A lot of people could try and render
that file in that time before the key is incremented (Still limited by the
pool counter though). Maybe that key should be incremented at the beginning
of the request.

That would be a semaphore, basically (except that its value would decrease with failures). Isn't that what the FileRender poolcounter does already?

(In reply to Tisza Gergő from comment #13)

(In reply to Bawolff (Brian Wolff) from comment #4)

The attempt-failures thing only increments the cache key after the attempt
failed. Given it was taking ~ 38 seconds just to download the file to the
image scalar (in the case I tried), A lot of people could try and render
that file in that time before the key is incremented (Still limited by the
pool counter though). Maybe that key should be incremented at the beginning
of the request.

That would be a semaphore, basically (except that its value would decrease
with failures). Isn't that what the FileRender poolcounter does already?

Yes. You're right.
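
For reference, a toy, in-process sketch of the semaphore behaviour being discussed (a stand-in for the real FileRender pool counter, which uses a shared daemon rather than a PHP array; this is not its actual API): at most N renders of the same thumbnail run at once, and further requests are turned away immediately.

<?php
// Toy semaphore keyed by thumbnail: acquire a slot before rendering,
// release it afterwards; callers that find the pool full bail out early.
class RenderSemaphore {
    /** @var array<string,int> in-flight renders per key */
    private array $active = [];

    public function __construct( private int $maxWorkers ) {
    }

    public function acquire( string $key ): bool {
        $current = $this->active[$key] ?? 0;
        if ( $current >= $this->maxWorkers ) {
            return false;                 // pool full: show "try again later"
        }
        $this->active[$key] = $current + 1;
        return true;
    }

    public function release( string $key ): void {
        $this->active[$key] = max( 0, ( $this->active[$key] ?? 1 ) - 1 );
    }
}

$pool = new RenderSemaphore( 2 );
$slot = 'Huge_map.tiff@3000px';
if ( $pool->acquire( $slot ) ) {
    try {
        // ... perform the expensive render ...
    } finally {
        $pool->release( $slot );
    }
}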

Cc'ing Sam here because I don't know where else to, about:

samwilson> one thing i've been tinkering with is a system of generating thumbnails offline and plonking them in their correct locations. that'd reduce a pile of the out-of-memory things i see on DH [DreamHost] sites.

(Thanks for the heads-up re this, Nemo.)

My thing isn't really a fix! It's just a simple way for the site administrator to be told that some thumbnail is missing, and where it should go in the filesystem, so that they can generate it locally (i.e. Gimp or whatnot) and upload it (via some easy interface, although I've not considered that bit; scp is my usual).

So, not really a help. But good for memory-poor places like Dreamhost!

(In reply to Sam Wilson from comment #16)

(Thanks for the heads-up re this, Nemo.)

My thing isn't really a fix! It's just a simple way for the site
administrator to be told that some thumbnail is missing, and where it should
go in the filesystem, so that they can generate it locally (i.e. Gimp or
whatnot) and upload it (via some easy interface, although I've not
considered that bit; scp is my usual).

So, not really a help. But good for memory-poor places like Dreamhost!

[Slightly off topic] How memory-poor is DreamHost?

Their shared hosting: 90M. Actually, I think the ImageMagick failures are also the processes running too long and being kissed.

Agh, *killed*. Unless DH is the mafia I guess...

This bug is rather confusing.

It seems like most of the band-aid solutions have been implemented. If we want a bug about how shuffling multi-GB files between servers on demand is a rather poor architecture, I think it would be better to start a separate bug.

Can I close this one as resolved? If not, what are the outstanding issues here that are actionable?

Yes, I think we can assume this fixed (unless another outage happens).

Bawolff claimed this task.