Page MenuHomePhabricator

10s miss service time for thumbnails requested via ms6.esams
Closed, ResolvedPublic

Description

Author: saibotrash

Description:
Thumbnails take ages (10 seconds) to be rendered after a purge.

Testcase 1:
http://commons.wikimedia.org/w/index.php?title=File:AaatestSonnepalmenstrand90.jpg&action=purge

wget http://upload.wikimedia.org/wikipedia/commons/thumb/0/01/AaatestSonnepalmenstrand90.jpg/303px-AaatestSonnepalmenstrand90.jpg
--2011-11-15 15:58:20-- http://upload.wikimedia.org/wikipedia/commons/thumb/0/01/AaatestSonnepalmenstrand90.jpg/303px-AaatestSonnepalmenstrand90.jpg
Auflösen des Hostnamen »upload.wikimedia.org«.... 91.198.174.234
Verbindungsaufbau zu upload.wikimedia.org|91.198.174.234|:80... verbunden.
HTTP Anforderung gesendet, warte auf Antwort... 200 OK
Länge: nicht spezifiziert [image/jpeg]
In »303px-AaatestSonnepalmenstrand90.jpg« speichern.

[  <=>                                  ] 16.695      68,7K/s   in 0,2s

2011-11-15 15:58:31 (68,7 KB/s) - »303px-AaatestSonnepalmenstrand90.jpg« gespeichert [16695]

Download took 0.2 seconds but wget was waiting for the server to answer for 10 seconds.

Testcase 2:
Call http://commons.wikimedia.org/w/index.php?title=File:Turm_am_Thurn-und-Taxis-Platz_2010-02-08.jpg&action=purge

Then:
wget http://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Turm_am_Thurn-und-Taxis-Platz_2010-02-08.jpg/308px-Turm_am_Thurn-und-Taxis-Platz_2010-02-08.jpg
--2011-11-15 16:02:04-- http://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Turm_am_Thurn-und-Taxis-Platz_2010-02-08.jpg/308px-Turm_am_Thurn-und-Taxis-Platz_2010-02-08.jpg
Auflösen des Hostnamen »upload.wikimedia.org«.... 91.198.174.234
Verbindungsaufbau zu upload.wikimedia.org|91.198.174.234|:80... verbunden.
HTTP Anforderung gesendet, warte auf Antwort... 200 OK
Länge: nicht spezifiziert [image/jpeg]
In »308px-Turm_am_Thurn-und-Taxis-Platz_2010-02-08.jpg« speichern.

[  <=>                                  ] 35.224      98,1K/s   in 0,4s

2011-11-15 16:02:15 (98,1 KB/s) - »308px-Turm_am_Thurn-und-Taxis-Platz_2010-02-08.jpg« gespeichert [35224]

→ 9 seconds of waiting at step "HTTP Anforderung gesendet, warte auf Antwort..." (that is after(!) DNS lookup)

A subsequent query of the same image is fast:
wget http://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Turm_am_Thurn-und-Taxis-Platz_2010-02-08.jpg/308px-Turm_am_Thurn-und-Taxis-Platz_2010-02-08.jpg
--2011-11-15 16:03:13-- http://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Turm_am_Thurn-und-Taxis-Platz_2010-02-08.jpg/308px-Turm_am_Thurn-und-Taxis-Platz_2010-02-08.jpg
Auflösen des Hostnamen »upload.wikimedia.org«.... 91.198.174.234
Verbindungsaufbau zu upload.wikimedia.org|91.198.174.234|:80... verbunden.
HTTP Anforderung gesendet, warte auf Antwort... 200 OK
Länge: 35224 (34K) [image/jpeg]
In »308px-Turm_am_Thurn-und-Taxis-Platz_2010-02-08.jpg.1« speichern.

100%[======================================>] 35.224 --.-K/s in 0,1s

2011-11-15 16:03:13 (236 KB/s) - »308px-Turm_am_Thurn-und-Taxis-Platz_2010-02-08.jpg.1« gespeichert [35224/35224]

Below a second.

I get similar timing when watching the net traffic after purge with Firebug in Firefox.


Version: unspecified
Severity: minor
URL: http://upload.wikimedia.org/wikipedia/commons/0/01/AaatestSonnepalmenstrand90.jpg

Details

Reference
bz32432

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 12:05 AM
bzimport set Reference to bz32432.
bzimport added a subscriber: Unknown Object (MLST).

TIP: Under unix, you can use LC_MESSAGES="C" before your command to have the output using the default C language (aka english).

We probably are lacking some CPU power for generating thumbnails. Is this an issue with those two images or is that a general issue?

Please note we have a project to replace ImageMagick (software used to render thumbnails) by VIPS which will give us better performances. See:
http://www.gossamer-threads.com/lists/wiki/wikitech/257118
and bug 28135

saibotrash wrote:

(In reply to comment #2)

TIP: Under unix, you can use LC_MESSAGES="C" before your command to have the
output using the default C language (aka english).

off topic: Thanks, will try to remember :-)

We probably are lacking some CPU power for generating thumbnails. Is this an
issue with those two images or is that a general issue?

General issue. Those were just to demonstrate.

I would expect no longer rendering times than about 3 seconds. That is really not a computational hard task. Raw processing time is probably fractions of a second.

It is like this since weeks. I did not notice this before the 1.18 update.

It's not necessarily the thumbnail *generation* that's slow. The thumbnail *storage* server (ms5) is also known to be slow, and it wouldn't be hit in normal operation because the image would be cached by the upload Squids.

So when generating a new thumb you're hitting both the scalers and ms5, and when requesting a cached thumb you're hitting neither. That makes it hard to tell which one is the culprit from your timing data alone.

saibotrash wrote:

(In reply to comment #4)

That makes it hard to
tell which one is the culprit from your timing data alone.

Can't you reproduce it?

mr.heat wrote:

Should not be to hard to reproduce this. Currently it happens all the time for all new images or simply when creating a gallery in a size that was never rendered before. For example, go to http://toolserver.org/~daniel/WikiSense/Gallery.php?wikifam=commons.wikimedia.org and switch to 250 images per page.

I know this is not very helpful for you. What can we do to track this down? I think it's a major issue especially in conjunction with old thumbnails not updating, see bug 31680 and others.

alexander_berlin wrote:

Especially list articles with many images suffer from this problem. Sometimes it takes ages to load such a page with dozens of thumbnails. After creating all the thumbnails, the performance is OK for a few hours. But the next day the kerfuffle starts again.

Are the thumbnails being purched in between?

Bumping to highest, tagging platformeng to get someone to look at this ASAP.

I imagine this is going to be rendered obsolete by the SwiftMedia deployment and by VIPS when we get there, but there might be a short term problem. Putting on the "ops" keyword in case there's something that can be tuned in the short-term.

Marking invalid. The supplied test case works for me.

Yes, it's possible that at some point in the past, thumbnails were slow to render, but urgent site issues should be discussed on IRC while they're actually happening, not on Bugzilla 6 weeks after the fact.

Maybe at some point in the future we will have sufficient historical service time data to let us confirm or maybe even diagnose this sort of thing after the fact. But at the moment we are mostly limited to real time analysis.

(In reply to comment #6)

Should not be to hard to reproduce this. Currently it happens all the time for
all new images or simply when creating a gallery in a size that was never
rendered before. For example, go to
http://toolserver.org/~daniel/WikiSense/Gallery.php?wikifam=commons.wikimedia.org
and switch to 250 images per page.

The browser should limit concurrency to about 5, so say if it takes 2 seconds per image, then that would be 1 minute 40 to load all images. Is that what you see?

We only have 60 threads on 48 cores for the whole image scaling cluster, so obviously if you want 250 images at a time, it's going to take a while, even with unlimited client-side concurrency. If you don't want it to be slow, then don't do that.

Maybe ms5 will hit an I/O limit before we get to 60 threads, but it's clear from the profiling data that ms5 is not permanently in such a state. The average thumb.php service time measured by MediaWiki is only 156ms.

saibotrash wrote:

(In reply to comment #10)

Marking invalid. The supplied test case works for me.

Yes, it's possible that at some point in the past, thumbnails were slow to
render, but urgent site issues should be discussed on IRC while they're
actually happening, not on Bugzilla 6 weeks after the fact.

One time there is no one on IRC who responds in wikimedia-tech, next time I get told to file it to bugzilla... *grr* Don't tell my bug report is INVALID!

Maybe at some point in the future we will have sufficient historical service
time data to let us confirm or maybe even diagnose this sort of thing after the
fact. But at the moment we are mostly limited to real time analysis.

(In reply to comment #6)

Should not be to hard to reproduce this. Currently it happens all the time for
all new images or simply when creating a gallery in a size that was never
rendered before. For example, go to
http://toolserver.org/~daniel/WikiSense/Gallery.php?wikifam=commons.wikimedia.org
and switch to 250 images per page.

The browser should limit concurrency to about 5, so say if it takes 2 seconds
per image, then that would be 1 minute 40 to load all images. Is that what you
see?

We only have 60 threads on 48 cores for the whole image scaling cluster, so
obviously if you want 250 images at a time, it's going to take a while, even
with unlimited client-side concurrency. If you don't want it to be slow, then
don't do that.

Maybe ms5 will hit an I/O limit before we get to 60 threads, but it's clear
from the profiling data that ms5 is not permanently in such a state. The
average thumb.php service time measured by MediaWiki is only 156ms.

I saw it also for a single image as I have clearly written above.. maybe your reading ability is INVALID?!

Testcase 1 still is like I reported it. Maybe you should try from a German IP.

Thanks for your report on IRC. I was able to track down the problem. It turned out to be the configuration of a proxy server in the Amsterdam cluster. It didn't have anything to do with thumbnail rendering or MediaWiki. Instead, nginx was waiting for a connection timeout.

saibotrash wrote:

(In reply to comment #12)

Thanks for your report on IRC. I was able to track down the problem. It turned
out to be the configuration of a proxy server in the Amsterdam cluster. It
didn't have anything to do with thumbnail rendering or MediaWiki. Instead,
nginx was waiting for a connection timeout.

Confirm. Seems to be fixed. Thanks.