Generate thumbnails based on buckets
Closed, ResolvedPublic

Description

The idea is to offer an option to generate thumbnails in a chain, i.e. a given thumbnail would be generated from a bigger thumbnail rather than from the original whenever possible. This should greatly improve performance for large files, and if good bucket values are picked, the visual impact should be unnoticeable.

I verified that the visual impact would be minimal with a power-of-2 progression with chaining up to 5 thumbnails by running an informal survey which I invited developers and commons users to participate in.
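
To make the idea concrete, here is a minimal sketch of how a bucket could be chosen as the scaling source. It is an illustration only, not the code from the patch: the function name `pick_bucket` and its parameters are hypothetical, and the selection rule (smallest bucket that is smaller than the original and larger than the target by more than a minimum distance) is an assumption.

```python
from typing import Optional, Sequence


def pick_bucket(target_width: int, original_width: int,
                buckets: Sequence[int], min_distance: int = 0) -> Optional[int]:
    """Return the bucket width to use as the scaling source, or None.

    None means "no suitable bucket, scale from the original".
    """
    for bucket in sorted(buckets):
        # A bucket as large as the original saves no work.
        if bucket >= original_width:
            return None
        # Only use a bucket that is clearly larger than the target.
        if bucket - min_distance > target_width:
            return bucket
    return None


# Power-of-2 progression, as in the description.
buckets = [256, 512, 1024, 2048, 4096]
print(pick_bucket(800, 1280, buckets, min_distance=32))   # 1024
print(pick_bucket(1000, 1280, buckets, min_distance=32))  # None
```

With these assumed rules, an 800px thumbnail of a 1280px original would be scaled from a 1024px intermediate, while a 1000px thumbnail would still be scaled from the original because 1024 is too close to the target.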


Version: unspecified
Severity: enhancement
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=67698

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 3:34 AM
bzimport set Reference to bz67525.
bzimport added a subscriber: Unknown Object (MLST).

Change 135008 had a related patch set uploaded by Gilles:
Generate thumbnails based on buckets

https://gerrit.wikimedia.org/r/135008

Change 135008 merged by jenkins-bot:
Generate thumbnails based on buckets

https://gerrit.wikimedia.org/r/135008

Change 145132 had a related patch set uploaded by Gergő Tisza:
Add thumbnail buckets for beta sites

https://gerrit.wikimedia.org/r/145132

Change 145132 merged by jenkins-bot:
Use reference thumbnails for JPEG/PNG thumbnailing on beta sites

https://gerrit.wikimedia.org/r/145132

Doesn't seem to be working on beta.

Steps taken to verify:

  1. open http://upload.beta.wmflabs.org/wikipedia/en/thumb/4/4d/Snowman.JPG/1000px-Snowman.JPG in browser
  2. ssh (via the labs bastion) to deployment-upload
  3. ls /data/project/upload7/wikipedia/en/thumb/4/4d/Snowman.JPG/

Expected result: 1000px-Snowman.JPG and 2048px-Snowman.JPG should be present
Actual result: only 1000px-Snowman.JPG is present

Same for deployment-cache-upload02.

I took the server names from https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/Overview which is very outdated, and guessed the directory path from apache and puppet config files, so I might have gotten something wrong. However, the requested image size does appear, only the bucket sizes are missing, and those should be in the same directory, so it seems something is not quite working there.

deployment-upload has a thumb script which looks like it was forked off thumb.php several years ago, and it forwards to deployment-cache-text02, so maybe that is the box which actually acts as a scaler. The X-Wikimedia-Thumb header on the generated image also points there. Still, /data/project/ is a network share, so it should not matter where I look at it from.

I verified that the visual impact would be minimal with a power-of-2
progression with chaining up to 5 thumbnails by running an informal survey
which I invited developers and commons users to participate in.

What sort of pictures were on the survey? What settings were used, etc?

I just tested the change (using the settings currently on the beta cluster). On the first image I tried (a screenshot, which might not be representative of the average content of a PNG file due to the large amount of small text, but it was something already on my test wiki) the quality was noticeably lower with bucketing, although possibly still in the acceptable range, and things got even worse with bucketing + VIPS.


(In reply to Tisza Gergő from comment #5)

Doesn't seem to be working on beta.

Another way to test:

Run http://upload.beta.wmflabs.org/wikipedia/labs/thumb/8/8b/Bn.beta.wmflabs.org.PNG/310px-Bn.beta.wmflabs.org.PNG through exiftool, you get:

[..]
Thumb Imageheight : 1024
Thumb Image Width : 1280
Thumb URI : file:///data/project/upload7/wikipedia/labs/8/8b/Bn.beta.wmflabs.org.PNG
[..]

This would be different if bucketing were working.
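
The check above could be scripted roughly as follows. This is only a sketch under assumptions: it takes the exiftool text output as a string, assumes the "Thumb Image Width" field records the width of the file the thumbnail was scaled from, and uses a hypothetical bucket list (the real beta configuration is not quoted in this thread).

```python
def parse_exiftool(text: str) -> dict:
    """Parse 'Key : Value' lines from exiftool's text output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def source_is_bucket(fields: dict, buckets) -> bool:
    """True if the recorded source width matches one of the bucket widths."""
    return int(fields["Thumb Image Width"]) in set(buckets)


sample = """\
Thumb Image Width : 1280
Thumb URI : file:///data/project/upload7/wikipedia/labs/8/8b/Bn.beta.wmflabs.org.PNG
"""

fields = parse_exiftool(sample)
# Hypothetical bucket list, for illustration only.
print(source_is_bucket(fields, [256, 512, 1024, 2048, 4096]))  # False
```

A result of False here means the thumbnail's recorded source does not match any bucket width, which is what one would expect if it was scaled straight from the original.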

What sort of pictures were on the survey? What settings were used, etc?

One or two control images that had no specific qualities as well as several images with a lot of edges: https://www.surveymonkey.com/s/F6CGPDJ The reason I picked images with a lot of edges is that they're generally the images that gather the most complaints when we tweak thumbnailing.

The same logic and parameters as the patch were applied, with ImageMagick as the scaler. Each image shown in the survey had been through 3 to 5 chaining steps. There's no denying that there is a proven quality loss on a technical level, just by virtue of resampling, but the survey results were clear about the fact that on average the chained ones were slightly preferred. Presumably because of the extra sharpening (most chaining steps meet the criteria of the sharpening check in the code). The old code sharpens once from the original, the new code may sharpen once for each chain step.
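
The difference in sharpening passes can be illustrated with a small sketch. It assumes a simplified form of the sharpening check: a step is sharpened when the output width divided by the source width falls below a threshold. The 0.85 default is modeled on MediaWiki's `$wgSharpenReductionThreshold` as best I recall it and should be treated as an assumption, as should the function name.

```python
from typing import Sequence


def sharpening_passes(widths: Sequence[int], threshold: float = 0.85) -> int:
    """Count the steps in a downscaling chain that would trigger sharpening.

    widths lists the width at each stage, largest first: the original,
    any intermediate thumbnails, then the final thumbnail.
    """
    passes = 0
    for src, dst in zip(widths, widths[1:]):
        # Sharpen only when the reduction is strong enough.
        if dst / src < threshold:
            passes += 1
    return passes


print(sharpening_passes([4096, 800]))              # 1: old behaviour, one step
print(sharpening_passes([4096, 2048, 1024, 800]))  # 3: one per chain step
```

Under these assumptions a three-step chain is sharpened three times where direct scaling is sharpened once, which is consistent with the explanation offered for the survey results.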

The reason why thumbs are sharpened in the first place - a common practice on large websites - is that people find sharpened thumbnails to look better even when mathematically speaking they aren't (on the contrary, more original content is getting lost). It's probably because the conserved edges help with the way we recognize shapes and detail. I.e. our brain will have an easier time compensating for the loss of detail if edges are stronger, even if artificially conserving the edges actually makes more original detail disappear.

And so, with chaining the edges are conserved a bit better, which is probably why they're favored in the survey results. That's the only theory I have about the counter-intuitive results. When I ran it, I was expecting to see that the chained ones would be disfavored, in which case it would have been a balancing act between what people tolerate visually and server resources.

That being said, maybe if we cranked up the sharpening value on the default code, people would prefer that to the chained thumbnails. I didn't try to run the survey another time to find out. But the goal here is to save server resources while not upsetting people, not to improve the thumbnail quality/popularity. It seems like a reasonable balance to me to launch a recipe that people don't noticeably dislike on average compared to the status quo.

I don't see this change as a big risk, anyway, because if a vocal minority campaigns against it, it's easy to revert and purge the images. Plan B is to perform the same chaining but to store each bucket size in a lossless format. We'd have the same speed gains, but the vastly increased storage needs mean that we can't do that while we're still storing thumbnails in Swift.

I think the surveys only used JPEG images, and Brian said he tested with a PNG, so maybe this is another counterintuitive result where the perceived quality loss is greater for PNG images (which have more sharp details than JPEGs)?

Ah yes, I'll look into PNG closer. If PNG resizing doesn't do any sharpening, that might be the explanation.

Gergo, I don't have access to deployment-upload, could you give it another try, now that the fixed beta config is out?

Brian, I just wanted to check if the quality issues you encountered were due to not having defined the minimum distance, or if you have repro steps with sample images that I could use?

Created attachment 16331
file that scaled badly (unscaled original)

I used:

$wgThumbnailBuckets = array( 256, 512, 1024, 2048, 4096 );
$wgThumbnailMinimumBucketDistance = 32;

as my settings (and tried both with and without VIPS enabled. Results were much worse with VIPS, but they were bad with ImageMagick too).

The file I was testing (original, unscaled) is attached

Attached:

TestUpload2.png (800×1 px, 227 KB)

Could you give me the $wgVips* configuration settings you were using? This way I can generate the same set of thumbnails with VIPS.

For testing VIPS I was using:

$wgVipsOptions = array(
        // Use VIPS for all PNG files
        array(
                'conditions' => array(
                        'mimeType' => 'image/png',
                ),
        ),
);

(The main difference from production is that production has a minsize parameter.)

Created attachment 16513
My test screenshot file, 800px normal (no bucket) scaling with image magick

800px output on normal scaling (bucketing disabled, using image magick). Looks very nice.

Attached:

normal-scaling.png (500×800 px, 222 KB)

Created attachment 16514
800px chained scaling using image magick

Using chaining (note: only 1 intermediate thumbnail. The original is 1280px, then it makes an intermediate of 1024px, and then the target of 800px. The difference is possibly bigger if several buckets are involved).

Image is more "fuzzy", and small text in image is harder to read. Definitely noticeable if doing side by side comparison. However, quality may still be acceptable.

Attached:

im-chained-scaling.png (500×800 px, 207 KB)

Created attachment 16515
800px chained, scaling using vips

Using chained with vips (1280->1024->800).

Note that on production, VIPS is not used for small images, so in practice this might not be as much of an issue, since VIPS would only be used on the very biggest bucket size. Maybe. This possibly needs more experimentation.

Text in the image is significantly harder to read, and the quality of the image is noticeably lower.

Attached:

vips-chained-scaling.png (500×800 px, 144 KB)

That's very odd, the chaining-free one I had was as fuzzy as the chained one, which is why I didn't notice the difference. I'll try to figure out what happened there. This might be related to the sharpening the code does with IM. As for VIPS, if it's not doing any sharpening, that would explain it.

I can finally repro and I know why it's happening: currently MediaWiki doesn't do any sharpening for PNGs. It does for JPGs, and the extra sharpening passes in the chaining compensate for the quality loss in terms of perceived quality. I had been testing mostly with JPGs during the development of this feature, which is why I missed the fact that PNGs are different with regard to sharpening.

In order to preserve a decent amount of perceived quality, PNGs need to be sharpened when chained. I'll experiment locally and come up with a follow-up patch, but since the impact on perceived quality has only been tested on JPGs, I think that releasing this to production should be restricted to JPGs at first. I'll treat the release for PNGs separately and I'll run another user test for perceived quality on them.

Change 162279 had a related patch set uploaded by Gilles:
Disable thumbnail chaining support for PNGs

https://gerrit.wikimedia.org/r/162279

Change 162279 merged by jenkins-bot:
Disable thumbnail chaining support for PNGs

https://gerrit.wikimedia.org/r/162279

Change 170747 had a related patch set uploaded by Gilles:
Enable JPG thumbnail chaining on beta

https://gerrit.wikimedia.org/r/170747

Change 170747 merged by jenkins-bot:
Enable JPG thumbnail chaining on beta

https://gerrit.wikimedia.org/r/170747

Change 172254 had a related patch set uploaded by Gilles:
Enable JPG thumbnail chaining on all wikis except commons

https://gerrit.wikimedia.org/r/172254

Change 172254 merged by jenkins-bot:
Enable JPG thumbnail chaining on all wikis except commons

https://gerrit.wikimedia.org/r/172254

Change 172969 had a related patch set uploaded by Gilles:
Don't re-apply EXIF rotation to chained thumbnails

https://gerrit.wikimedia.org/r/172969

Nemo: because of https://bugzilla.wikimedia.org/show_bug.cgi?id=73352 which https://gerrit.wikimedia.org/r/172969 aims to fix.

Not sure how long the review process will take for that one, so it's best to not generate thumbnails with the wrong orientations in production in the meantime.
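
For readers unfamiliar with the orientation problem, the intended behaviour can be sketched as below. This is a hypothetical illustration rather than the code in the Gerrit change: it assumes that a bucket thumbnail is already stored upright, so the rotation taken from the original's EXIF data must not be applied a second time when scaling from that bucket.

```python
def rotation_to_apply(exif_rotation: int, source_is_bucket: bool) -> int:
    """Degrees of rotation the scaler should apply to its source image."""
    # Assumption: a bucket thumbnail was already rotated when it was
    # rendered from the original, so rotating again would be wrong.
    if source_is_bucket:
        return 0
    return exif_rotation % 360


print(rotation_to_apply(90, source_is_bucket=False))  # 90
print(rotation_to_apply(90, source_is_bucket=True))   # 0
```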

Change 172969 merged by jenkins-bot:
Don't re-apply EXIF rotation to chained thumbnails

https://gerrit.wikimedia.org/r/172969

Change 174453 had a related patch set uploaded by Gilles:
Don't re-apply EXIF rotation to chained thumbnails

https://gerrit.wikimedia.org/r/174453

Change 174453 merged by jenkins-bot:
Don't re-apply EXIF rotation to chained thumbnails

https://gerrit.wikimedia.org/r/174453

Change 176912 had a related patch set uploaded (by Gilles):
Enable JPG thumbnail chaining on Commons

https://gerrit.wikimedia.org/r/176912

Patch-For-Review

Change 176912 merged by jenkins-bot:
Enable JPG thumbnail chaining on Commons

https://gerrit.wikimedia.org/r/176912

This causes T76487 but that warning looks harmless.

So far so good, I'll keep an eye out for trouble. Doing this for other formats than JPGs will be the subject of other tasks.