Corrupt images should be detected and reported - by humans or automatic script.
Closed, Resolved (Public)

Description

We have reports on Wikimedia Commons that thumbnails can't be generated because the images are corrupt.


[ Analysis ]

Well, the problem is that these images really do seem to be corrupted.

e.g. [[Commons:File:Augustinusbishop.gif]]

/home/dereckson ] fetch https://upload.wikimedia.org/wikipedia/commons/7/7b/Augustinusbishop.gif
Augustinusbishop.gif 100% of 346 kB 889 kBps
/home/dereckson ] mogrify Augustinusbishop.gif -resize 200x300
mogrify: corrupt image `Augustinusbishop.gif' @ error/gif.c/ReadGIFImage/1348.


[ A heavy-to-maintain solution ]

We should have a script that periodically verifies our pictures and reports any corrupted images it detects.

PIL can detect such images with the verify method.

Here is a sample script (it works for any other format supported by PIL too):
https://bitbucket.org/denilsonsa/small_scripts/src/tip/jpeg_corrupt.py
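
For illustration, here is a minimal sketch of the verify-based check, not the linked script itself. It assumes Pillow is installed; the names pillow_error and find_corrupt are hypothetical. Note that verify() does not decode pixel data and how much it checks depends on the format plugin, so some damage may only show up when the image is fully loaded.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


def pillow_error(path: Path) -> Optional[str]:
    """Return an error message if Pillow rejects the file, else None."""
    # Imported lazily so the rest of the module works without Pillow installed.
    from PIL import Image

    try:
        with Image.open(path) as img:
            img.verify()  # integrity check without decoding the pixel data
    except Exception as exc:  # Pillow raises several exception types here
        return f"{type(exc).__name__}: {exc}"
    return None


def find_corrupt(
    paths: Iterable[Path],
    check: Callable[[Path], Optional[str]] = pillow_error,
) -> Dict[Path, str]:
    """Map each file that fails the check to its error message."""
    report: Dict[Path, str] = {}
    for path in paths:
        error = check(path)
        if error is not None:
            report[path] = error
    return report
```

Usage would be along the lines of find_corrupt(Path("images").glob("*.gif")), where the directory name is only an example. The check parameter allows a stricter test (e.g. one that fully loads the image) to be swapped in later.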

The infrastructure we need should be optimized to check at least 100 000 pictures per day (about 1.16 images per second), so that if it runs continuously, every picture is verified once every 150 days.
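
The arithmetic behind these figures can be checked quickly. The total of about 15 million files is not stated in this report; it is only what the 100 000 per day and 150 days figures imply when multiplied together. The function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def images_per_second(images_per_day: float) -> float:
    """Average rate needed to process a daily quota when running continuously."""
    return images_per_day / SECONDS_PER_DAY


def days_for_full_pass(total_images: float, images_per_day: float) -> float:
    """Days needed to check every image once at the given daily quota."""
    return total_images / images_per_day


daily_quota = 100_000
implied_total = daily_quota * 150  # assumption: derived from the figures above

print(f"{images_per_second(daily_quota):.2f} images per second")  # 1.16 images per second
print(f"{implied_total:,} files implied")  # 15,000,000 files implied
print(f"{days_for_full_pass(implied_total, daily_quota):.0f} days per pass")  # 150 days per pass
```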


It should be evaluated whether this is needed or whether human reporting would work better.

We also have to involve the Wikimedia Commons community to manually fix the corrupted pictures.


Version: unspecified
Severity: normal

Details

Reference
bz41380

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:14 AM
bzimport set Reference to bz41380.
bzimport added a subscriber: Unknown Object (MLST).

Note that a clever solution could be to intercept the error when thumbnail generation fails and generate a list of such files.
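
A rough sketch of that idea, in Python purely for illustration (this is not MediaWiki code): wrap whatever renders the thumbnail, and on failure record the file name instead of only surfacing the error. FailureLog, render_or_record and the render callable are all hypothetical names.

```python
from typing import Callable, List, Optional


class FailureLog:
    """Collects the names of files whose thumbnail generation failed."""

    def __init__(self) -> None:
        self.entries: List[str] = []

    def record(self, name: str, error: Exception) -> None:
        # One tab-separated line per failure: file name, then the error.
        self.entries.append(f"{name}\t{type(error).__name__}: {error}")


def render_or_record(
    name: str,
    render: Callable[[str], bytes],
    log: FailureLog,
) -> Optional[bytes]:
    """Run the renderer; on failure, note the file and return None."""
    try:
        return render(name)
    except Exception as error:  # assumption: a real scaler would raise something narrower
        log.record(name, error)
        return None
```

The collected entries could then be published as a list for people to review, which avoids scanning every file periodically.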

CC'ing Aaron.
Aaron, could you take a look at this maintenance script and provide some feedback? Do you think that this could be incorporated in the long run?

AntiCompositeNumber assigned this task to TheSandDoctor.

@TheSandDoctor has a bot running to detect and tag corrupt images using PIL/Pillow.