Images with wrong SHA1
Closed, Resolved, Public

Description

Using enwiki_p, I'm getting data like this:

mysql> SELECT DISTINCT enwiki_p.page.page_title, commonswiki_p.image.img_name
-> FROM enwiki_p.image, commonswiki_p.image, enwiki_p.categorylinks, enwiki_p.page
-> WHERE enwiki_p.image.img_sha1 = commonswiki_p.image.img_sha1
-> AND enwiki_p.page.page_title = enwiki_p.image.img_name
-> AND enwiki_p.categorylinks.cl_from = enwiki_p.page.page_id
-> AND enwiki_p.categorylinks.cl_to = 'All_non-free_media'
-> LIMIT 50;

+----------------+-----------------------------------------------+
| page_title     | img_name                                      |
+----------------+-----------------------------------------------+
| Imas360_10.jpg | +-_of_Led.svg |
| Imas360_10.jpg | 5von10.png |
| Imas360_10.jpg | Alfred_de_Musset.jpg |
| Imas360_10.jpg | Amphipodredkils.jpg |
| Imas360_10.jpg | Amphoe_6502.png |
| Imas360_10.jpg | Aschenbecher_mit_Mechanik1.jpg |
| Imas360_10.jpg | Austria_1945-55.png |
| Imas360_10.jpg | Bakaiku.JPG |
| Imas360_10.jpg | Bakweri_cocoyam_farmer_from_Cameroon.jpg |
| Imas360_10.jpg | Bartolomeu_Dias_Voyage.PNG |
| Imas360_10.jpg | Benjamin_West.jpg |
| Imas360_10.jpg | Blason-fr-en-Saint-Moreil.svg |
| Imas360_10.jpg | Brno-Nový_Lískovec_from_Petrov_(Brno).JPG |
| Imas360_10.jpg | Brännkyrka_kyrka_2005-09-04nr1.jpg |
| Imas360_10.jpg | Bundesautobahn_113_number.svg |
| Imas360_10.jpg | Clock_UT+7.png |
| Imas360_10.jpg | Coat_of_Arms_of_Antigua_and_Barbuda.gif |
| Imas360_10.jpg | Codex_egberti_-_egbert.jpg |
| Imas360_10.jpg | Cold_fingers.png |
| Imas360_10.jpg | Cross.png |
| Imas360_10.jpg | Cutty_sark_October_2003.jpg |
| Imas360_10.jpg | DNAn+1_C.svg |
| Imas360_10.jpg | DNAn+1_T.svg |
| Imas360_10.jpg | Dabrowskirynek.jpg |
| Imas360_10.jpg | Dalmenyhouse_lighter.jpg |
| Imas360_10.jpg | EtaCarinae.jpg |
| Imas360_10.jpg | Europe_location_ARM.png |
| Imas360_10.jpg | Five-pointed_star.svg |
| Imas360_10.jpg | Flag_of_Kentucky.svg |
| Imas360_10.jpg | Font_Wallace_Pt_Pasteur.jpg |
| Imas360_10.jpg | GeorgeWBush.jpg |
| Imas360_10.jpg | Gorillas_2609.jpg |
| Imas360_10.jpg | Hallingkast.jpg |
| Imas360_10.jpg | Harlekin_Columbine_Tivoli_Denmark.jpg |
| Imas360_10.jpg | Helicopter_rescue_sancy_takeoff.jpg |
| Imas360_10.jpg | Herb_Korybut.jpg |
| Imas360_10.jpg | Hymenoptera_diagonal.jpg |
| Imas360_10.jpg | IsleofWightmap_1945.jpg |
| Imas360_10.jpg | Jarzabczy_Wierch_a2.jpg |
| Imas360_10.jpg | Karte_Lage_Kanton_Uri.png |
| Imas360_10.jpg | Kit_body_scga06.png |
| Imas360_10.jpg | Kościół_Wniebowstąpienia_Poznań003.jpg |
| Imas360_10.jpg | Lilium_bulbiferum_mg-k.jpg |
| Imas360_10.jpg | Macaronesia.jpg |
| Imas360_10.jpg | Maisonmaton.jpg |
| Imas360_10.jpg | Map_of_Scotland_within_the_United_Kingdom.png |
| Imas360_10.jpg | Market_Square_Shopping_Centre_Geelong.jpg |
| Imas360_10.jpg | Mg-TableImage.svg |
| Imas360_10.jpg | Michael_Boogerd.jpg |
| Imas360_10.jpg | Monarch_caterpillar_and_egg.jpg |
+----------------+-----------------------------------------------+
50 rows in set (0.07 sec)

The hashes are identical according to the query, so this suggests that something is very broken.

I've been told that null editing the pages can fix the hash, though it's difficult to test with replag.


Version: unspecified
Severity: major

Details

Reference
bz17057

Event Timeline

bzimport raised the priority of this task from to Low. Nov 21 2014, 10:27 PM
bzimport set Reference to bz17057.
bzimport added a subscriber: Unknown Object (MLST).

Re-uploading one of these files ([[File:5von10.png]]) over itself fixes this (null edit and purge don't). This means hashing was broken but isn't anymore, so it's fixable by 'just' recalculating them (is there a maintenance script for that?).
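Recalculating a hash for comparison can be sketched as follows. This is a minimal illustration, not the maintenance script itself: it assumes `img_sha1` holds the SHA-1 of the file contents rendered in base 36 and left-padded to 31 characters, and the helper names (`sha1_hex`, `hex_to_base36`, `stored_hash_matches`) are made up for this example.

```python
import hashlib

BASE36_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 of the given bytes as 40 hex characters."""
    return hashlib.sha1(data).hexdigest()


def hex_to_base36(hex_digest: str, width: int = 31) -> str:
    """Convert a hex digest to base 36, left-padded with zeros.

    The width of 31 is an assumption about how the column is stored.
    """
    value = int(hex_digest, 16)
    digits = []
    while value:
        value, remainder = divmod(value, 36)
        digits.append(BASE36_DIGITS[remainder])
    return "".join(reversed(digits)).rjust(width, "0")


def stored_hash_matches(data: bytes, stored_base36: str) -> bool:
    """True if the stored hash is the hash of the bytes actually on disk."""
    return hex_to_base36(sha1_hex(data)) == stored_base36
```

Comparing the stored value against `hex_to_base36(sha1_hex(b""))` is a quick way to spot rows that carry the hash of an empty file.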

Adding to database cleanup tracking bug 16660.

I have been studying the hashes of Commons images.
The errors fall into clear categories.
In some images the hash of an older version got 'stuck'. Was the image sha1 not updated on re-upload?
This is much more common with the hash of the empty string. It is normal for a broken version to get the empty hash, but that hash then stays on the current version, even through several uploads, which happens less often with normal images.

There is, however, a worse case, where the metadata is right but the old version it lists is not there. A file exists in the history, but its contents are the same as the current version (or another newer version). *The old version was silently lost*.
So not only should these be searched for in backups, but we must also make sure that whatever bug produced this is fixed.
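The categories above can be expressed as a small classifier. This is only a sketch of the sorting logic, assuming each file version is known by two hex digests: the one recorded in the database and the one computed from the bytes on disk. The `Version` type, the `classify` function and the category labels are hypothetical names for this illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import List

# SHA-1 of zero bytes, i.e. the "empty hash" mentioned above.
EMPTY_SHA1 = hashlib.sha1(b"").hexdigest()


@dataclass
class Version:
    stored_sha1: str  # hash recorded in the database
    actual_sha1: str  # hash of the bytes actually found on disk


def classify(versions: List[Version], index: int) -> str:
    """Classify one version of a file; versions are ordered oldest first."""
    version = versions[index]
    if version.stored_sha1 == version.actual_sha1:
        return "ok"
    if version.stored_sha1 == EMPTY_SHA1:
        return "empty-hash"
    older = versions[:index]
    newer = versions[index + 1:]
    # The database still carries the hash of an earlier upload.
    if any(version.stored_sha1 == o.actual_sha1 for o in older):
        return "stuck-on-older-version"
    # The archived file holds the bytes of a later upload: original lost.
    if any(version.actual_sha1 == n.actual_sha1 for n in newer):
        return "content-replaced-by-newer-version"
    return "unrelated"
```

Files in the last category are the ones that need a manual look, since nothing in their own history explains the mismatch.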

Images using hash of older version:
http://commons.wikimedia.org/wiki/File:Agrigento-Domestic-Quarter-flickr.jpg Uses the hash of an older version (77d4c7822a2f1e971d8cc7cf9b4b56a97cec9649), not its own (ea8dff9bd8c3daeb33f8c40a4e8dfe2acf6db177)
http://commons.wikimedia.org/wiki/File:Agrigento-Temple-of-Concord-flickr-1.jpg ec985ed8bdf283f11e6b861d7fe0720d29142798 35d96c5d2b05f2ce81cd752ab862bf63dcd28964
http://commons.wikimedia.org/wiki/File:CD_F%C3%83%C2%A1tima.svg
http://commons.wikimedia.org/wiki/File:Cabo_Vil%C3%83%C2%A1n._Camari%C3%83%C2%B1as._Galiza.jpg
http://commons.wikimedia.org/wiki/File:Horchata_de_chufa.jpg
http://commons.wikimedia.org/wiki/File:Karina_Bacchi2.jpg
http://commons.wikimedia.org/wiki/File:Lid_Susa_Louvre_MAOS499.jpg
http://commons.wikimedia.org/wiki/File:NZ_Red_Admiral_%28Vanessa_gonerilla%29-4.jpg
http://commons.wikimedia.org/wiki/File:Vinbergs_kyrka.jpg

Hash of empty file:
This seems related to the 'hash of older version' case: all of them have an older version missing.
http://upload.wikimedia.org/wikipedia/commons/archive/5/5d/20090117210820%215von10.png
http://commons.wikimedia.org/wiki/File:Alfred_de_Musset.jpg 9e3753864bef9c18f8d75194136bfa71c440a2cd
http://commons.wikimedia.org/wiki/File:Blason-fr-en-Saint-Moreil.svg
http://commons.wikimedia.org/wiki/File:Clock_UT%2b7.png
http://commons.wikimedia.org/wiki/File:Codex_egberti_-_egbert.jpg
http://commons.wikimedia.org/wiki/File:Cross.png
http://commons.wikimedia.org/wiki/File:Cutty_sark_October_2003.jpg
http://commons.wikimedia.org/wiki/File:DNAn+1_C.svg
http://commons.wikimedia.org/wiki/File:DNAn+1_T.svg
http://commons.wikimedia.org/wiki/File:Dalmenyhouse_lighter.jpg
http://commons.wikimedia.org/wiki/File:EtaCarinae.jpg [but there's an intermediate version with right hash!]
http://commons.wikimedia.org/wiki/File:GeorgeWBush.jpg [intermediate existing versions]
http://commons.wikimedia.org/wiki/File:Hallingkast.jpg
http://commons.wikimedia.org/wiki/File:Harlekin_Columbine_Tivoli_Denmark.jpg
http://commons.wikimedia.org/wiki/File:Herb_Korybut.jpg
http://commons.wikimedia.org/wiki/File:IsleofWightmap_1945.jpg
http://commons.wikimedia.org/wiki/File:Jarzabczy_Wierch_a2.jpg
http://commons.wikimedia.org/wiki/File:Kit_body_scga06.png
http://commons.wikimedia.org/wiki/File:Lilium_bulbiferum_mg-k.jpg
http://commons.wikimedia.org/wiki/File:Maisonmaton.jpg
http://commons.wikimedia.org/wiki/File:Macaronesia.jpg
http://commons.wikimedia.org/wiki/File:Monarch_caterpillar_and_egg.jpg
http://commons.wikimedia.org/wiki/File:Montelbaanstoren_01.jpg
http://commons.wikimedia.org/wiki/File:Multi-colored_Wild_Lantana_Camara_3.JPG
http://commons.wikimedia.org/wiki/File:Ponte_de_Amizade_of_Macau.JPG
http://commons.wikimedia.org/wiki/File:Rhoen_montaner_Laubwald_mg-k.jpg
http://commons.wikimedia.org/wiki/File:Rosa_omeiensis_f._pteracantha_-_Bagatelle05.jpg
http://commons.wikimedia.org/wiki/File:STS_114_day_before_launch.jpg
http://commons.wikimedia.org/wiki/File:Space_Shuttle_Enterprise_747_takeoff.ogg
http://commons.wikimedia.org/wiki/File:Starr_Miconia_calvescens0.jpg
http://commons.wikimedia.org/wiki/File:TexasFM1950.png
http://commons.wikimedia.org/wiki/File:Voiceless_bilabial_plosive.ogg
http://commons.wikimedia.org/wiki/File:Volvo480doppel.jpg
http://commons.wikimedia.org/wiki/File:Wikinews_Brief_June_13,_2005_0500_UTC.ogg
http://commons.wikimedia.org/wiki/File:William_Phips_03.jpg
http://commons.wikimedia.org/wiki/File:Wind-power-small-scale.jpg

Images where the file storing the old version in fact contains a copy of the current one
http://upload.wikimedia.org/wikipedia/commons/archive/2/2e/20080920140529!Auguste_victoria_axb02.jpg File exists, but metadata shows us that current file is wrong (it's a copy of the smaller, current version)
http://upload.wikimedia.org/wikipedia/commons/archive/2/23/20080924123236!Banner_Porta_Westfalica.svg
http://upload.wikimedia.org/wikipedia/commons/archive/3/32/20081101222204!Beseda01.jpg
http://upload.wikimedia.org/wikipedia/commons/archive/5/59/20080111081832!Brakteat01.jpg
http://upload.wikimedia.org/wikipedia/commons/archive/8/83/20080731112730!Chirality.svg
http://upload.wikimedia.org/wikipedia/commons/archive/2/28/20071127075041!Cunningham%27s_skink444.jpg
http://upload.wikimedia.org/wikipedia/commons/archive/6/63/20080416202226!DowntownBoston.jpg
http://upload.wikimedia.org/wikipedia/commons/archive/7/79/20071126063925%21Flag_of_Italy_test.svg
http://upload.wikimedia.org/wikipedia/commons/archive/6/69/20081112225354!Florent_Gheeraert.jpg
http://upload.wikimedia.org/wikipedia/commons/archive/a/a5/20071010162601!Grabstein_Ey%C3%BCp_Bild-Giovanni_Dall%27Orto.jpg
http://upload.wikimedia.org/wikipedia/commons/archive/2/28/20080831090747!Hydro.jpg
http://upload.wikimedia.org/wikipedia/commons/archive/6/6c/20081111172337!Infirmiere_Nightingale.PNG
http://upload.wikimedia.org/wikipedia/commons/archive/3/31/20080920121617!Kecske-templom_01.jpg
http://upload.wikimedia.org/wikipedia/commons/archive/b/b1/20071124121519%21Northern_Ireland_election_seats_1997-2005-by.svg
http://upload.wikimedia.org/wikipedia/commons/archive/3/3f/20080916053014!Nuvola_Palestinian_flag.svg
http://upload.wikimedia.org/wikipedia/commons/archive/1/1f/20081111180135!PanoramaDobbiaco_b.jpg
http://upload.wikimedia.org/wikipedia/commons/archive/2/2e/20080924123129!Qcane.png
http://upload.wikimedia.org/wikipedia/commons/archive/1/12/20071124162924!Skrzypce_Adasia.JPG
http://upload.wikimedia.org/wikipedia/commons/archive/c/c6/20090116100543!Sled_dogs.jpgf
http://upload.wikimedia.org/wikipedia/commons/archive/a/af/20081102184103%21Template-question.svg
http://upload.wikimedia.org/wikipedia/commons/archive/c/cc/20071124025921!Titian-salome.jpg
http://upload.wikimedia.org/wikipedia/commons/archive/8/83/20081101223250!Trente-2.jpg
http://upload.wikimedia.org/wikipedia/commons/archive/0/0a/20080416201003!Wappen_von_Meinheim.png

http://upload.wikimedia.org/wikipedia/commons/archive/b/bf/20081203133833!Desfile_en_sello_coreano.jpg (has a copy of the now-old http://upload.wikimedia.org/wikipedia/commons/archive/b/bf/20080805193954!Desfile_en_sello_coreano.jpg)
http://upload.wikimedia.org/wikipedia/commons/archive/b/bc/20080811094312%21Escudo_de_Conil_de_la_Frontera.svg (has a copy of the now-old http://upload.wikimedia.org/wikipedia/commons/archive/b/bc/20081009101242%21Escudo_de_Conil_de_la_Frontera.svg)
http://upload.wikimedia.org/wikipedia/commons/archive/4/4f/20080731114904%21Escut_de_Vilanova_de_Segri%C3%A0.svg (has a copy of the now-old http://upload.wikimedia.org/wikipedia/commons/archive/4/4f/20080731115156%21Escut_de_Vilanova_de_Segri%C3%A0.svg)
http://upload.wikimedia.org/wikipedia/commons/archive/5/53/20080906084331!Flag_province_luxembourg.png (has a copy of... well, one of the subsequently uploaded image-warred files)
http://upload.wikimedia.org/wikipedia/commons/archive/1/19/20080305132344%21Gambia_b2.gif (has a copy of http://upload.wikimedia.org/wikipedia/commons/archive/1/19/20080305132205%21Gambia_b2.gif but the sha1 could have been calculated wrong, and Ulamm may have uploaded the same file 6 times instead of 5)
http://upload.wikimedia.org/wikipedia/commons/archive/9/90/20071129000654%21OSM_Pinelands_map.png (has a copy of http://upload.wikimedia.org/wikipedia/commons/archive/9/90/20080427091402%21OSM_Pinelands_map.png)
http://upload.wikimedia.org/wikipedia/commons/archive/7/71/20080416194942!V9938c_03.jpg (has a copy of http://upload.wikimedia.org/wikipedia/commons/archive/7/71/20081127155953!V9938c_03.jpg)

Uncategorized images with wrong hash (no relationship found)
File Real hash Hash in db
http://upload.wikimedia.org/wikipedia/commons/archive/d/d1/20071127182430!29_Calvin_Coolidge_3x4.jpg 7c782c9610e209ad1d7451c0c860c48b7155d69e 1a5bc3a057eb7afa0698ac69d733ae013884cab5 Image is 522px × 700px (like the original) and 259.58 KB. Metadata expects 299 KB and 573x764
http://upload.wikimedia.org/wikipedia/commons/archive/0/08/20080508200822!A-26K_609SOS_near_NKP_1969.jpg eb5d4e79ec2177c5b975d126c23f554ee03e9c01 572bb0970c82d8c838f4492c295a463f959673ee
http://upload.wikimedia.org/wikipedia/commons/archive/e/eb/20080919205313!Acylphosphate_rxn.svg bce9e697b369ad8e75aea2c797d4d7260a3c87dc e9ed844b62945f52896cfabfda5c191e2ad96165
http://upload.wikimedia.org/wikipedia/commons/archive/a/ad/20081112084634!Ambox_?.svg 874be09321df5f4c58dc2f644f3ec78d23146c32 98d88936056d48e53d0df15b49833de6bd03c94f
http://upload.wikimedia.org/wikipedia/commons/archive/7/70/20080416213951!BlankMap-Americas.svg 5e1d29245520f53d5e7e119322e2c27a2e6932f2 fb090aea4eccf4d805c2d96734bb6ab43946f22f
http://upload.wikimedia.org/wikipedia/commons/archive/4/45/20080528015643!Carnotaurus_DB_2.jpg 0250da19d2b0f443eebebac7df7dc0cdfbad15fc 5f0cd1d12e81a236f4667cf1ad79e51b3cd9bb0e
http://upload.wikimedia.org/wikipedia/commons/archive/4/45/20080527144757!Carnotaurus_DB_2.jpg 7cd01e305f28b213ade79c959207cf61a8ae1988 58292d741e728a4c2e91a39bb34e8ff0e86839f2
http://upload.wikimedia.org/wikipedia/commons/archive/2/20/20081112131447!Greenereyes.JPG e4e3defc609835267976d0ecfdeab684f8ff21f1 681ecfb07bd82e6c2d1dd154da9906b28ff1e47a
http://upload.wikimedia.org/wikipedia/commons/4/49/Hong_Kong_Science_Park_1.JPG 9c711494df2db9c7712de72414f47017a9a95da9 577b8e5008abb1885fbe3714f05aeeb66eb78559
http://upload.wikimedia.org/wikipedia/commons/archive/d/d8/20090112205355%21Old_town_zamo%C5%9B%C4%87_plan.png b2373b572435aa7a5d80ff68efdbc460b7202af5 2e63fa192711b72ef96889c5f2fec18aef7d01d3
http://upload.wikimedia.org/wikipedia/commons/0/05/Rusta_J%C3%B6nk%C3%B6ping.jpg 6d2504f4812abdbc9710084cf416b11a8306f7fa ca8b228aad26e078ed6e8d6a1ca200e913b18787
http://upload.wikimedia.org/wikipedia/commons/archive/9/9b/20080228002419!Tom_Savini_02.JPG e281efce6e07541a1bbff678fa6a69ba659ff585 9f58162ec5f9480fa09b0f27204ccecf8848dc58 It's a different image than expected, metadata lists 124×164 28KB but it's 204x262 72.79KB (also not matching the other uploaded version, it's a different crop)
http://upload.wikimedia.org/wikipedia/commons/archive/8/83/20071018115142!Wednesbury_Canal_Map_SO99SE.svg bbefc73aba02560f8fd809b7ac6dc77ec2d54cb9 a32fa8163ca5adbb8d2c77cf25f0da031aec3067 File size is 44144, metadata says 46418 (but doesn't look truncated).
http://upload.wikimedia.org/wikipedia/commons/0/0a/White_Knight_Two.png b3d712f471338f330a4fc618874232b2cd4498a1 1225ffb95b7260fc8c5afadfe0adf7af6919023e

Other:
http://upload.wikimedia.org/wikipedia/commons/archive/b/ba/20081112133435%21Maltipoo_hen%3F.jpg
Result is html saying "404 Wikimedia page not found:" but header is "HTTP/1.0 200 OK Content-Type: image/jpeg" (should be purged)
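A mismatch like this one, where the header claims an image but the body is an HTML error page, can be detected by checking the first bytes of the body. The sketch below is an assumption-laden illustration: `body_matches_content_type` and the `MAGIC` table are hypothetical, and only a few well-known file signatures are covered.

```python
from typing import Optional

# Leading byte signatures for a few common image types.
MAGIC = {
    "image/jpeg": (b"\xff\xd8\xff",),
    "image/png": (b"\x89PNG\r\n\x1a\n",),
    "image/gif": (b"GIF87a", b"GIF89a"),
}


def body_matches_content_type(content_type: str, body: bytes) -> Optional[bool]:
    """Check whether the body starts with a signature for the declared type.

    Returns None when the declared type is not in the table.
    """
    # Drop parameters such as "; charset=..." and normalise case.
    media_type = content_type.split(";")[0].strip().lower()
    signatures = MAGIC.get(media_type)
    if signatures is None:
        return None
    return body.startswith(signatures)
```

A cached response that returns `False` here is a candidate for purging.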

(In reply to comment #2)

> Adding to database cleanup tracking bug 16660.

Can populateSha1 be run again in the meantime?

populateSha1 wouldn't fix anything, since it only works on files which don't have hash in the db. These files do have a hash, although it's wrong.

Why not just make ?action=purge re-calculate the hash? Is it too expensive or something?

I don't think so. It's simply that the hash wasn't expected to be wrong.
But note that sometimes the wrong hash is on an old image version. So you
would need to iterate over all versions of the image, recalculating each hash
(luckily there are usually few image versions).
I would prefer seeing an API module for doing purges.
Or simply seeing the sysadmins purge those entries.

(In reply to comment #5)

> populateSha1 wouldn't fix anything, since it only works on files which don't
> have hash in the db. These files do have a hash, although it's wrong.

Adding an overwrite mode sounds trivial.
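The difference between the existing behaviour and an overwrite mode can be sketched as follows. This is not the populateSha1 script: the in-memory `rows`, the `read_file` callback and the `force` parameter are all hypothetical stand-ins chosen for this example.

```python
import hashlib
from typing import Callable, Dict, List


def repopulate_sha1(
    rows: List[Dict[str, str]],
    read_file: Callable[[str], bytes],
    force: bool = False,
) -> int:
    """Fill in or correct the 'sha1' field of each row; return rows changed.

    Without force, rows that already have a hash are skipped, so a wrong
    hash stays wrong. With force, every row is recomputed from the file.
    """
    updated = 0
    for row in rows:
        if row.get("sha1") and not force:
            continue  # already has a hash: default mode trusts it
        actual = hashlib.sha1(read_file(row["name"])).hexdigest()
        if row.get("sha1") != actual:
            row["sha1"] = actual
            updated += 1
    return updated
```

In the default mode a row with a wrong but non-empty hash is left alone, which is the limitation described in comment #5.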

(In reply to comment #8)

> (In reply to comment #5)
>
> > populateSha1 wouldn't fix anything, since it only works on files which don't
> > have hash in the db. These files do have a hash, although it's wrong.
>
> Adding an overwrite mode sounds trivial.

Running a maintenance script every time a particular image has a hash issue is impractical. There should be a way to re-generate the hash without requiring a re-upload or command-line access. Regardless of whether the maintenance script is adjusted (which it probably should be).

(In reply to comment #9)

> (In reply to comment #8)
>
> > (In reply to comment #5)
> >
> > > populateSha1 wouldn't fix anything, since it only works on files which don't
> > > have hash in the db. These files do have a hash, although it's wrong.
> >
> > Adding an overwrite mode sounds trivial.
>
> Running a maintenance script every time a particular image has a hash issue is
> impractical. There should be a way to re-generate the hash without requiring a
> re-upload or command-line access. Regardless of whether the maintenance script
> is adjusted (which it probably should be).

Obviously.

But we can use it to retroactively fix the numerous wrong values, once the reason for the keys getting stuck is found. A sha-1 purge link shouldn't be needed unless something is just flat broken... users shouldn't be expected to deal with that. It could be a temporary stop-gap solution if all else fails though...

We can now fix this for individual broken cases as of r54328. Underlying cause of why they're wrong might need fixing still?

(In reply to comment #11)

We can now fix this for individual broken cases as of r54328. Underlying cause
of why they're wrong might need fixing still?

More script updates in r112736.

A race condition involving a lack of locking was fixed (which previously allowed mixed up metadata for two rows).
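The thread does not describe the actual fix, so the following is only a general illustration of why locking matters here. It assumes a row's hash and size are written in separate steps, so two concurrent uploads of the same file could otherwise interleave and leave a hash from one upload next to the size from another. The `ImageRows` class and its methods are hypothetical.

```python
import hashlib
import threading
from typing import Dict


class ImageRows:
    """In-memory stand-in for image rows, guarded by one lock per file."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, object]] = {}
        self._locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, name: str) -> threading.Lock:
        with self._guard:
            return self._locks.setdefault(name, threading.Lock())

    def record_upload(self, name: str, data: bytes) -> None:
        # Both fields are written while holding the file's lock, so no
        # other upload of the same file can slip in between the writes.
        with self._lock_for(name):
            row = self._rows.setdefault(name, {})
            row["sha1"] = hashlib.sha1(data).hexdigest()
            row["size"] = len(data)

    def get(self, name: str) -> Dict[str, object]:
        with self._lock_for(name):
            return dict(self._rows[name])
```

Whichever upload finishes last wins, but the stored hash and size always describe the same bytes.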

*** Bug 49841 has been marked as a duplicate of this bug. ***

*** Bug 17070 has been marked as a duplicate of this bug. ***

A script was run to fix all of these image/oldimage rows (completed April 29).