
Mediawiki losing old file versions upon undeletion in MW 1.18
Closed, DeclinedPublic

Description

Since the most recent MW version upgrade, I've now had Mediawiki lose two different old versions of a file while in the process of deletion. Let me explain:

  • User uploads version A to Myfile.jpg.
  • The same user or a different user uploads version B to Myfile.jpg, overwriting the old version.
  • I delete Myfile.jpg.
  • I go to undelete version A, but when I undelete the file, version B pops up.

Looking in the file history, version B is now version A, even though the resolution information (and IIRC the sha1 information) are still different. Thus version A is forever gone.

You can see this occur at two files:

Do I need to file a bug for this? Or is there already a bug filed?


Version: 1.18.x
Severity: critical

Details

Reference
bz31792

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 11:54 PM
bzimport set Reference to bz31792.

(In reply to comment #0)

Do I need to file a bug for this? Or is there already a bug filed?

Well, you filed one. ;)

I don't think this has been reported yet.

Oops! I copy/pasted my post from the English Wikipedia village pump, and forgot to take out that part. :)

Marked as critical due to potential 'loss' of content.

This is marked as critical, highest priority, and sounds like a data loss problem and a 1.18 regression.

Should this be assigned to someone by the bugmeister perhaps?

I have done file history splits on Commons under 1.18 without any such bug showing up.
Are you sure it wasn't your browser cache playing tricks on you?

Bryan.TongMinh wrote:

Just tried this, can't reproduce.

lowering priority since we haven't been able to reproduce this yet. May remove 1.18 milestone

Old examples from IRC, but may not be something that is happening in
the current code:

<Saibo> hexmode: have three example files (may have different reasons):
<Saibo>

https://de.wikipedia.org/wiki/Wikipedia:Redaktion_Bilder/Archiv/2011/2#Alte_Bildversionen_weg.3F
→ http://commons.wikimedia.org/wiki/File:AlleeR%C3%BCgen1.jpg the two
old file versions are not available

<Saibo>

https://commons.wikimedia.org/wiki/Commons:Forum/Archiv/2011/March#Dateiversionsschwund

<Saibo> →

https://commons.wikimedia.org/wiki/File:Wappen_des_Landkreises_Donau-Ries.png
same. Someone from the wikimedia-tech channel looked for the old
versions but wasn't successful (in April 2011).

<Saibo> → (a deleted file)

https://de.wikipedia.org/wiki/Spezial:Wiederherstellen/Datei:Fliegenk%C3%A4fer01.JPG
old, original file version is gone - server delivers the
newer (smaller) file version instead

Also note that Saibo's examples are all pre-1.18

http://commons.wikimedia.org/wiki/File:AlleeR%C3%BCgen1.jpg and
https://commons.wikimedia.org/wiki/File:Wappen_des_Landkreises_Donau-Ries.png have lost revisions, but they were never deleted.

For Fliegenkäfer01.JPG, you mean that the 20070806 version is not available and you get the new one instead? They have different storage keys, so that shouldn't happen.
Similarly for deleted University.JPG and Sindh_Agriculture_University.JPG, the keys are different.

On the other hand, Kolkata_Tipu_Sultan's_Mosque3.jpg indeed has two versions with the same storage key, so one of them is lost.
I think this can be a consequence of the "image getting wrong hash" bug I reported on bug 17057#c3 (this wrong data would only produce dataloss once the file is deleted).

It doesn't happen every time; most times it doesn't happen in fact. But you will note from my example above: the SHA1 information is off, as is the pixelage argument, so you can tell I'm not just being loopy. This is in fact a bug, unless I'm badly mistaken.

(In reply to comment #11)

This is in fact a bug, unless I'm badly mistaken.

It is a bug. If it is currently happening, I need examples and, preferably, a way to reproduce this. Please contact me on IRC (I'm hexmode in #wikimedia-dev, http://webchat.freenode.net/?channels=wikimedia-dev) so that we can track this down.

Aaron is going to take a shot at reproing this one.

I recommend adding a check at deletion time, when it is storing the files under their hash. If a file with that hash already exists, verify that the filesizes match. If they don't, refuse the deletion.
That should block most dataloss, and is easy to test (manually corrupt the sha1 entry in the db).
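A rough sketch of the check described above, in Python rather than MediaWiki's PHP, and only an illustration: `archive_for_deletion`, `StorageKeyCollision` and the flat "deleted" directory layout are hypothetical names made up for this example, not MediaWiki code.

```python
import os


class StorageKeyCollision(Exception):
    """A different file already occupies the target storage key."""


def archive_for_deletion(src_path, recorded_sha1, deleted_dir):
    """Move src_path into deleted_dir under a name derived from recorded_sha1.

    recorded_sha1 stands in for the hash stored in the database, which may
    be wrong. All names here are hypothetical.
    """
    ext = os.path.splitext(src_path)[1]
    dest_path = os.path.join(deleted_dir, recorded_sha1 + ext)

    if os.path.exists(dest_path):
        # Same key already archived: the sizes must agree, otherwise the
        # recorded hash is wrong for at least one of the two files.
        if os.path.getsize(dest_path) != os.path.getsize(src_path):
            raise StorageKeyCollision(
                f"{dest_path} exists with a different size; refusing to delete"
            )
        # Same key and same size: treat the source as a duplicate.
        os.remove(src_path)
        return dest_path

    os.rename(src_path, dest_path)
    return dest_path
```

A size comparison only catches the cheap cases. Comparing the bytes (for example with `filecmp.cmp(a, b, shallow=False)`) would be stricter, at the cost of reading both files.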

@mark H.: I can't reproduce it. It's only happened a few times. Sorry.

@Platonides: It should be noted that some hashes or EXIF data are *wrong* or *corrupt* on Wikipedia. I believe it happened somewhere in the 2008/9 range; I'm unsure whether the wrong data is linked with the large number of files lost in that time period. Anyway, it's rare, but it does happen, so if a check is done, it should be done beforehand (to make sure the data is clean), and afterward (to make sure it's still clean).

Yes, we have wrong hashes (see bug 17057). If there are two different images with the same hash, one of them or both is broken, so we should at least abort and force manual intervention.

Aaron spent the better part of the day trying to repro this and look through the code, but isn't very close to solving this one. He's going to keep at it, but we're not going to let this block 1.18.

magog.the.ogre, can you describe what happened with Me1.jpg? There's nothing obviously wrong just looking at the information we see.

Dropping priority while waiting for response

Sorry about that (long response). Again, this was a page with multiple versions in history, so I performed deletion and undeletion in order to bring about a split.

You will notice that the page currently has the image which is now at http://commons.wikimedia.org/wiki/File:Jasrasr_userphoto.jpg. If you look through the deleted history, you will see that version uploaded three times by User:Jasrasr, all at 80x115 7396 bytes.

Now you will see an upload by MrBillTheThrill at exactly the same resolution and size as the most recent in the history. I am 85% sure that MrBill did not mean to upload an old version of that image, and that he didn't. My evidence is that a) it would make no sense to do so in light of this edit: http://en.wikipedia.org/w/index.php?title=Carrickfergus_Grammar_School&diff=prev&oldid=120379042, b) I would probably remember it if he had, and c) the page on English Wikipedia is mysteriously not reporting the duplicate on Commons under "File usage", which it usually does when they have the same hash. I will feel like an idiot if I'm wrong, but I don't think I am.

Hash value reported: 3710e894f0a9a2f0d9dcbfd990aea07656100461 (per http://en.wikipedia.org/w/api.php?action=query&titles=File:Me1.jpg&prop=imageinfo&iiprop=sha1|user|size&iilimit=max)
Correct hash value: 3710e894f0a9a2f0d9dcbfd990aea07656100461 (per http://en.wikipedia.org/w/api.php?action=query&titles=File:Jasrasr_userphoto.jpg&prop=imageinfo&iiprop=sha1|user|size&iilimit=max)
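For anyone who wants to repeat the two API lookups above from a script, here is a minimal Python sketch. The function names and the User-Agent string are placeholders invented for this example; `format=json` is added to the query shown above so the reply can be parsed.

```python
import json
import urllib.parse
import urllib.request


def imageinfo_url(api_base, title):
    """Build the imageinfo query used above, asking for JSON output."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "imageinfo",
        "iiprop": "sha1|user|size",
        "iilimit": "max",
        "format": "json",
    }
    # Keep '|' and ':' readable instead of percent-encoding them.
    return api_base + "?" + urllib.parse.urlencode(params, safe="|:")


def fetch_imageinfo(api_base, title):
    """Return the imageinfo entries (one per file version) for a title."""
    request = urllib.request.Request(
        imageinfo_url(api_base, title),
        headers={"User-Agent": "sha1-check-example/0.1"},  # placeholder value
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    pages = data["query"]["pages"]
    return [info for page in pages.values() for info in page.get("imageinfo", [])]
```

Calling `fetch_imageinfo("http://en.wikipedia.org/w/api.php", "File:Me1.jpg")` should give one entry per visible version, each with its reported `sha1` and `size`, which can then be compared against the file actually served.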

I imagine this problem would disappear if someone were to purge the page at English Wikipedia. I am not going to do that though because I don't want to bug up the results for everyone else to see.

I agree. Jasrasr originally uploaded it as Me1.jpg. Then the later upload wrongly got the same hash as the previous version (I don't know why, but have seen it on many files).

You can see that it's wrong by comparing the reported filesize (73 KB) with the size of the served image (7396 bytes = 7.2 KB).

Google cache provides slightly more data http://webcache.googleusercontent.com/search?q=cache:WEBIjZs8xiQJ:en.wikipedia.org/wiki/File:Me1.jpg

Date/Time Dimensions User Comment
20:51, 1 March 2008 80 × 115 (7 KB) Jasrasr (talk | contribs) (Reverted to version as of 05:28, 5 July 2006)

20:50, 1 March 2008 80 × 115 (7 KB) Jasrasr (talk | contribs) (Reverted to version as of 05:28, 5 July 2006)

01:09, 5 April 2007 600 × 450 (73 KB) MrBillTheThrill (talk | contribs) (Gareth Buchanan of Year 13 Thornfield performs at Pop Act 2005)

11:02, 21 September 2006 640 × 427 (265 KB) Sajidn (talk | contribs)

05:28, 5 July 2006 80 × 115 (7 KB) Jasrasr (talk | contribs) (Me)

21:21, 20 April 2006 114 × 152 (3 KB) Jdib84 (talk | contribs) (I took this picture myself for my own personal page.)

We can see that the upload of MrBillTheThrill was 600 × 450 (73 KB)

A few more data points:
hex sha1: 3710e894f0a9a2f0d9dcbfd990aea07656100461
base36 sha1: 6fkaqblfccxi5egkgxcypzthf9d89r5
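The two notations are the same 160-bit number written in base 16 and in base 36. A small Python sketch for converting between them, assuming the base-36 form is zero-padded to 31 characters and the hex form to 40 (the helper names are made up for this example):

```python
import string

DIGITS = string.digits + string.ascii_lowercase  # 0-9 followed by a-z


def hex_to_base36(hex_sha1, width=31):
    """Convert a hex SHA-1 to base 36, left-padded with zeros to width."""
    n = int(hex_sha1, 16)
    out = ""
    while n:
        n, remainder = divmod(n, 36)
        out = DIGITS[remainder] + out
    return out.rjust(width, "0")


def base36_to_hex(b36_sha1, width=40):
    """Convert a base-36 SHA-1 to hex, left-padded with zeros to width."""
    return format(int(b36_sha1, 36), "x").rjust(width, "0")
```

The padding matters: a conversion that skips it silently drops leading zeros, so a hash can come out one or more characters short.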

Old image entry, recovered from enwiki-20111201-image.sql.gz

('Me1.jpg',7396,80,115,'a:20:{s:4:\"Make\";s:9:\"Panasonic\";s:5:\"Model\";s:13:\"PV-GS50 \";s:11:\"Orientation\";i:1;s:11:\"XResolution\";s:4:\"72/1\";s:11:\"YResolution\";s:4:\"72/1\";s:14:\"ResolutionUnit\";i:2;s:8:\"DateTime\";s:19:\"2004:09:03 20:01:51\";s:16:\"YCbCrPositioning\";i:2;s:12:\"ExposureMode\";i:0;s:12:\"WhiteBalance\";i:0;s:16:\"SceneCaptureType\";i:0;s:12:\"ExposureTime\";s:4:\"1/60\";s:7:\"FNumber\";s:5:\"18/10\";s:11:\"ExifVersion\";s:4:\"0220\";s:16:\"DateTimeOriginal\";s:19:\"2004:09:03 20:01:51\";s:17:\"DateTimeDigitized\";s:19:\"2004:09:03 20:01:51\";s:22:\"CompressedBitsPerPixel\";s:5:\"34/10\";s:5:\"Flash\";i:0;s:10:\"ColorSpace\";i:1;s:22:\"MEDIAWIKI_EXIF_VERSION\";i:1;}',8,'BITMAP','image','jpeg',
'Reverted to version as of 05:28, 5 July 2006',1702380,'Jasrasr','20080301205128','1f94p5ba6ewoybkhsr81t5otovi7ni7')

The sha1 of 1f94p5ba6ewoybkhsr81t5otovi7ni7 is very interesting; it corresponds to 0c302a907571f352105b726f4c314a5e937f60bf in hex.

There's an entry for that file in the deleted history of Me1.jpg, so it should be possible to restore it. What does it contain?

Looking at enwiki-20111201-image.sql.gz:

('Me1.jpg','20060705052850!Me1.jpg',3009,114,152,8,'I took this picture myself for my own personal page.',1290829,'Jdib84','20060420212137','0','BITMAP','image','jpeg',0,'ffifecytvu4rct5an5rzj56q0bo641e')

('Me1.jpg','20060921110230!Me1.jpg',7396,80,115,8,'Me',1702380,'Jasrasr','20060705052850','a:20:{s:4:\"Make\";s:9:\"Panasonic\";s:5:\"Model\";s:13:\"PV-GS50 \";s:11:\"Orientation\";i:1;s:11:\"XResolution\";s:4:\"72/1\";s:11:\"YResolution\";s:4:\"72/1\";s:14:\"ResolutionUnit\";i:2;s:8:\"DateTime\";s:19:\"2004:09:03 20:01:51\";s:16:\"YCbCrPositioning\";i:2;s:12:\"ExposureMode\";i:0;s:12:\"WhiteBalance\";i:0;s:16:\"SceneCaptureType\";i:0;s:12:\"ExposureTime\";s:4:\"1/60\";s:7:\"FNumber\";s:5:\"18/10\";s:11:\"ExifVersion\";s:4:\"0220\";s:16:\"DateTimeOriginal\";s:19:\"2004:09:03 20:01:51\";s:17:\"DateTimeDigitized\";s:19:\"2004:09:03 20:01:51\";s:22:\"CompressedBitsPerPixel\";s:5:\"34/10\";s:5:\"Flash\";i:0;s:10:\"ColorSpace\";i:1;s:22:\"MEDIAWIKI_EXIF_VERSION\";i:1;}','BITMAP','image','jpeg',0,'6fkaqblfccxi5egkgxcypzthf9d89r5')

('Me1.jpg','20070405010944!Me1.jpg',271384,640,427,8,'',2160909,'Sajidn','20060921110230','0','BITMAP','image','jpeg',0,'0b04y9ng82yxw5tiszewt3q8aj5r48v')

('Me1.jpg','20080301205046!Me1.jpg',74417,600,450,8,'Gareth Buchanan of Year 13 Thornfield performs at Pop Act 2005',2921689,'MrBillTheThrill','20070405010944','a:29:{s:4:\"Make\";s:4:\"SONY\";s:5:\"Model\";s:9:\"MVC-CD500\";s:11:\"Orientation\";i:1;s:11:\"XResolution\";s:12:\"720000/10000\";s:11:\"YResolution\";s:12:\"720000/10000\";s:14:\"ResolutionUnit\";i:2;s:8:\"Software\";s:27:\"Adobe Photoshop CS2 Windows\";s:8:\"DateTime\";s:19:\"2006:02:10 21:34:45\";s:16:\"YCbCrPositioning\";i:2;s:12:\"ExposureTime\";s:6:\"10/500\";s:7:\"FNumber\";s:5:\"25/10\";s:15:\"ExposureProgram\";i:2;s:15:\"ISOSpeedRatings\";i:100;s:11:\"ExifVersion\";s:4:\"0220\";s:16:\"DateTimeOriginal\";s:19:\"2005:12:20 10:34:52\";s:17:\"DateTimeDigitized\";s:19:\"2005:12:20 10:34:52\";s:22:\"CompressedBitsPerPixel\";s:3:\"4/1\";s:17:\"ExposureBiasValue\";s:4:\"0/10\";s:16:\"MaxApertureValue\";s:5:\"33/16\";s:12:\"MeteringMode\";i:5;s:11:\"LightSource\";i:0;s:5:\"Flash\";i:13;s:11:\"FocalLength\";s:6:\"158/10\";s:10:\"ColorSpace\";i:1;s:14:\"CustomRendered\";i:0;s:12:\"ExposureMode\";i:0;s:12:\"WhiteBalance\";i:0;s:16:\"SceneCaptureType\";i:0;s:22:\"MEDIAWIKI_EXIF_VERSION\";i:1;}','BITMAP','image','jpeg',0,''),

('Me1.jpg','20080301205128!Me1.jpg',7396,80,115,8,'Reverted to version as of 05:28, 5 July 2006',1702380,'Jasrasr','20080301205046','a:20:{s:4:\"Make\";s:9:\"Panasonic\";s:5:\"Model\";s:13:\"PV-GS50 \";s:11:\"Orientation\";i:1;s:11:\"XResolution\";s:4:\"72/1\";s:11:\"YResolution\";s:4:\"72/1\";s:14:\"ResolutionUnit\";i:2;s:8:\"DateTime\";s:19:\"2004:09:03 20:01:51\";s:16:\"YCbCrPositioning\";i:2;s:12:\"ExposureMode\";i:0;s:12:\"WhiteBalance\";i:0;s:16:\"SceneCaptureType\";i:0;s:12:\"ExposureTime\";s:4:\"1/60\";s:7:\"FNumber\";s:5:\"18/10\";s:11:\"ExifVersion\";s:4:\"0220\";s:16:\"DateTimeOriginal\";s:19:\"2004:09:03 20:01:51\";s:17:\"DateTimeDigitized\";s:19:\"2004:09:03 20:01:51\";s:22:\"CompressedBitsPerPixel\";s:5:\"34/10\";s:5:\"Flash\";i:0;s:10:\"ColorSpace\";i:1;s:22:\"MEDIAWIKI_EXIF_VERSION\";i:1;}','BITMAP','image','jpeg',0,'6fkaqblfccxi5egkgxcypzthf9d89r5'),

The value in the db for the sha1 of MrBillTheThrill image was ''. I wonder if a purge would load it with the sha1 of the *current* image.
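To spot rows like this in a dump, something along these lines could be used. It is a Python sketch over hand-extracted (archive name, size, sha1) triples from the rows quoted above; `suspicious_rows` is a hypothetical helper, not an existing maintenance script.

```python
from collections import defaultdict


def suspicious_rows(rows):
    """Flag rows with an empty sha1, and sha1 values shared by different sizes.

    rows: iterable of (archive_name, size_bytes, sha1_base36) tuples.
    Returns (names_with_empty_sha1, sha1s_with_conflicting_sizes).
    """
    rows = list(rows)
    empty = [name for name, _size, sha1 in rows if not sha1]

    sizes_by_sha1 = defaultdict(set)
    for _name, size, sha1 in rows:
        if sha1:
            sizes_by_sha1[sha1].add(size)
    conflicting = sorted(s for s, sizes in sizes_by_sha1.items() if len(sizes) > 1)
    return empty, conflicting


# Values copied from the Me1.jpg rows quoted above.
me1_rows = [
    ("20060705052850!Me1.jpg", 3009, "ffifecytvu4rct5an5rzj56q0bo641e"),
    ("20060921110230!Me1.jpg", 7396, "6fkaqblfccxi5egkgxcypzthf9d89r5"),
    ("20070405010944!Me1.jpg", 271384, "0b04y9ng82yxw5tiszewt3q8aj5r48v"),
    ("20080301205046!Me1.jpg", 74417, ""),
    ("20080301205128!Me1.jpg", 7396, "6fkaqblfccxi5egkgxcypzthf9d89r5"),
]
empty_sha1, conflicting_sha1 = suspicious_rows(me1_rows)
```

On these five rows it flags only 20080301205046!Me1.jpg (empty sha1). The two rows sharing 6fkaqblfccxi5egkgxcypzthf9d89r5 have the same size, so they are not reported as conflicting.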

The data loss was inadvertently fixed in r108886. The deletion will simply fail in the case of two different files wrongly having the same SHA-1 in the DB. This basically does what comment #16 suggested.

On a related note, I also noticed that LocalFile::lock() doesn't actually lock anything (no FOR UPDATE)...

Interestingly, this time it isn't letting me undelete the old version; it appears the error check you guys put in did stop it from doing that BUT it didn't stop the software from actually losing the file itself. :(

(In reply to comment #26)

Seems to have happened again with
http://en.wikipedia.org/wiki/Special:Undelete/File:MaastrichtStreet.JPG

23:42, 15 September 2006 . . GK tramrunner (talk | contribs | block) 1,024 × 768 (265,004 bytes) (One of the streets in Maastricht)

That is the only file that should be different. Are you sure that it wasn't broken before it was deleted?

No, I'm not sure; my fault for reopening.

(In reply to comment #29)

No, I'm not sure; my fault for reopening.

I noticed that they had different storage keys, meaning that FileRepo mapped the old file versions to different deleted file names.

This bug is about cases where two different files get mapped to the same deleted file name, which previously caused data loss, since only one of them "won" and the other was just erased.

Gilles raised the priority of this task from Medium to Unbreak Now!.Dec 4 2014, 10:29 AM
Gilles moved this task from Untriaged to Done on the Multimedia board.
Gilles lowered the priority of this task from Unbreak Now! to Medium.Dec 4 2014, 11:21 AM