Page MenuHomePhabricator

Inconsistency bug in file re-upload (possible race condition) potentially causing data loss
Closed, ResolvedPublic

Description

Denniss pointed me to https://commons.wikimedia.org/wiki/?curid=28611474

Basically, the original version of the file is missing (oi_archive_name empty string). On top of that, the two old versions have the exact same upload timestamp down to the second, which seems unlikely to be correct (unless both uploaded by uploadwizard at same time, using chunked upload or upload stash maybe?).

The img_metadata looks similar, except orientation field is different, the files have different sha1's and different sizes, which would support the hypothesis that user saw file looked wrong, rotated in an image manipulation program, and re-uploaded (however that doesn't follow with the timestamp).

So weird things happening here:
*Original file is gone
*The second version has the exact same img_timestamp as the first version
Furthermore log_timestamps are the same too. The log_ids are such that it appears the timestamps are correct
*page_id in the log entry for the second version, is incorrect (It is 0, which could indicate something wrong, perhaps with slave lag. Normally the logging code does a db query by itself to determine what log_page to use, and I believe it hits a slave for that)
*There is no dummy edit to the file description page for the first re-upload
*Hard to be sure, but it appears the original file was visible for a long time (At least 23 hours after the second version was uploaded, as this was when it was tagged as needing rotation). Then rotate bot comes along, rotates things, but it rotates the wrong 2nd version, which is already rotated
Possibly caused by a race condition with varnishes - the re-upload sends cache purges but they happen before the first image is rendered, and the thumbnail gets stuck in varnish cache. (I think the issue where the original file is gone, is more important than the thumbnails stuck in cache, assuming that is what is happening here)

Possible theory: User uses upload wizard to upload all the files in some folder. This person accidentally selects the non-rotated version, and then later selects the rotated version, in the same batch, and puts both as they should have the same filename. Upload wizard uploads both, and tries to publish both to the same file name. Since the long part of the upload already has taken place (since its stashed), only the quick publish step has to happen, which can happen at the same time. Race conditions in filebackend causes data loss of original file.


Version: 1.22.0
Severity: critical
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=54750

Details

Reference
bz54736

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 2:24 AM
bzimport set Reference to bz54736.
bzimport added a subscriber: Unknown Object (MLST).

p.s. Denniss's comment at https://commons.wikimedia.org/w/index.php?title=User_talk:Bawolff&oldid=105637956#Storage.2FBackend_issues.3F suggests that this is not just a one-off issue. I'll repeat his original comment here for the record:

"While monitoring the Rotatebot log I often come around images with storage issues; the first image revision is invalid but not completely broken as the second working image seems to gather wrong rotation info from it, so does Rotatebot and in the end the images gets rotated wrong. Deleting the file and restoring the first working image (usually the first but sometimes the second revision is shown as invalid) shows in most cases an image without rotation problem. Do you have an idea why this data corruption happens? Almost all of these image have the same storage time recorded, maximum one minute difference between the two. Example here. --Denniss (talk) 09:45, 28 September 2013 (UTC)"

I think the storage problem for two revision of the same file uploaded at the same time is an old, known one. (but should be fixed anyway)

The "old-version shown 23 hours later" looks like a cache problem (not strange given that the two versions had identical timestamps, to the second)

The UW is a good theory.

New file to check: https://commons.wikimedia.org/wiki/File:Cuenca,_casas_colgantes.JPG

First image version invalid, second working with thumb upright but fullres rotated. Uploader requested rotation ~100 mins after upload and rotated image is off by 90 degrees.

Confirmed, if you upload two separate files with the uploadwizard that you say have the same title, bad things happen ( https://test.wikipedia.org/wiki/File:BDavis-test-del.jpg ).

Suggested work around for the near term: Have uploadwizard bail out in this situation. Its very unlikely if the user is doing this, that they really want the file. Filed as bug 54750.

Additionally we should actually fix the race condition.

(In reply to comment #4)

Confirmed, if you upload two separate files with the uploadwizard that you
say
have the same title, bad things happen (
https://test.wikipedia.org/wiki/File:BDavis-test-del.jpg ).

At the same time or far apart too?

(In reply to comment #5)

(In reply to comment #4)

Confirmed, if you upload two separate files with the uploadwizard that you
say
have the same title, bad things happen (
https://test.wikipedia.org/wiki/File:BDavis-test-del.jpg ).

At the same time or far apart too?

At the same time.

Change 87010 had a related patch set uploaded by Aaron Schulz:
Make sure LocalFile::lock() works for brand new uploads

https://gerrit.wikimedia.org/r/87010

Change 87010 merged by jenkins-bot:
Make sure LocalFile::lock() works for brand new uploads

https://gerrit.wikimedia.org/r/87010

Patch got merged, can this be closed as RESOLVED FIXED?

Patch is live on test. Someone should do the steps to reproduce to make sure bug is actually fixed. (I can't at the moment)

(In reply to comment #10)

Patch is live on test. Someone should do the steps to reproduce to make sure
bug is actually fixed. (I can't at the moment)

Tested, it works fine now, both files succesfully uploaded


However, the API publish request (the action=upload&filekey=foo&... request) for one of the files does time out, eventually returning a squid error "Request: POST http://test.wikipedia.org/w/api.php, from 208.80.154.9 via cp1010.eqiad.wmnet (squid/2.7.STABLE9) to 10.64.0.123 (10.64.0.123)<br/>
Error: ERR_READ_TIMEOUT, errno [No Error] at Fri, 04 Oct 2013 14:52:03 GMT"

Interestingly enough, the timestamp from that error is precisely when in the log the second version of the file is uploaded (14:52:03), which is kind of odd coincidence if it was just uploading both in sequence took longer than the squid timeout. I wonder if some lock isn't being released for the second one until after the user disconnects.

Anyhow, calling this bug fixed. The second part might warrant further investigation, but is a much lower priority issue - splitting to bug 54978

Gilles raised the priority of this task from High to Unbreak Now!.Dec 4 2014, 10:21 AM
Gilles moved this task from Untriaged to Done on the Multimedia board.
Gilles lowered the priority of this task from Unbreak Now! to High.Dec 4 2014, 11:20 AM