text of revisions in the archive table that were deleted before Wikipedia started using MediaWiki 1.5 is corrupt
Closed, DeclinedPublic

Description

I was checking through deleted revisions in the main namespace by Conversion script on the English Wikipedia, to find old deleted edits to history merge:
http://en.wikipedia.org/w/index.php?limit=500&title=Special%3ADeletedContributions&target=Conversion+script&namespace=0

I found that in all pages deleted before Wikipedia was upgraded to MediaWiki 1.5 (late June 2005), every edit except the latest one is corrupt. An undeleted example of these edits can be found at the URL given for this bug; the edits were previously at the title "Clearwater River, Idaho", and I history merged them into the existing article "Clearwater River (Idaho)". Another example involves the page about Michael Collins:
http://en.wikipedia.org/w/index.php?title=Michael_Collins&dir=prev&limit=6&action=history

The edits were previously at the title "Michael Collins (disambiguation)".

Even though 99.9% of the text in these old deleted archives is garbage, the other 0.1% is very important page history and it should not be corrupted.


Version: unspecified
Severity: major
URL: http://en.wikipedia.org/w/index.php?title=Clearwater_River_(Idaho)&dir=prev&limit=16&action=history

Details

Reference
bz19990

Event Timeline

bzimport raised the priority of this task to Lowest. Nov 21 2014, 10:42 PM
bzimport set Reference to bz19990.

Possible external storage issue? Looks like something not getting un-gzipped or losing its flags.

I'm not sure if this is related, but some revisions before June 2005 are completely blank when they shouldn't be, as reported at this discussion on the technical village pump:

http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)/Archive_62#Revision_content_disappeared

I didn't think much of it at the time, but both problems seem to involve Wikipedia text added before the upgrade to MediaWiki 1.5.

These deleted revisions from before June 2005 are fine:
http://en.wikipedia.org/wiki/Special:Undelete/Braille_music

They should stay deleted, since they were obviously nuked to make way for a page move.

This should be fixed in r55626.

It's fixed in the archive table where the MW 1.4 deleted revisions are.

However the undeleted edits to "Clearwater River (Idaho)" and "Michael Collins" that I mentioned above are still corrupt. I tried deleting and undeleting them, just in case, and that didn't fix the issue. I highly doubt there are many other revisions with this problem.

I'm not sure of the proper protocol here: whether to re-open this bug or start a new one.

(In reply to comment #6)

It's fixed in the archive table where the MW 1.4 deleted revisions are.

However the undeleted edits to "Clearwater River (Idaho)" and "Michael Collins"
that I mentioned above are still corrupt. I tried deleting and undeleting them,
just in case, and that didn't fix the issue. I highly doubt there are many
other revisions with this problem.

I'm not sure of proper protocol here : whether to re-open this bug, or start a
new one ...

Anything that was undeleted while the bug was active is now permanently corrupted and will need to be fixed manually.

Yikes, I thought as much. So ... what happens with this bug? The underlying issue is resolved but it's still caused damage that's seemingly hard to fix.

The only way to fix it is to update each corrupted row in the database, e.g. by manually adding "gzip" to the old_flags field. The problem is that it would be very difficult to find the affected revisions automatically.
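For illustration, here is a minimal Python sketch of what that per-row fix might look like. It rests on assumptions not stated in this thread: that old_flags is a comma-separated list of flag names, and that the "gzip" flag marks text stored as a raw DEFLATE stream (what PHP's gzdeflate/gzinflate produce and consume). The function names are hypothetical, nothing here touches a database, and the detection is only a heuristic, so any suggested change would still need a human check.

```python
import zlib


def looks_deflated(blob: bytes) -> bool:
    """Heuristic: True if blob inflates cleanly as a raw DEFLATE stream.

    Assumption: "gzip"-flagged text is raw DEFLATE (no zlib/gzip header),
    which is why wbits is -15 here.
    """
    try:
        zlib.decompress(blob, -15)
        return True
    except zlib.error:
        return False


def add_flag(old_flags: str, flag: str = "gzip") -> str:
    """Return old_flags with flag appended if it is not already present.

    Assumption: old_flags is a comma-separated list such as "utf-8,gzip".
    """
    flags = [f for f in old_flags.split(",") if f]
    if flag not in flags:
        flags.append(flag)
    return ",".join(flags)


def suggested_flags(old_text: bytes, old_flags: str) -> str:
    """Suggest a corrected old_flags value for one row (hypothetical helper)."""
    if "gzip" not in old_flags.split(",") and looks_deflated(old_text):
        return add_flag(old_flags, "gzip")
    return old_flags


def raw_deflate(data: bytes) -> bytes:
    """Compress data as raw DEFLATE, only to build sample rows for the demo."""
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


if __name__ == "__main__":
    corrupted = raw_deflate(b"The Clearwater River is a river in Idaho.")
    print(repr(suggested_flags(corrupted, "")))                   # flag lost
    print(repr(suggested_flags(b"Not compressed wikitext", "")))  # plain text
```

The sketch only proposes a new flags value; deciding which rows to feed into it is the hard part.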

Then I'd like someone to fix the revisions I mentioned above:
http://en.wikipedia.org/w/index.php?title=Clearwater_River_(Idaho)&dir=prev&limit=16&action=history

and:
http://en.wikipedia.org/w/index.php?title=Michael_Collins&dir=prev&limit=6&action=history

As for finding other cases on the English Wikipedia, look for revisions whose revision ID is greater than 296,365,718 but whose revision date is before July 2005, i.e. from when MW 1.4 was still in use. I use 296365718 as the threshold because it's the last uncorrupted revision I know of that was deleted and could have had this problem; see this diff:
http://en.wikipedia.org/w/index.php?title=User:Xaonon&diff=2406956&oldid=296365718

As far as I know, this would work because before MW 1.5 was used, a revision got a new rev_id when it was undeleted.
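The check described above can be sketched in Python as follows. This is only an illustration under stated assumptions: that revision timestamps are available as 14-digit "YYYYMMDDHHMMSS" strings, and that "before July 2005" means strictly earlier than 2005-07-01 00:00:00. The function and constant names are hypothetical. The threshold 296365718 is the one given in the comment.

```python
# Threshold from the comment above: the last known uncorrupted revision ID.
LAST_KNOWN_GOOD_REV_ID = 296365718

# Assumption: timestamps are 14-digit "YYYYMMDDHHMMSS" strings, so plain
# string comparison orders them chronologically.
MW_1_5_CUTOFF = "20050701000000"


def is_candidate(rev_id: int, rev_timestamp: str) -> bool:
    """True if a revision matches the heuristic for possibly corrupted text.

    A high revision ID combined with a pre-July-2005 date suggests an old
    revision that was undeleted later and therefore received a new rev_id.
    """
    return rev_id > LAST_KNOWN_GOOD_REV_ID and rev_timestamp < MW_1_5_CUTOFF


if __name__ == "__main__":
    # Made-up rows, purely to show the filter in use.
    rows = [
        (296365719, "20040312101500"),
        (296365718, "20040312101500"),
        (300000000, "20050701000000"),
        (2406956, "20040215080000"),
    ]
    for rev_id, ts in rows:
        print(rev_id, ts, is_candidate(rev_id, ts))
```

A match only marks a revision as worth inspecting; it does not prove the text is corrupt.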

Tim, do you think this is something that still can and should be recovered or just close as WONTFIX?

Realistically closing this as WONTFIX nowadays.