
Corruption of archive text due to deletion in late 2004
Closed, Declined (Public)

Description

This is a bug I'm tracking down and fixing. I'm putting it here so I have a place for notes and something to refer to.

CGZ compression was first committed in October 2004 (r5940). In December 2004 (r6640), this bug was discovered and a temporary fix was put in place. Apparently nobody submitted it to Bugzilla at the time.

The issue was that the deletion UI was blind to the compression scheme, so it moved CGZ blobs and pointers into the archive table as they were. Undeletion would move them back. Pointers to deleted rows cannot work and will give you an error message, so the text behind these pointers is unreadable. If the whole article was undeleted, the CGZ blob would get a different old_id, which means that the pointers still don't work.

If the article was partially undeleted, then you could have pointers which point to deleted rows.
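To make the failure mode concrete, here is a toy model in Python. It is only a sketch: the `Blob`, `Stub` and `TextTable` classes are hypothetical stand-ins and not MediaWiki code, and it assumes that a pointer row refers to its blob by old_id plus a content hash.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


def md5(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


@dataclass
class Blob:
    """Toy stand-in for a CGZ blob: several revision texts keyed by hash."""
    items: Dict[str, str]
    default_hash: str


@dataclass
class Stub:
    """Toy stand-in for a pointer row: refers to a blob by its old_id."""
    blob_old_id: int
    content_hash: str


Row = Union[Blob, Stub]


class TextTable:
    """Hypothetical in-memory text table with auto-incrementing old_id."""

    def __init__(self) -> None:
        self.rows: Dict[int, Row] = {}
        self.next_id = 1

    def insert(self, row: Row) -> int:
        old_id = self.next_id
        self.next_id += 1
        self.rows[old_id] = row
        return old_id

    def read(self, old_id: int) -> Optional[str]:
        row = self.rows.get(old_id)
        if isinstance(row, Blob):
            return row.items[row.default_hash]  # default text still readable
        if isinstance(row, Stub):
            target = self.rows.get(row.blob_old_id)
            if isinstance(target, Blob):
                return target.items.get(row.content_hash)
        return None  # missing row or dangling pointer


def delete_then_undelete(table: TextTable, old_ids: List[int]) -> List[int]:
    """Move rows out to an 'archive' and back without rewriting pointers."""
    archived = [table.rows.pop(i) for i in old_ids]
    return [table.insert(row) for row in archived]


table = TextTable()
v1, v2 = "first revision", "second revision"
blob_id = table.insert(Blob({md5(v1): v1, md5(v2): v2}, default_hash=md5(v1)))
stub_id = table.insert(Stub(blob_id, md5(v2)))

before = table.read(stub_id)  # pointer resolves while the blob keeps its id
new_blob_id, new_stub_id = delete_then_undelete(table, [blob_id, stub_id])
after_blob = table.read(new_blob_id)  # blob default text is still readable
after_stub = table.read(new_stub_id)  # pointer still names the old blob id
```

In this model the undeleted blob comes back under a new id, so the stub dangles even though the text it wants is sitting in the table.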

However, undeleted CGZ rows would still give you their default text, which left them open to subsequent irreversible corruption by recompressTracked.php, which may have deleted some of these CGZ blobs, replacing them with a pointer to the primary text only.

The subsequent fixes (r6640, r8983) only fixed the text corruption at the source (i.e. deletion). Apparently no script was run to fix corrupted archive rows or undeleted text rows.

Some archive rows even have pointers to external storage, apparently moved in from old/text via the same bug.

The reason this is coming up now is that there are a fair few revisions which are either accessible (CGZ default text), or inaccessible but recoverable (CGZ pointers), which are now at risk of being lost permanently due to recompressTracked.php.

The basic plan of action is to compile a list of content hashes in affected CGZ blobs, and to match them up with broken pointers by comparing those content hashes.
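A rough sketch of that matching step, in Python for brevity. The data shapes are assumptions: each affected blob is represented as a dict from content hash to text, each broken pointer as a row id plus the hash it expects, and the hash is taken to be an MD5 of the text. `BrokenPointer`, `index_blobs` and `match_pointers` are hypothetical names, not existing scripts.

```python
import hashlib
from typing import Dict, Iterable, List, NamedTuple, Tuple


def md5(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


class BrokenPointer(NamedTuple):
    row_id: int        # id of the row holding the dangling pointer
    content_hash: str  # hash of the text the pointer is supposed to yield


def index_blobs(blobs: Dict[int, Dict[str, str]]) -> Dict[str, Tuple[int, str]]:
    """Map content hash -> (blob row id, text) across all affected blobs."""
    index: Dict[str, Tuple[int, str]] = {}
    for blob_id, items in blobs.items():
        for content_hash, text in items.items():
            # Skip items whose stored hash does not match their text.
            if md5(text) == content_hash:
                index.setdefault(content_hash, (blob_id, text))
    return index


def match_pointers(
    pointers: Iterable[BrokenPointer],
    index: Dict[str, Tuple[int, str]],
) -> Tuple[Dict[int, str], List[int]]:
    """Return (recovered text by row id, row ids with no matching hash)."""
    recovered: Dict[int, str] = {}
    lost: List[int] = []
    for pointer in pointers:
        hit = index.get(pointer.content_hash)
        if hit is None:
            lost.append(pointer.row_id)
        else:
            recovered[pointer.row_id] = hit[1]
    return recovered, lost


blobs = {101: {md5("alpha"): "alpha", md5("beta"): "beta"}}
pointers = [BrokenPointer(7, md5("beta")), BrokenPointer(8, md5("gamma"))]
recovered, lost = match_pointers(pointers, index_blobs(blobs))
```

Because the lookup is by content hash rather than by old_id, it does not matter which id the blob ended up with after undeletion, or whether it is still sitting in the archive table.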

I may be able to take this opportunity to normalise the entire archive table, by converting archive rows to the MW 1.5+ format, with a non-null ar_text_id, and blank ar_text and ar_flags. This will free up core database space and allow the deleted text to be recompressed.
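The normalisation could look roughly like the following sketch, written in Python against an in-memory SQLite database. The two tables are heavily simplified assumptions (only the columns named above plus an `ar_id` key), `normalise_archive` is a hypothetical helper, and the sketch does not attempt to repair broken CGZ pointers, which would need the hash matching first.

```python
import sqlite3


def normalise_archive(conn: sqlite3.Connection) -> int:
    """Move inline archive text into the text table; return rows converted."""
    cur = conn.cursor()
    legacy = cur.execute(
        "SELECT ar_id, ar_text, ar_flags FROM archive WHERE ar_text_id IS NULL"
    ).fetchall()
    for ar_id, ar_text, ar_flags in legacy:
        cur.execute(
            "INSERT INTO text (old_text, old_flags) VALUES (?, ?)",
            (ar_text, ar_flags),
        )
        # Point the archive row at the new text row and blank the inline copy.
        cur.execute(
            "UPDATE archive SET ar_text_id = ?, ar_text = '', ar_flags = '' "
            "WHERE ar_id = ?",
            (cur.lastrowid, ar_id),
        )
    conn.commit()
    return len(legacy)


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE text (old_id INTEGER PRIMARY KEY, old_text TEXT, old_flags TEXT);
    CREATE TABLE archive (
        ar_id INTEGER PRIMARY KEY,
        ar_text TEXT, ar_flags TEXT, ar_text_id INTEGER
    );
    INSERT INTO text (old_id, old_text, old_flags) VALUES (1, 'new style', 'utf-8');
    INSERT INTO archive VALUES (1, '', '', 1);                  -- already normalised
    INSERT INTO archive VALUES (2, 'old style', 'utf-8', NULL); -- legacy inline row
    """
)
converted = normalise_archive(conn)
row = conn.execute(
    "SELECT ar_text, ar_flags, ar_text_id FROM archive WHERE ar_id = 2"
).fetchone()
moved = conn.execute(
    "SELECT old_text, old_flags FROM text WHERE old_id = ?", (row[2],)
).fetchone()
```

Once every archive row has an ar_text_id, the deleted text lives in the text table like any other revision text, which is what makes it eligible for recompression.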


Version: 1.4.x
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=34925

Details

Reference
bz22624

Event Timeline

bzimport raised the priority of this task to Low. (Nov 21 2014, 11:05 PM)
bzimport set Reference to bz22624.
bzimport added a subscriber: Unknown Object (MLST).

(copying from wikitech)

I may be able to take this opportunity to normalise the entire archive table,
by converting archive rows to the MW 1.5+ format, with a non-null ar_text_id,
and blank ar_text and ar_flags. This will free up core database space and allow
the deleted text to be recompressed.

What are the chances of moving them back to revision and use revdelete for all deletions (removing archive table)?
See bug 18104, bug 21279, bug 18780

trackBlobs.php refers to a normaliseArchiveTable.php script, which I could not find in core or in WikimediaMaintenance. Has this script not yet been written?

pawanseerwani+bugzilla wrote:

Hi,
I am working on a related issue, i.e. bug 34925.

All I understand is some data is already corrupted in archives tables and before solving Bug 34925, this bug is to be solved.

So I might as well solve this bug first.

But in my repository( which is wikimedia 1.23), none of the following files exist in wikimedia/maintenance folder

  1. recompressTracked.php
  2. trackBlobs.php
  3. normaliseArchiveTable.php

So can someone tell me how do I solve this bug?

(In reply to comment #3)

All I understand is some data is already corrupted in archives tables and
before solving Bug 34925, this bug is to be solved.

As stated in comment 0 ("description"), it would perhaps be most efficient to clean up the database corruption while normalizing the archive table, because doing so requires copying data into the text table anyway (and possibly into external storage, if that is where the text should end up).

So I might as well solve this bug first.

But in my repository( which is wikimedia 1.23), none of the following files
exist in wikimedia/maintenance folder

  1. recompressTracked.php
  2. trackBlobs.php
  3. normaliseArchiveTable.php

The first two exist in a subfolder of maintenance -- maintenance/storage. The third is the maintenance script you were trying to write ("textMigration.php"), though with a command-line option for fixing this bug.

So can someone tell me how do I solve this bug?

This isn't a particularly easy bug to fix.

MediaWiki's text storage subsystem is poorly documented, and there have been various bugs over the years (including this one!) that need to be accounted for.

Testing is a bit tricky. You would have to set up PHP4 in order to install a buggy revision of MediaWiki 1.4 (which is not compatible with PHP5). You would have to create, edit, and delete some pages, and run a specific maintenance script (compressOld.php) prior to page deletion. Then you would have to switch to PHP5 and upgrade the installation to the master version of MediaWiki.

What the maintenance script will have to do is already stated in comment 0: "The basic plan of action is to compile a list of content hashes in affected CGZ blobs, and to match them up with broken pointers by comparing those content hashes."

I might as well assign bug 34925 to myself, as I have already spent the several hours necessary to understand how MediaWiki does text storage, and the code I have looks more complete than what has already been posted on Gerrit.

The only reason I didn't already do so was the possibility that Tim Starling might already have this mostly done ("This is a bug I'm tracking down and fixing"). However, if you manage to get the script done before Tim Starling or I do, I would love to take a look at it again.

Anomie subscribed.

After some discussion in code review on https://gerrit.wikimedia.org/r/#/c/393928/, it doesn't seem like anyone is going to actually do the work to clean up any archived revisions that are inaccessible due to that corruption.

The maintenance script in that patch is going to throw away whatever broken data is in inaccessible rows. Accessible rows will be rewritten into the text table and ExternalStore so they're no longer at risk of being corrupted.