Page MenuHomePhabricator

populateRevisionSha1.php misses archive rows, whose text is stored directly in the archive table (not text table)
Closed, ResolvedPublic

Description

populateRevisionSha1.php omits archive rows, whose ar_rev_id is NULL [1].
Nevertherless, those omitted archive rows may need a SHA1 hash, as they may stem from MediaWiki 1.4. See tables.sql:396–398 :

  • Old entries from 1.4 will be NULL here, and a new rev_id will
  • be created on undeletion for those revisions. ar_rev_id int unsigned,

If ar_rev_id is NULL, the archived revision text is stored in column ar_text.

[1] Selection by "$idCol IS NOT NULL" in populateRevisionSha1.php:75


Version: 1.20.x
Severity: normal

Details

Reference
bz34373

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 12:12 AM
bzimport set Reference to bz34373.
bzimport added a subscriber: Unknown Object (MLST).

Tagging for 1.19wmf fix even though I think we're only doing a partial rollout for sha1.

Aaron, if you don't think this is necessary, please change.

While the proposed fix does indeed add a SHA1 sum to archive rows whose text is stored directly within the archive table, the proposed fix introduces two problems:

  • It breaks populating SHA1 for the rows from the revision table and from the archive table (only for rows which have their text in the text table).

Changing the if condition from

$this->upgradeRow( $row, $db, $table, $idCol, $prefix )

in populateRevisionSha1.php:86 to

$this->upgradeRow( $row, $table, $idCol, $prefix )

fixes this issue.
But lacking commit access, I wonder whether I should file a separate bug for this regression.

  • The SHA1 sum set with previous versions of the script for archive rows that store their text in the text table does not match the SHA1 sum that they would receive now.

By comparing the sums and experimenting with the text, it seems, the text used for SHA1 computation now gets rtrimmed.
Having some SHA1 sums for the old approach and some for the new approach in the table is not quite optimal for automatic verification ;)
Has this change been introduced on purpose?

(In reply to comment #4)

  • It breaks populating SHA1 for the rows from the revision table and from the

archive table (only for rows which have their text in the text table).
Changing the if condition from

See r111881.

  • The SHA1 sum set with previous versions of the script for archive rows that

store their text in the text table does not match the SHA1 sum that they would
receive now.

I don't know what you mean. getRawText() returns the text as it is used by all of MediaWiki. The SHA-1 of that is the SHA-1 of the text.

Created attachment 10047
SQL snippet showing an archive along with its text that procudes different SHA1 sums depending on the used MediaWiki version

attachment withoutSHA1.sql ignored as obsolete

Created attachment 10049
SQL snippet showing an archive along with its text that procudes different SHA1 sums depending on the used MediaWiki version (2nd try)

Attached:

(In reply to comment #5)

(In reply to comment #4)

  • It breaks populating SHA1 for the rows from the revision table and from the

archive table (only for rows which have their text in the text table).
Changing the if condition from

See r111881.

Great!

  • The SHA1 sum set with previous versions of the script for archive rows that

store their text in the text table does not match the SHA1 sum that they would
receive now.

I don't know what you mean.

Upon taking a closer look at the problem, I realized the problem is more twisted than I indicated. I attached a SQL file (attachment 10049) to show the problem.

Running populateRevisionSha1.php for <r111795, the ArchiveUncompressedPlainUtf8 gets a SHA1 of t1vsj53xq35125uu7m495n0v15l0tu3.

Running populateRevisionSha1.php for >=r111795 [1], the very same row gets a SHA1 of gyh34cr8h1pice2eo0ff2krn7zsnl72.

The problem seems to be that the two archive rows of the SQL file share the same timestamp and updateRevisionSha1.php relies on the timestamps being unique.
However, I cannot see a reason for the timestamps being unique. Is there one?
If it's just an educated guess, I guess it would be good to also add the ar_title/ar_len column to reduce the risk of false SHA1 sums.

[1] Of course, one has to additionally apply the above patch removing $db from the upgradeRow call, to get the script running.

Fixed in r111920. This was a straightforward problem with trying to handle too much irregularity in one function and having a misleadingly named $idCol var.

Looks good for me. -> Marking resolved/fixed

needs to be backported to 1.19, I think