
MD5 or SHA1 checksum in stub dumps
Closed, ResolvedPublic

Description

Author: erikzachte

Description:
There is a growing audience for revert stats. Nimisz Gautam and Erik Zachte both wrote scripts that generate revert stats by comparing revisions in the dumps via MD5 sums. Rob Lanphier expects that MD5 sums can be used for even fancier processing.

Right now the only way to harvest MD5s is by parsing the full archive dumps, which takes forever.

The proposal is to store an MD5 in the stub dumps for every revision. This would allow a monthly refresh of revert stats (see the URL below) and regular publication of revert data files for researchers.

e.g.

<page>
  <title>United States Declaration of Independence</title>
  <id>19</id>
  <revision>
    <id>1926607</id>
    <timestamp>2010-06-15T22:06:14Z</timestamp>
    <contributor>
      <username>Innotata</username>
      <id>172490</id>
    </contributor>
    <text id="1894246" />
    <md5>eff7d5dba32b4da32d9a67a519434d3f</md5>
  </revision>
</page>
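
For illustration only, here is a minimal Python sketch of how per-revision checksums like the one in the example above could drive revert detection. The function names and the rule used (a revision counts as a revert when its checksum matches an earlier, non-adjacent revision of the same page) are assumptions made for this sketch, not a description of the existing scripts.

```python
import hashlib


def md5_hex(text):
    """Hex MD5 of a revision's text, encoded as UTF-8."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def find_reverts(checksums):
    """Return (revision_index, restored_index) pairs for one page history.

    A revision is treated as a revert when its checksum equals that of an
    earlier revision which is not its immediate predecessor (identical
    consecutive revisions are ignored as null edits).
    """
    last_seen = {}
    reverts = []
    for i, checksum in enumerate(checksums):
        previous = last_seen.get(checksum)
        if previous is not None and previous < i - 1:
            reverts.append((i, previous))
        last_seen[checksum] = i
    return reverts


# Hypothetical page history: revision 2 restores the text of revision 0.
texts = ["Original text", "Original text plus vandalism", "Original text", "Expanded text"]
sums = [md5_hex(t) for t in texts]
reverts = find_reverts(sums)
print(reverts)
```

With the checksum already present in the stub dump, only the first helper becomes unnecessary; the comparison itself never needs the revision text.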

Version: unspecified
Severity: enhancement
URL: http://stats.wikimedia.org/EN/EditsRevertsEN.htm

Details

Reference
bz25312

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:17 PM
bzimport set Reference to bz25312.

I would like this for another reason; I would like to see it used to compare the stub text revision in the file with the one we get from the db. This could make sure that the stuff on disk doesn't wind up silently corrupted and then carried forward in its corrupted state forever (or until someone stumbles across the problem).

However, these sums have to be computed at some point, and that's going to slow the dumps down considerably. For my use case we would want the md5sum in the revision metadata somewhere (maybe too slow to do that at each page revision save?).

We would not be able to generate the MD5 sums for the stub dumps in any case; we wouldn't know them in advance of reading the text, unless they were added to the revision table someplace, in which case see the above.

I'm probably missing something, so feel free to get into technical details of what folks want and how it could work in practice.

The fancier processing that Erik is referring to is this:
http://www.mediawiki.org/wiki/Pending_Changes_enwiki_trial/Reversion_collapsing

...which really isn't all that specific to Pending Changes.

I'm listing bug 21860 ("Add checksum field to text table; expose it in API") as a blocker for this one, even though it may not necessarily be one. Adding the MD5 to the stub dumps would be much simpler if it were in the db.

I don't see why #21860 is a blocker - if text is being read, calculating checksums is cheap enough.
Storing all that in the database isn't free.

We do not need a cryptographic hash (like md5) but we can use a hash such as murmurhash (http://sites.google.com/site/murmurhash/ and http://en.wikipedia.org/wiki/MurmurHash) which seems to be one of the fastest around. There is also a PHP implementation at http://sites.google.com/site/nonunnet/php/php_murmurhash

(In reply to comment #4)

We do not need a cryptographic hash (like md5) but we can use a hash such as
murmurhash

I did some quick benchmarks, and both the md5() and sha1() PHP functions are very fast, even for multi-megabyte inputs, so speed is not an issue.
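
A rough way to repeat this kind of measurement is sketched below. It uses Python's `hashlib` rather than PHP's `md5()` and `sha1()`, so absolute numbers will differ from the benchmark mentioned above; the 4 MiB payload and the helper name `time_hash` are arbitrary choices made for this example.

```python
import hashlib
import time


def time_hash(name, data, repeats=5):
    """Return the best wall-clock time (seconds) to hash `data` once."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        hashlib.new(name, data).hexdigest()
        best = min(best, time.perf_counter() - start)
    return best


payload = b"x" * (4 * 1024 * 1024)  # 4 MiB of dummy revision text
timings = {name: time_hash(name, payload) for name in ("md5", "sha1")}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.2f} ms for {len(payload)} bytes")
```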

SHA1 *might* make more sense than MD5, if only because it may help us in a crazy future where we leverage tools associated with Git or other version control systems (for example, Mercurial uses SHA1 as well). Not that there's anything planned, but since the choice of hash is somewhat arbitrary otherwise, SHA1 might be slightly preferable.

Fields added to tables in r94289.

Thanks Aaron! This is a very welcome feature.

r94289 and subsequent revisions reverted by Brion in r94541.

(In reply to comment #3)

I don't see why #21860 is a blocker - if text is being read, calculating
checksums is cheap enough.
Storing all that in the database isn't free.

When creating a stub dump, we haven't read the text yet -- the job of fetching and inserting the text is being deferred to a later process (textDumpPass) which pulls the text either from a previous dump or from the text table / external storage etc.

So at that point, only data within the 'page' and 'revision' tables, and anything else that can be very cheaply fetched, is available.

A rev_sha1 field that's already been pre-filled out would be usable for creating stub dumps; calculating from text after it's been read would only be usable on the final dumps (or else a second equivalent pass).

Using a separate field for this also gives greater confidence that there was not internal data corruption; if the sha1 is generated from the text that's right next to it in the same file, there's no point -- the client could calculate it as easily and reliably as the server could have, and in neither case will it indicate if the data has been corrupted on the backend.
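
The integrity argument can be sketched as follows: the checksum is recorded once, when the revision is saved, and later compared against whatever text comes back from storage. This is only an illustration. The helper names are hypothetical, and a hex digest is assumed here because the thread does not say how a rev_sha1 value would be encoded.

```python
import hashlib


def sha1_hex(text):
    """Hex SHA-1 of a revision's text, encoded as UTF-8."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def text_matches_stored_sha1(text, stored_sha1):
    """True if the text read back from storage still matches the checksum
    that was recorded separately when the revision was saved."""
    return sha1_hex(text) == stored_sha1


# Checksum recorded at save time (stands in for a pre-filled rev_sha1 field).
stored = sha1_hex("Text as originally saved")

ok = text_matches_stored_sha1("Text as originally saved", stored)
corrupted = text_matches_stored_sha1("Text as origina\x00ly saved", stored)
print(ok, corrupted)
```

A checksum computed from the text at dump time would pass both checks trivially, which is why the separately stored value is the one that matters.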

(In reply to comment #7)

SHA1 *might* make more sense than MD5, if only because it may help us in a
crazy future where we leverage tools associated with Git or other version
control systems (for example, Mercurial uses SHA1 as well). Not that there's
anything planned, but since the choice of hash is somewhat arbitrary otherwise,
SHA1 might be slightly preferable.

I don't think there'd be much chance at integration here really; git's object references are based on SHA-1 checksums, but of the entire object including a header indicating type ('blob' for files) and size prepended.
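
To make the difference concrete, the sketch below computes both a plain SHA-1 of some content and the ID git assigns to the same content stored as a blob, which hashes the header `blob <size>\0` followed by the content. The function names are illustrative.

```python
import hashlib


def plain_sha1(data: bytes) -> str:
    """SHA-1 of the raw bytes, as a revision checksum would be."""
    return hashlib.sha1(data).hexdigest()


def git_blob_sha1(data: bytes) -> str:
    """SHA-1 as git computes it for a blob: header + NUL + content."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


content = b"Some revision text\n"
print(plain_sha1(content))
print(git_blob_sha1(content))  # differs from the plain SHA-1 above
```

The two values never coincide for the same content in practice, so a plain SHA-1 stored per revision could not be used directly as a git object reference.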

Very correct about the data integrity piece, as I mentioned in comment 1. I use rev_len for now but that is not foolproof. I've seen a number of revisions on other projects that have identical revision lengths (and they are not redirects either but actual content). We've had serious data corruption in the past, and odds are we'll run into it again for one reason or another.

Bug 2939 did look like something that this blocked. Wouldn't checksum revert detection be the way to fix that bug?

(In reply to comment #13)

Bug 2939 did look like something that this blocked. Wouldn't checksum revert
detection be the way to fix that bug?

Bug 2939 is about the ability to detect reverts for the purpose of displaying the new messages notification bar. That would rely on the ability to uniquely identify revisions by putting unique identifiers in the database (bug 21860). Putting unique identifiers in the stub dumps (this bug, bug 25312) wouldn't really have anything to do with that.

Created attachment 9461
Patch adds a new sha1 tag to each revision in XML dump.

It writes the sha1 hash if the revision row contains this field; otherwise it writes an empty tag. I'm not sure that is the best way to do it, so if there are any other edge cases that I didn't think of, please let me know. The patch also updates export-0.6.xsd.

I guess that the revision row would always contain the field, whether or not it is populated, since the patch to Export.php should go in at the same time as the schema change.

I would suggest though that we don't provide the hash when the revision has been deleted; in that case we would want to write an empty tag.

Hi Ariel, good point! I'll fix it for deleted revisions.
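
The behaviour discussed here (write the hash when the row has one, write an empty tag otherwise or when the revision has been deleted) could look roughly like the Python sketch below. It is not the attached patch: the function name, the dict-shaped row, the `text_deleted` flag, and the `<sha1 />` spelling of the empty tag are all assumptions made for illustration.

```python
from xml.sax.saxutils import escape


def sha1_element(row, text_deleted=False, indent="    "):
    """Build the <sha1> element for one revision of an XML dump.

    `row` is a dict standing in for a revision table row; its `rev_sha1`
    key may be missing or empty. An empty tag is written in that case and
    also when the revision text has been deleted.
    """
    value = row.get("rev_sha1") or ""
    if text_deleted or not value:
        return f"{indent}<sha1 />"
    return f"{indent}<sha1>{escape(value)}</sha1>"


print(sha1_element({"rev_sha1": "0123456789abcdef"}))
print(sha1_element({"rev_sha1": ""}))
print(sha1_element({"rev_sha1": "0123456789abcdef"}, text_deleted=True))
```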

Yes, I think so. I updated Export.php so that the hash will be exported to the XML files once 1.19 is deployed.

*** Bug 33221 has been marked as a duplicate of this bug. ***