
Add checksum field to database table; expose it in API
Closed, ResolvedPublic

Description

I propose that a checksum field be added to the text table that reflects the contents of the "old_text" column.

Reasoning:

  • This column could then be checked at each insertion into the text table to avoid saving the same text multiple times (a minimal sketch follows this list). This could reduce the disk footprint required to store a set of revisions.
  • Exposing this checksum field via the API would allow developers to know when it is necessary to request the text of a revision. They may already have a copy of it from a previous request and can simply reuse that.
  • The checksum field would provide a quick mechanism for detecting identity reverts. By identity revert, I mean a reversion of a page to an *exact* previous state, such that both revisions contain text with the same checksum.
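
For illustration, a minimal sketch (in Python) of the first point, with a plain dict standing in for the text table; the names and the MD5 choice are assumptions made for the example, not part of the proposal:

```
import hashlib

# Hypothetical in-memory stand-in for the text table: checksum -> text row id.
text_rows_by_checksum = {}
next_text_id = 1

def save_text(old_text: str) -> int:
    """Return the id of a text row holding old_text, inserting only if it is new."""
    global next_text_id
    checksum = hashlib.md5(old_text.encode("utf-8")).hexdigest()
    existing_id = text_rows_by_checksum.get(checksum)
    if existing_id is not None:
        # Identical text already stored: reuse the existing row instead of saving again.
        return existing_id
    text_id = next_text_id
    next_text_id += 1
    text_rows_by_checksum[checksum] = text_id
    return text_id
```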

Disclaimer:
I develop user scripts for the English Wikipedia and would find a wide range of uses for this feature. On Wikipedia, identity reverts are very common; they are one of the basic operations that Wikipedians become familiar with. I am not sure whether this applies to MediaWiki installations in general, but my intuition says it is likely.


Version: unspecified
Severity: enhancement

Details

Reference
bz21860

Event Timeline

bzimport raised the priority of this task to High. Nov 21 2014, 10:53 PM
bzimport set Reference to bz21860.

matthew.britton wrote:

Such a thing would definitely be useful for recent changes monitoring (one could quickly see whether the current revision of a page is identical to a previous revision, without retrieving the whole text of the previous revision). However, since the field wouldn't be populated for older revisions, it would actually be rather useless, at least to begin with.

(In reply to comment #1)

Such a thing would definitely be useful for recent changes monitoring (one could
quickly see whether the current revision of a page is identical to a previous
revision, without retrieving the whole text of the previous revision). However,
since the field wouldn't be populated for older revisions, it would actually be
rather useless, at least to begin with.

I don't think it'd be that useless in the beginning. Most reverts revert to a version that's at most 24-48h old (experience, no actual data to back this up), so "to begin with" would be a very small timeframe.

Interesting idea. Remember that the text table can also be an external store, so any implementation needs to keep that in mind.

(In reply to comment #3)

Interesting idea. Remember that the text table can also be an external store,
so any implementation needs to keep that in mind.

The checksum field (MD5 hash of the content?) would probably be in the revision table. It would also make the optimization of reusing text rows more feasible (this probably wouldn't have too much of an impact on Wikipedia due to our usage of gzip in external storage, but it could save a lot of space on third-party wikis not using ES).

Shouldn't we maybe really split this into 2 bugs?

(In reply to comment #5)

Shouldn't we maybe really split this into 2 bugs?

Not really needed, I think. The bulk of the work to be done is in adding the field, exposing it through the API is trivial.

Why the 'revision' table, if it is just for recent changes monitoring? If we add it to 'revision', people will want it on 'recentchanges' too, and once we add it to 'recentchanges', there's not much win in having it on 'revision'... :)

(In reply to comment #6)

(In reply to comment #5)

Shouldn't we maybe really split this into 2 bugs?

Not really needed, I think. The bulk of the work to be done is in adding the
field, exposing it through the API is trivial.

True. At least it being assigned to the right components helps. And with the API people being CC'd, it can be dealt with when necessary!

OTOH, I personally think this is a really useless change: for a low chance of collisions, which can be estimated in other ways, we end up storing lots of data all around.

After more thinking, I'm marking this feature request as 'INVALID'.

Nobody has brought any discussion here yet of how useful this would be, except unsupported assumptions.

If anyone wants to reopen this bug, feel free to provide collision percentages on the projects' data, and how they deviate from 'length' collision data.

matthew.britton wrote:

(In reply to comment #10)

If anyone wants to reopen this bug, feel free to provide collision percentages
on the projects' data, and how they deviate from 'length' collision data.

I think it's more that the collision odds for a checksum are low enough (on the order of, say, 1 in 2^32, even if not actually that low) that one can confidently assume matching checksums represent the same data, which you can't do for length.

For heuristic "these might be the same revision" purposes, length is enough. For "this is definitely the same revision as revision X, therefore since we've already checked that revision, we don't have to check this one" purposes, it is not sufficient.

It would be a useful optimization to recent changes monitoring, but not that important from my perspective (at least, there are much more important issues to resolve).

I'm not sure how the conversation got steered toward recent changes monitoring, since I don't understand the usefulness of knowing when no-op edits are saved to articles, but I'd like to bring the discussion back to how a checksum could be used to computationally determine which revisions are reverts. Reverts and reverting are functions of MediaWiki and the way it manages history.

Research suggests that most reverts in the English Wikipedia are identity reverts[1] (reverts that can be detected by comparing checksums between revisions). Detecting identity reverts through the current API requires the retrieval of the complete text of the revisions which consumes appreciable disk access and network resources. With the addition of checksums, this expensive retrieval process could be forgone and writing tools that interact with MediaWiki's API would be easier and more straightforward.

As for the possibility of collision: an MD5 checksum has 16^32 (i.e. 2^128) possible values, which means that, so long as the outputs are roughly uniform, the chance that any two given revisions collide is about 1 in 2^128, and by the birthday bound[2] a collision only becomes likely after roughly sqrt(16^32) = 2^64 (about 18,446,744,073,709,552,000) distinct texts have been hashed.

  1. http://doi.acm.org/10.1145/1240624.1240698
  2. http://en.wikipedia.org/wiki/Birthday_paradox
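
To make the intended client-side use concrete, here is a rough sketch in Python against prop=revisions. The "checksum" revision property is the hypothetical field this request asks for (the API does not return it today), and the helper names are made up:

```
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def fetch_revisions(title, limit=50):
    # "checksum" in rvprop is the hypothetical property proposed in this request.
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|checksum",
        "rvlimit": limit,
        "format": "json",
    }
    with urllib.request.urlopen(API + "?" + urllib.parse.urlencode(params)) as resp:
        data = json.load(resp)
    page = next(iter(data["query"]["pages"].values()))
    return page.get("revisions", [])

def find_identity_reverts(revisions):
    """Yield (reverting revid, restored revid) pairs, by checksum equality alone."""
    seen = {}
    # prop=revisions returns newest first; walk oldest first so "seen" means older.
    for rev in reversed(revisions):
        checksum = rev["checksum"]
        if checksum in seen:
            yield rev["revid"], seen[checksum]
        else:
            seen[checksum] = rev["revid"]
```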

matthew.britton wrote:

(In reply to comment #12)

I'm not sure how the conversation got steered to recent changes monitoring
since I don't understand the usefulness of knowing when no-op edits are saved
to articles

No-op edits aren't saved at all, they have no entry in the page history.

I'd like to bring the discussion back to how a checksum could be used to
computationally determine which revisions are reverts

Which is the reason it would be useful for recent changes monitoring.

vacuum wrote:

An additional field for one or more digests would be very useful for analysis purposes. Examples are WikiDashboard-like gadgets that show who wrote the bulk of the content [1] (screenshot from Norwegian (bokmål) Wikipedia [2]), history trees showing revert activity [3], identification of the type of action, etc. This would have uses at least on the history page, the user contributions page and the recent changes page.

For some types of analysis the size of the content is sufficient, and if better identification is necessary the content of the revisions that seem similar because of their size can be downloaded; still, it would be better to have a digest. With digests it won't be necessary to branch out to special code to download additional information; the information would be readily available.

Calculating the value on the server each time it is used does not seem viable, but it could be calculated when missing and then stored in the database. Different uses may need different digests, but an MD5-like digest of the complete article and a Nilsimsa-like digest should be sufficient for most purposes. Not sure how Nilsimsa would turn out for large articles...

[1] http://blog.wikimedia.org/2010/wikidashboard-revisited/
[2] http://no.wikipedia.org/wiki/Fil:Skjermdump-history-graphic.png
[3] http://www.grouplens.org/node/427

vacuum wrote:

A solution that avoids changing the database schema is to add a digest function to the API, possibly allowing several digest functions. If this is used for identifying distinct versions within a returned set of revisions with similar sizes, the digest can be very simple. We don't have to create 2¹²⁸-ish possible values; 2⁸-ish values are more than enough if combined with the size.

The API function must get the content for the revisions from the database, but only the digest is transferred. The database request will be heavy, but serving the request to the client will not. For most uses there will never be a need to calculate the heavier hash functions.

Something like "rvdigest=pearson", perhaps also using a variant of FNV-1 or even SHA-1 or AES. It could also be interesting to use some kind of locality-sensitive hashing [1] to build systems similar to those described in [2][3].

The computationally heavier methods could be given a maximum number of revisions as a hard limit for each request, forcing tool developers to choose the computationally cheaper methods when they suffice for the purpose.

[1] http://en.wikipedia.org/wiki/Locality_sensitive_hashing
[2] http://www.grouplens.org/node/427
[3] http://doi.acm.org/10.1145/985692.985765
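
As a rough illustration of how cheap such a per-request digest could be, here is an 8-bit Pearson-style hash in Python; combined with the revision size it would be enough to tell apart same-sized revisions within a single response. The permutation table and helper names are only illustrative, not a proposed implementation:

```
import random

# Any fixed permutation of 0..255 works as the Pearson lookup table.
_TABLE = list(range(256))
random.Random(0).shuffle(_TABLE)

def pearson8(data: bytes) -> int:
    """Classic Pearson hashing: a very cheap 8-bit digest."""
    h = 0
    for byte in data:
        h = _TABLE[h ^ byte]
    return h

def revision_key(text: str) -> tuple:
    """Size plus an 8-bit digest: enough to separate same-sized revisions
    within one returned set, as suggested above."""
    data = text.encode("utf-8")
    return (len(data), pearson8(data))
```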

The API function must get the content for the revisions from the database, but
only the digest is transferred. The database request will be heavy, but serving
the request to the client will not. For most uses there will never be a need to
calculate the heavier hash functions.

This bug was closed exactly because it would put an extra load on the database, not due to bandwidth issues.

There's a definite use-case for adding a column to the recentchanges table. It's not that expensive to store an extra column.

It may not be that expensive to store the column, but eventually people will start querying by it :-)

This bug was closed exactly because it would put an extra load on the database,
not due to bandwidth issues.

John Erling Blad's proposal to generate the checksums only when requested should not significantly increase database load. In general, hashing algorithms are O(n) (linear) in complexity (very fast, even for large text chunks). The requested checksums could be cached to improve the performance, but given the speed at which checksums (even the "heavy" ones) can be generated, this benefit is likely to be small at best.

I was under the impression that this bug was closed because the developers deemed it an unimportant new feature to support and/or not worth developer time. I cannot argue with that. Database load, on the other hand, is not a good reason to close this bug.

(In reply to comment #19)

John Erling Blad's proposal to generate the checksums only when requested
should not significantly increase database load. In general, hashing
algorithms are O(n) (linear) in complexity (very fast, even for large text
chunks). The requested checksums could be cached to improve the performance,
but given the speed at which checksums (even the "heavy" ones) can be
generated, this benefit is likely to be small at best.

I think you're overlooking the fact that Wikimedia wikis (like the English Wikipedia) are exceptional. For a normal MediaWiki installation, the database load would likely be negligible; I doubt the same is true for the English Wikipedia database. It seems like a waste of resources to constantly re-compute hashes when it's trivial to store them.

vacuum wrote:

It seems like the arguments are starting to change, and it is more a discussion about whether such functionality should be allowed than about whether there are real reasons for not supplying it. Somehow stopping or hindering the additional services seems rather counterproductive, so I don't want to argue about why and how that should be done.

If the API supplies some functionality, it must be assumed that someone will use that functionality to make additional services available. In this case the base functionality is to supply the text of revisions. That text takes a long time to transfer and makes some additional services difficult to implement. It does not make them impossible to implement, but the servers take on additional load from transferring the content.

The services _are_ possible to implement and someone will build them. I have made some such services, and surely someone else will too. In this situation it is wise to carefully consider which measures can be taken to decrease the load on the system. Right now we are stuck with a solution that creates a very large load on the servers.

By calculating a digest on the servers instead of in the clients, the total load from transferring the information will be lower. If the digests are calculated on each call it would probably be wise to use digests that are easy to calculate; this still puts some strain on the database servers, as they have to serve the text of each revision. By storing the digests in the database, heavier digests can be used, even computationally heavy locality-sensitive hashing functions. All such digests will be fast to transfer, decreasing the load on the servers.

I propose that we store at least one digest in the database. It should be stored in such a way that it can be used for revert detection on the recent changes page. This should probably be something like MD5 over an uncompacted version of the content (check out string grammar if it's unknown to you). It should also be possible to use it for revisions in the API, to support services like WikiDashboard; there are probably several lists that should be adjusted accordingly. To support services like history flow we need something like a Nilsimsa digest over identifiable blocks in the content; I would propose that each paragraph gets a Nilsimsa digest. I'm not sure if this needs to be stored in the database, as it will be requested rather seldom; it can probably be calculated on the fly.

I guess at least one digest should be stored in the recent changes table for page revisions. If it lacks digests for older entries that shouldn't do much harm; the interesting thing here is to detect whether a new version is unique within a short timeframe. I would keep digests for comparison in memcached for, let's say, the last 4-5 revisions of an article. Note that this digest must be reasonably collision-free across the complete set of articles and revisions, that is, something like MD5 or another very long digest.

In the revision table at least one digest should be stored to identify revisions. Such digests can be more lightweight and use compacted versions of the content; the digests could be rather short integers. Remember that these will be used to detect similar versions, not to search the larger sets from the recent changes table.

For history flow I would compute the digests on the fly in the web server. Such computations would be rather heavy, but they would be done seldom.
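
A minimal sketch of the per-paragraph fingerprint idea in Python; hashlib's SHA-1 is only a stand-in for a Nilsimsa-style locality-sensitive digest, which would behave quite differently under small edits:

```
import hashlib

def paragraph_fingerprints(wikitext: str) -> list:
    """Return one short digest per paragraph, for history-flow-style analysis.
    A real implementation might use a Nilsimsa-like digest instead of SHA-1."""
    paragraphs = [p.strip() for p in wikitext.split("\n\n") if p.strip()]
    return [hashlib.sha1(p.encode("utf-8")).hexdigest()[:8] for p in paragraphs]
```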

There have been a few offline discussions about how MD5 sums might relate to performance, so I figured I'd better document my take here. At our current rate of editing (if I'm reading the numbers correctly), we get somewhere on the order of 10-12 million edits per month, which translates into a rate of something like 5 edits per second. On my laptop, the built-in PHP function (which is presumably what we'd use) takes somewhere on the order of 2 milliseconds to compute the MD5 sum of the American English dictionary file (approximately 931k). This is all back-of-the-envelope stuff, so I could be wildly off somewhere, but it seems like pretty negligible load given the benefit of not forcing everyone else to compute these values downstream.
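
For anyone who wants to reproduce that back-of-the-envelope figure, a quick timing sketch in Python (the ~931 kB sample mirrors the dictionary file mentioned above; absolute numbers will of course vary by machine and by the PHP implementation actually used):

```
import hashlib
import timeit

sample = b"x" * (931 * 1024)  # roughly the size of the dictionary file cited above

per_hash = timeit.timeit(lambda: hashlib.md5(sample).hexdigest(), number=100) / 100
print(f"MD5 over ~931 kB: {per_hash * 1000:.2f} ms per hash")
```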

Are you going to include it in any 'revision' index?

Domas, I don't have an opinion on whether it belongs in the index. The most important use case (including in the XML dumps) doesn't seem to require it. There are compelling use cases discussed above that would make adding it to the index interesting, but I know that's a much more expensive request. If the primary objection to adding these checksums revolves around adding the field to the index, we should split that conversation off into a separate request.

Aaron added a rev_sha1 column to the tables in r94289.

(In reply to comment #25)

Aaron added a rev_sha1 column to the tables in r94289.

r94289 and subsequent revisions reverted by Brion in r94541.

Bumping this from "low" to "high" priority. This is the first time I've noticed it stuck at low. I get asked about this very frequently at WMF, and have long wanted to provide it. Aaron is going to be tied up with a few other things, so it may not happen right away, but I'd like to make sure we get to this before we deploy 1.19.

I would very, very much like to have this field.

(In reply to comment #28)

I would very, very much like to have this field.

Agreed.

Now does that mean it's worth the cost? I think so. I know that Domas disagrees and I respect that. I think that it would be very helpful to get a better picture of what the real cost here is, particularly if the field is indexed, which is probably inevitable.

A SHA-1 digest and a locality-sensitive hash function would do the trick, but a small vector (that is, the form before turning it into a hash) is even better. A JavaScript version of something that works is available as a gadget on the Norwegian (bokmål) Wikipedia.
http://no.wikipedia.org/wiki/MediaWiki:Gadget-page-authors-simple.js

As I recall, the FNV algorithm is used in MySQL (also PHP?) and should be pretty efficient in native implementations. The rest of the code has some problems in a few very special use cases and should be fixed, but works fairly well even as it is.

Basically you need a digest and a locality-sensitive hash (or fingerprint vector) to speed up analysis of who the actual author is and what's going on in the article history.

Any action to add the new field to the API output and to the XML export?

The API modules prop=revisions and list=deletedrevs need at least a way to display the new value.

Searching for a hash could also be interesting for API clients, but that needs an index for miser mode.

So did anybody actually check to make sure we know how to deploy this change live for 1.19?

(In reply to comment #32)

Any action to add the new field to the API output and to the XML export?

The API modules prop=revisions and list=deletedrevs need at least a way to
display the new value.

Searching for a hash could also be interesting for API clients, but that needs
an index for miser mode.

Done in r106514.
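
For API clients, the new value can then be requested alongside other revision properties; a sketch in Python, assuming r106514 exposes it as a "sha1" value via rvprop on prop=revisions:

```
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "prop": "revisions",
    "titles": "Main Page",
    "rvprop": "ids|sha1",  # request the new SHA-1 value with each revision
    "rvlimit": 5,
    "format": "json",
}
with urllib.request.urlopen(API + "?" + urllib.parse.urlencode(params)) as resp:
    data = json.load(resp)
for page in data["query"]["pages"].values():
    for rev in page.get("revisions", []):
        print(rev["revid"], rev.get("sha1"))
```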