
Eliminate duplicate "text" table blobs caused by rollbacks
Closed, Resolved · Public

Description

In maintenance/tables.sql it says

CREATE TABLE /*_*/revision (...
  -- Key to text.old_id, where the actual bulk text is stored.
  -- It's possible for multiple revisions to use the same text,
  -- for instance revisions where only metadata is altered
  -- or a rollback to a previous version.
  rev_text_id int unsigned NOT NULL,

OK, it seems that currently only the "revisions where only metadata
is altered" case is taken advantage of. However, regrettably, "a
rollback to a previous version" was forgotten about.

I.e., currently when an undo on an article is done, a new row is
created in the text table in the database.

However, it is very likely that this new row will be identical to a
previous row there for the same article.

All we need to do is check whether this is the case; if so, the id of
the existing row can be placed in the revision table instead of the
id of a new row, and we needn't create the new row at all.

This would eliminate the principal source of duplicated rows in the
text table! I'm not saying you won't still need fund-raising for more
disks, but it is still a shame to see all that text duplicated when
an identical row sitting right under our nose could be reused.

All we need to do is check if this row = previous row.

Actually, when you think about sysop rollbacks, and even any edits at
all to an article, there is a significant chance that the text of
this edit (that would end up in a row) is an exact match for, say,
that of one of the last 10 edits.

That way the internals that manipulate the tables needn't be
specially told "this is an undo, double check if there is an
identical text row we could reuse".

Instead, when any edit is done, just check back through, oh, up to 10
of the text rows of its old revisions for identical text, and point
to that row if found, instead of making a new row.

We needn't search all over the text table for identical rows; we
merely need to look down our short list of very likely suspects.
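
For concreteness, here is a minimal sketch of that lookup, assuming
the 1.15-era schema from maintenance/tables.sql; the page id 123 and
the @new_text variable are placeholders for the page being edited and
the incoming text, and text.old_flags (gzip etc.) is ignored, which a
real implementation would have to honor:

-- Fetch the last 10 revisions' text rows for the page; if one holds
-- exactly the incoming text, reuse its old_id as the new rev_text_id.
SELECT old_id
  FROM (SELECT t.old_id, t.old_text
          FROM revision r
          JOIN text t ON t.old_id = r.rev_text_id
         WHERE r.rev_page = 123
         ORDER BY r.rev_timestamp DESC
         LIMIT 10) AS recent
 WHERE old_text = @new_text
 LIMIT 1;

If a row comes back, its old_id goes into rev_text_id and no new text
row is inserted.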

Yes, we'll never catch the savings of detecting that blanked page A =
blanked page B, as they are not revisions of the same page, but on
the other hand we needn't search further than our short list of
likely suspects.

I tried to make a patch to implement all this, but I only got as far
as Article.php and then got lost. If you want you can assign this bug
to me and after a few weeks I will figure it out (if I don't end up
breaking something else).

By the way, here are some views of the duplicated text one can try:

# Histogram: for each duplication count, how many distinct texts
# occur that many times. (Caveat: multi-line texts get split into
# lines by this view.)
$ echo "SELECT old_text FROM text;"|
mysql --default-character-set=binary MyDB -N|
perl -nwle '$h{$_}++;END{for(keys %h){print $h{$_}}}'|
sort -nr|uniq -c

# Show each duplicated text with its copy count (first 77 characters
# of each line), most duplicated first; stop at the first
# unduplicated one.
$ echo "SELECT old_text FROM text;"|
mysql --default-character-set=binary MyDB -N|
sort|uniq -c|sort -nr|
perl -C -anwle 'exit if $F[0]==1;/.{0,77}/;print $&;'
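
An in-database equivalent of these views might be the following
sketch; grouping by MD5 here is just to sidestep MySQL truncating
direct blob comparisons at max_sort_length:

-- For each distinct text, how many copies exist and which row came
-- first (the row all the duplicates could point to).
SELECT COUNT(*) AS copies, MIN(old_id) AS first_id
  FROM text
 GROUP BY MD5(old_text)
HAVING COUNT(*) > 1
 ORDER BY copies DESC;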

Version: 1.15.x
Severity: enhancement

Details

Reference
bz18333

Event Timeline

bzimport raised the priority of this task to Lowest. Nov 21 2014, 10:35 PM
bzimport set Reference to bz18333.
bzimport added a subscriber: Unknown Object (MLST).

The above will stop new duplication, and it would be a shame not to
implement it.

But what about all the years and years of current duplication already
existing in one's text table?

Should there be a program in maintenance/ available to squeeze it out?

Should it also be run by update.php? Or just once in a wiki's
lifetime? Or just by interested parties who feel the need?

That program would squeeze out duplicates by:
{for each page {go down its list of revisions, making duplicate
pointers point to their first}}, then run purgeOldText.php.
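
A rough SQL rendition of that per-page step (a sketch only: the 1.15
schema is assumed, page id 123 is a placeholder, old_flags is
ignored, and the MD5 grouping is again just to avoid truncated blob
comparisons; a careful script would still string-compare the
finalists):

-- Repoint each revision of the page whose text duplicates an earlier
-- revision's text to the earliest matching text row.
UPDATE revision r
  JOIN text t ON t.old_id = r.rev_text_id
  JOIN (SELECT MD5(t2.old_text) AS h, MIN(t2.old_id) AS first_id
          FROM revision r2
          JOIN text t2 ON t2.old_id = r2.rev_text_id
         WHERE r2.rev_page = 123
         GROUP BY MD5(t2.old_text)) AS d
    ON d.h = MD5(t.old_text)
   SET r.rev_text_id = d.first_id
 WHERE r.rev_page = 123
   AND r.rev_text_id <> d.first_id;

After that, purgeOldText.php deletes the text rows nothing references
anymore.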

One needn't go to "SHA1 mapping to unbloat the text table"
(http://lists.wikimedia.org/pipermail/wikitech-l/2009-March/042373.html)
extremes.

However, perhaps we needn't restrict our thinking to a per-article
paradigm, but instead just consider the whole revision->text table
mapping. Maybe that would be a simpler and smarter way to do this. We
would thus only involve two tables... (wait, we must consider all
tables that have any mapping to the text table, such as archive! Also
all this must probably be done with the wiki locked, though it would
probably only take a few seconds for a small wiki.)

E.g., running our shell/perl scripts above, we find 279 separate
pointers to blank (0 byte, vandalism) article revisions. These could
all be made to point to a single text row, even though they are not
of the same article.
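
The blank case alone could be collapsed with something like this
sketch (assuming the blanks are stored uncompressed, i.e. old_text
really is the empty string):

-- Pick one canonical empty text row, then point every revision of a
-- blanked page version at it.
SET @blank := (SELECT MIN(old_id) FROM text WHERE old_text = '');
UPDATE revision r
  JOIN text t ON t.old_id = r.rev_text_id
   SET r.rev_text_id = @blank
 WHERE t.old_text = ''
   AND r.rev_text_id <> @blank;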

Created attachment 5997
script to squeeze text table and reset pointers to it

The neat thing is that for many of our checks we needn't go beyond
the revision table, as the size of the text is stored there
(rev_len). So if the sizes do not match, there is no need to haul the
text out of the text table for a string compare.
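
For instance, a sketch of shortlisting candidates by length alone,
never touching a blob:

-- Only (page, length) pairs that occur more than once can possibly
-- contain duplicate text; everything else is skipped outright.
SELECT rev_page, rev_len, COUNT(*) AS candidates
  FROM revision
 GROUP BY rev_page, rev_len
HAVING COUNT(*) > 1;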

OK, I have achieved squeezing _current_ duplicates out of the text
table and resetting their pointers! See attached squeeze_example.txt.

One might squeeze out current duplicates, then sit back and see where
any new ones are arriving from, and then address each case...

(In reply to Dan Jacobson from comment #2)

Created attachment 5997 [details]
script to squeeze text table and reset pointers to it

Dan: If you're still interested in getting this into MediaWiki, would you be interested in turning this into a patch in Gerrit? See https://www.mediawiki.org/wiki/Gerrit/Tutorial and https://www.mediawiki.org/wiki/Developer_access

Gee that was five years ago.
Looking back, boy, I must have been a wiz kid.
But now I am not that smart.
So I'll just leave whatever it is up to you guys.

Krinkle renamed this task from "eliminate principal source of
duplicated text table rows: unchecked undos" to "Eliminate duplicate
"text" table blobs caused by rollbacks". Jul 28 2018, 10:22 PM
Krinkle claimed this task.
Krinkle moved this task from Untriaged to Usage problem on the MediaWiki-libs-Rdbms board.
Krinkle subscribed.

I won't speak for edits that happen by chance to match the text of a
previous revision, but at least the "rollback" feature does re-use
previous text blobs nowadays. I don't think the other cases are worth
optimising at this point.

  • 22:21, 28 July 2018 . . (326 bytes) (+48) . . (Tag: Rollback)
  • 22:21, 28 July 2018 . . (278 bytes) (-48) . .
  • 21:29, 19 July 2018 . . (326 bytes) (+13) . .
wikiadmin@10.68.18.35[enwiki]> SELECT * FROM revision WHERE rev_page=191808 ORDER BY rev_timestamp DESC LIMIT 3;
+--------+-------------+---------+---------------+
| rev_id | rev_text_id | rev_len | rev_parent_id |
+--------+-------------+---------+---------------+
| 382686 |      430067 |     326 |        382685 |
| 382685 |      430694 |     278 |        382498 |
| 382498 |      430067 |     326 |        381803 |
+--------+-------------+---------+---------------+
3 rows in set (0.01 sec)

Note that the rollback revision (rev_id 382686) points at the same
text row (rev_text_id 430067) as the earlier revision 382498: the
blob was indeed reused.