In maintenance/tables.sql it says
CREATE TABLE /*_*/revision (... -- Key to text.old_id, where the actual bulk text is stored. -- It's possible for multiple revisions to use the same text, -- for instance revisions where only metadata is altered -- or a rollback to a previous version. rev_text_id int unsigned NOT NULL,
OK, it seems currently only the for instance revisions where only metadata is altered is taken advantage of. However, regrettably, or a rollback to a previous version. was forgotten about.
I.e., currently when an undo on an article is done, a new row is created
in the page table in the database.
However it is very likely that this new row will be identical to a
previous row there for the same article.
All we need to do is double check if this is true, and then the id of
the previous row could be placed in the revision table, instead of the
id for the new row, and we needn't create the new row.
This would eliminate the principal source of duplicated rows in the text
table! I'm not saying that you won't still need to fund raising for more
disks anymore, but it is still a shame to see all that text duplicated
when an identical row sitting right under our nose could be reused.
All we need to do is check if this row = previous row.
Actually when you think about sysop rollbacks and even any edits at all
to an article, there is a significant chance that the text (that would
end up in a row) of this edit is an exact match for that of say that of
one of the last 10 edits.
That way the internals that are manipulating the tables needn't be
especially be told "this is an undo, double check if there is an
identical text row we could reuse".
Instead, when any edit is done, just check back for oh, up to 10 of the
text rows of its old revisions for identical text, and point to that row
if found, instead of making a new row.
We needn't search all over the text table for identical rows, we merely
need to look down our short list of very likely suspects.
Yes, we'll never catch the savings of detecting that blanked page A =
blanked page B, as they are not revisions of the same page, but on the
other hand we needn't search further than out short list of likely
suspects.
I tried to make a patch to implement all this, but I only got as far as
Article.php and then got lost. If you want you can assign this bug to me
and after a few weeks I will figure it out (if I don't end up breaking
something else.)
By the way, here are some views of the duplicated text one can try
$ echo "SELECT old_text FROM text;"| mysql --default-character-set=binary MyDB -N| perl -nwle '$h{$_}++;END{for(keys %h){print $h{$_}}}'| sort -nr|uniq -c $ echo "SELECT old_text FROM text;"| mysql --default-character-set=binary MyDB -N| sort|uniq -c|sort -nr| perl -C -anwle 'exit if $F[0]==1;/.{0,77}/;print $&;'
Version: 1.15.x
Severity: enhancement