Run maintenance/populateRevisionLength.php on all WMF wikis
Closed, ResolvedPublic

Description

Author: snunes

Description:
The "size" (bytes) of each revision is not available in revisions prior to the introduction of this feature.
Please recompute all the missing "size" values on the entire Wiki. I notice this while using the API to extract the size of old revisions from Wikipedia.
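
(For reference, the gap is visible with an ordinary prop=revisions query; the page title below is only an example:

    https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=ids|timestamp|size&rvlimit=5&rvdir=newer&titles=Example

Revisions whose length was never stored show up with a missing or zero "size", while newer revisions report the real byte count.)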


Version: unspecified
Severity: normal

Details

Reference
bz12188

Event Timeline

bzimport raised the priority of this task to High. Nov 21 2014, 10:00 PM
bzimport set Reference to bz12188.

Bryan.TongMinh wrote:

Probably we should compute rev_size on demand if it isn't available and update the database?

snunes wrote:

@Bryan Tong Minh:

I think it is a good option. Eventually all articles would be updated, and it would work in all MediaWiki installations.
However, it might be easier to run an update script to solve the issue.
In the end, it's up to the sysadmins to decide :)

Please vote the bug up so that it gets fixed.

mike.lifeguard+bugs wrote:

Removed shell keyword, as there's nothing to do on shell yet - a maintenance script would have to be written I guess?

Bryan.TongMinh wrote:

Proposed patch. I have no opportunity to test it at the moment.

attachment patch.txt ignored as obsolete

  • Bug 17421 has been marked as a duplicate of this bug.

On-demand size expansion could potentially be expensive for interactive use (e.g. pulling up a history view with a few hundred revisions); it also wouldn't handle cases like Special:Longpages / Special:Shortpages generation.

Bug 18881 covers a request for a maintenance script to update them in bulk.

A maintenance script to populate rev_len for old revisions was added in r63221. Now someone just needs to run it. Adding back "shell" keyword, removing "patch" and "need-review".
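
(On a stock MediaWiki installation the script is run roughly like this; the wrapper WMF uses to point it at a particular wiki is omitted here:

    php maintenance/populateRevisionLength.php

It walks the revision table in batches and fills in rev_len wherever it is missing.)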

jeluf wrote:

Running right now...

jeluf wrote:

Completed for all wikis but enwiki.

650 million records reviewed so far.

enwiki has some 350 million more that need to be reviewed, and the process runs much, much slower on enwiki than on any other wiki.

jeluf wrote:

About 600-700 hours to go...

happy.melon.wiki wrote:

(In reply to comment #11)

About 600-700 hours to go...

It takes longer than the dump script?!? That's slightly bizarre. And an average of less than 2 revs/sec? Is it spending most of its time waiting for slaves or something?

jeluf wrote:

enwiki ... doing rev_id from 43'814'001 to 43'816'000 (total: 363'892'360)

in 14.42 seconds, estimated completion in 641.3 hours

2000 revisions in 14 seconds is the current speed. It will become faster at the end when the revisions already have rev_len set.
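
(As a cross-check on that figure: (363,892,360 − 43,816,000) remaining revisions × 14.42 s / 2,000 revisions per batch ≈ 2,307,750 s ≈ 641 hours, which matches the script's own estimate.)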

(In reply to comment #12)
I haven't checked, but my guess would be that the dump scripts know how to deal efficiently with Wikimedia's peculiar ways of storing article text. populateRevisionLength.php doesn't, it just fetches the text of one revision at a time, which is probably not optimal when multiple revisions are stored in a single compressed blob. I'll let others who know more about the storage backend stuff decide whether optimizing it somehow would be worth the effort, given that it's never going to be run on Wikimedia wikis again.

jeluf wrote:

The ticket has been open for about 30 months. I think it can wait another month for completion...

The dump scripts read old revisions from previous dump files; that's why they are faster. When we run them without the so-called prefetch option, they are very slow, much slower than the populateRevisionLength script. We are seeing about 38 revs/second.

happy.melon.wiki wrote:

(In reply to comment #16)

The dump scripts read old revisions from previous dump files; that's why they are faster. When we run them without the so-called prefetch option, they are very slow, much slower than the populateRevisionLength script.

Slightly OT, but is that why getting the dump process going again was so painful? Because the longer it went without a complete dump, the more work was needed to generate the next one?

Well, it's a bit worse than that: the revision text content of the previous dumps is suspect, so we are running the dumps without prefetch. This is extremely painful. Once the revision length is populated in the db, we can compare it against the length of the revision in the previous XML file, and if they don't match we can refresh from the db. This will be a big improvement over the current approach.

My rough back-of-the-napkin calculations indicate that since rev_len started to be populated around rev_id 124 million, and the script has processed up to about 51 million revs so far in ascending order, at 2000 revs per 14 secs it should catch up in about 6 days. Do bear in mind though that I can't add :-P
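
(The arithmetic behind that estimate: (124,000,000 − 51,000,000) revisions × 14 s / 2,000 revisions per batch ≈ 511,000 s ≈ 5.9 days.)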

So we're at around revision number 115 600 000 now, and it's taking about 26 seconds for 2000 revs. This extends my estimate. If we see no further slowdowns we will catch up in two days (i.e. by Thursday June 3 at this time).

Columns in archive tables are not fully populated currently.

(In reply to comment #22)

Columns in archive tables are not fully populated currently.

I'm closing this again as "fixed." This bug was about easily retrieving the size of current revisions (from the revision table). It seems reasonable to have a separate script (or an option in populateRevisionLength.php) to calculate the lengths of deleted revisions (in the archive table), but that's a distinct issue and should be filed (if it isn't already) separately.
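
(A quick way to gauge how many deleted revisions are affected, sketched here as a plain query rather than anything the existing script does:

    select count(*) from archive where ar_len is null;

ar_len is the archive-table counterpart of rev_len.)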

Per a request on #wikimedia-tech

mysql:wikiadmin@db1019 [lbwiki]> select rev_id, rev_user, rev_page, rev_deleted, rev_len, rev_timestamp from revision where rev_id = 185751;
+--------+----------+----------+-------------+---------+----------------+
| rev_id | rev_user | rev_page | rev_deleted | rev_len | rev_timestamp  |
+--------+----------+----------+-------------+---------+----------------+
| 185751 |     5808 |    83446 |           0 |    NULL | 20061203231418 |
+--------+----------+----------+-------------+---------+----------------+
1 row in set (0.03 sec)

I'm just re-running the whole script on lbwiki, won't take long. Let's see the number of rows it reckons it has set/fixed

(In reply to comment #24)

I'm just re-running the whole script on lbwiki, won't take long. Let's see the number of rows it reckons it has set/fixed

...doing rev_id from 1520601 to 1520800
rev_len population complete ... 89 rows changed (0 missing)

89/1520800 = 0.00585% unpopulated.

For sanity, I'm going to run it over all wikis with --force to clean up any stragglers that may be lying around.

(In reply to comment #25)

For sanity, I'm going to run it over all wikis with --force to clean up any stragglers that may be lying around.

Does this cover the archive table?

My suspicion was that these were archive.ar_len rows that were never populated and the revisions got re-inserted into the revision table at some point (probably through page undeletion). Maybe.

mzmcbride@willow:~$ sql lbwiki_p;
mysql> select * from logging where log_page = 83446\G

*************************** 1. row ***************************
       log_id: 57108
     log_type: delete
   log_action: restore
log_timestamp: 20110331152146
     log_user: 120
log_namespace: 0
  log_deleted: 0
log_user_text: Otets
    log_title: Stéier_(Astrologie)
  log_comment: 5 Versioune goufe restauréiert: Läschgrond gëtt eliminéiert wou existent
   log_params:
     log_page: 83446
1 row in set (0.00 sec)

Beep boop.

(In reply to comment #23)

(In reply to comment #22)

Columns in archive tables are not fully populated currently.

I'm closing this again as "fixed." This bug was about easily retrieving the size of current revisions (from the revision table). It seems reasonable to have a separate script (or an option in populateRevisionLength.php) to calculate the lengths of deleted revisions (in the archive table), but that's a distinct issue and should be filed (if it isn't already) separately.

Okay, this is being covered by bug 24538 and bug 46183.