
Several history entries for the same content and watchlists showing wrong article size changes
Closed, Resolved · Public

Description

http://fr.wikipedia.org/w/index.php?title=Neil_Armstrong&action=history shows two manual (script-assisted) reverts to the same edit, in revisions 79252172 and 79252174.

This shouldn't be possible. Either Le pro du 94 :) should have got an edit conflict, or it should have been silently skipped without getting a history entry.

This could be something serious (such as ChronologyProtector not working, or the code no longer running in a transaction) or trivial (such as the JS script breaking edit atomicity).

rev_id 79252174 by Le pro du 94 :) has a rev_parent_id of 79252172 (Superjuju10 edit) which suggests a bug in our code.

The text of 79252172 and 79252174 are equal. After-the-fact editing of one doesn't override the other (as expected). Both show a rev_sha1 of o2dxb378cwej8zv7lx5y1p0wk87oduo.


Version: 1.20.x
Severity: major

Details

Reference
bz37225

Event Timeline

bzimport raised the priority of this task to Unbreak Now! on Nov 22 2014, 12:29 AM.
bzimport set Reference to bz37225.

I don't think ChronologyProtector is involved here or that the script[1] could cause it by itself. We should dig into EditPage/Article for this race condition.

1- http://fr.wikipedia.org/wiki/MediaWiki:Gadget-LiveRC.js

We seem to have a good number of these double-entries, usually by the same user (even by bots!).

Find a sample below:
https://en.wikipedia.org/w/index.php?diff=495195923&oldid=495195868
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495187022
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495187432
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495188594
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495188774
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495188789
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495188800
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495190963
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495186485
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495185912
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495183490
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495182487
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495182476
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495182292
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495182160
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495182128
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495181345
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495181176
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495181078
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495181009
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495180628
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495179602
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495176201
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495176177
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495173896
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495172714
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495171495
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495171337
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495171041
https://en.wikipedia.org/w/index.php?diff=prev&oldid=495170956

You can't get an edit conflict with yourself, the subsequent edit just overwrites the previous edit. This allows the user to click "back" after a save to return to the edit form, in order to amend an edit.

Fair, but you can't create two edits for the same content either (ignoring log actions as moves, protect...).

I tried to reproduce the race with maintenance/edit.php but had no luck so far. A lagged slave might be needed.

This appears to be due to r79518, when essential pre-save consistency checks were modified to use the slave DB instead of the master.

I see the line:

$this->loadPageData( 'fromdbmaster' );

...before the text is loaded in WikiPage::doEdit(). It uses the page_latest field from the master to get the revision/text. That text is compared to the incoming text to see if anything changed. In fact, wrapping all of this, EditPage has a loadPageData( 'forupdate' ) call in internalAttemptSave(). It doesn't seem like it should be easy for this to happen.

There is a "$dbw->begin( __METHOD__ );" right after $change is checked in doEdit(). This will release the FOR UPDATE lock. In theory, another process can save an edit after that and cause a race.

Those transaction-breaking begins...

But I don't think it matters. You enter updateRevisionOn(), which is passed $oldid and uses it as WHERE page_latest=$oldid, so if another process saved an edit in the middle, updateRevisionOn() returns false and the edit would be rolled back.
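That WHERE page_latest=$oldid guard is an optimistic compare-and-swap. A minimal sketch of the idea in Python with SQLite (the schema and function name are invented for illustration, not MediaWiki's actual code):

```python
import sqlite3

# Toy page table: one row tracking the latest revision ID.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_latest INTEGER)")
db.execute("INSERT INTO page VALUES (1, 100)")

def update_revision_on(db, page_id, old_latest, new_latest):
    # Bump page_latest only if nobody changed it since we read it
    # (the WHERE page_latest = $oldid race guard).
    cur = db.execute(
        "UPDATE page SET page_latest = ? WHERE page_id = ? AND page_latest = ?",
        (new_latest, page_id, old_latest),
    )
    return cur.rowcount == 1  # False means another edit won the race

assert update_revision_on(db, 1, 100, 101) is True   # first save wins
assert update_revision_on(db, 1, 100, 102) is False  # stale $oldid is rejected
```

The guard only helps when $oldid itself is stale; as discussed further down, the bug occurs when $oldid is fresh but the cached text is not.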

richardg_uk wrote:

Ongoing issue since 2012-05-30 at http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Diff.2Fwatchlist_difficulties (and other sections on the same page).

Related to bug 36566 ?

I am also seeing this at Wikiquote. The spurious edits are reported on Recent Changes with a size of +<article size>.
I vote to prioritize fixing this because I am a regular Recent Changes watcher and the error gives the strong but misleading appearance of pattern vandalism.

richardg_uk wrote:

This still seems to be a frequent occurrence, given the continuing reports at enwiki:VPT.

Would a database query help by identifying the precise time on 2012-05-30 that these duplicate entries (appearing like saved null edits) began appearing?

The time could then be compared with the recent server admin activity at http://wikitech.wikimedia.org/view/Server_admin_log#May_30

It looks like WikiPage::loadPageData(), which calls loadFromRow(), does not clear mLastRevision, which includes a process cache of the revision text. So, AFAIK, if I do something like:

$wp = WikiPage::factory( $title );
$text1 = $wp->getRawText(); // loads text from slave
// ...wait...
$wp->loadPageData( 'fromdbmaster' ); // loads page row from master
$text2 = $wp->getRawText(); // returns same value cached in mLastRevision

$text2 will equal $text1 even if an edit was done in the meantime.

WikiPage::doEdit() relies on this not being the case. Otherwise, "$oldid = $this->getLatest();" will get the true latest rev ID but "$oldtext = $this->getRawText();" will be stale, so $changed could be erroneously true, causing a new revision insertion. The updateRevisionOn() race check will not detect any races because $oldid *is* the latest and not some stale value.
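A toy model of the stale-cache failure mode described above (plain Python; the names mirror WikiPage for illustration only and this is not actual MediaWiki code):

```python
class FakePage:
    """Toy page object whose revision text is process-cached."""
    def __init__(self, store):
        self.store = store          # shared "database": {'latest': id, 'text': str}
        self.latest = None
        self._cached_text = None    # stands in for mLastRevision

    def load_page_data(self):
        self.latest = self.store['latest']
        # The bug being described: the cached revision is NOT cleared here.

    def get_raw_text(self):
        if self._cached_text is None:
            self._cached_text = self.store['text']
        return self._cached_text

store = {'latest': 100, 'text': 'old text'}
wp = FakePage(store)
wp.load_page_data()
text1 = wp.get_raw_text()      # 'old text'

# Another process saves an edit in the meantime.
store['latest'] = 101
store['text'] = 'new text'

wp.load_page_data()            # reloads the page row, keeps the stale cache
oldid = wp.latest              # 101 -- the TRUE latest revision ID
oldtext = wp.get_raw_text()    # still 'old text' -- stale

# doEdit() would now see $changed as true even for a resubmit of 'new text',
# and the page_latest == $oldid race check cannot catch it.
assert oldid == 101 and oldtext == 'old text'
```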

*** Bug 36566 has been marked as a duplicate of this bug. ***

(In reply to comment #5)

This appears to be due to r79518, when essential pre-save consistency checks
were modified to use the slave DB instead of the master.

r79518 was present in 1.18, and the reports started on the 30th of May, which points to the 1.20wmf4 deployment on the 28th of May.

From the common merge base of 1.20wmf3 and 1.20wmf4 (50927) to 1.20wmf4 (5e5773), git log 50927..origin/wmf/1.20wmf4 lists 248 commits to review.

Filtering to changes to Article.php / WikiPage.php shows only the "Generalizing LinksUpdate" changes (a12ce17, fa132be) as candidates. Still, I don't see anything that stands out as wrong there. They are all changes to link updates done *after* an edit.

As a side note: I think it would be helpful to put all database actions needed when saving a page into SqlUpdate objects, and then perform all updates using a single call to DataUpdate::runUpdates(). This way, all updates are performed in a single transaction, and it would be much easier to understand which updates are needed in which order.
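A rough sketch of that proposal, assuming invented SqlUpdate/run_updates names and a toy SQLite schema (the real DataUpdate API is more involved):

```python
import sqlite3

class SqlUpdate:
    """One deferred database action needed when saving a page."""
    def __init__(self, sql, params=()):
        self.sql, self.params = sql, params
    def do_update(self, db):
        db.execute(self.sql, self.params)

def run_updates(db, updates):
    """Run all queued updates in a single transaction; roll back on failure."""
    try:
        with db:  # sqlite3 connection as context manager = one transaction
            for u in updates:
                u.do_update(db)
        return True
    except sqlite3.Error:
        return False

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_latest INTEGER)")
db.execute("INSERT INTO page VALUES (1, 100)")

ok = run_updates(db, [
    SqlUpdate("UPDATE page SET page_latest = 101 WHERE page_id = 1"),
])
assert ok and db.execute("SELECT page_latest FROM page").fetchone()[0] == 101
```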

It's worth noting that one of the symptoms of the problem is that rc_old_len is incorrectly filled in as 0 for these edits.

If there's not going to be a full fix for preventing the edits, at least this could be resolved; rc_old_len is used for the diff sizes on watchlists and recentchanges.

(In reply to comment #14)

As a side note: I think it would be helpful to put all database actions needed
when saving a page into SqlUpdate objects, and then perform all updates using a
single call to DataUpdate::runUpdates(). This way, all updates are performed in
a single transaction, and it would be much easier to understand which updates
are needed in which order.

The links updates need to be done in a separate transaction from other updates.

Priority set to highest (This bugs "affects a number of users and impedes their use of the site" on "Wikimedia's servers in production".)

(In reply to comment #11)

It looks like WikiPage::loadPageData(), which calls loadFromRow(), does not
clear mLastRevision, which includes a process cache of the revision text. So,

AFAIK, if I do something like:

https://gerrit.wikimedia.org/r/#/c/10722/1 should fix this.

I did some UDP logging for when that staleness occurs; so far, each of the few log entries in fact coincides with a duplicate edit:

Some spaces where the issue is discussed:

Two consecutive days on the French village pump:

Already noted on the English Wikipedia village pump:
http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Diff.2Fwatchlist_difficulties

Discussions also occurred a little on the #wikipedia-fr channel, and quite a bit on #wikipedia-fr-liverc (a patrolling channel). In a nutshell, they say patrolling is slower, as they have to check plenty of void/duplicated edits.

Aaron's fix *should* fix this no later than the 1.20wmf5 deployment (Monday for en.wikipedia.org, and Wednesday for *.wikipedia.org). It would probably be a good idea to backport this to 1.20wmf4.

Aaron is not 100% confident that his fix will entirely fix this, which is the point of the logging. We'll keep an eye on this.

richardg_uk wrote:

Example null-change edit to [[Belfast]]:

http://en.wikipedia.org/w/index.php?title=Belfast&diff=498122258&oldid=498122204

The above edit was made 2012-06-18T04:03:02‎ (UTC), duplicating the 2012-06-18T04:02:27‎ (+880bytes) of a vandalism edit made 2012-06-18T04:01:58 (-880bytes).

The 04:03:02 edit is currently shown on a watchlist as incorrectly having +880 bytes compared with the 04:02:27 edit (even though the page text is identical).

The zero byte difference is similarly shown incorrectly as +880 bytes on [[Special:RecentChangesLinked]]: http://en.wikipedia.org/w/index.php?title=Special:RecentChangesLinked&from=20120618040157&target=Talk%3ABelfast (and presumably on [[Special:RecentChanges]]).

The edit is correctly shown on the revision history as having zero byte difference: http://en.wikipedia.org/w/index.php?title=Belfast&action=history

My guess is that the 1.20wmf4 rollout in the past few hours may have corrected the byte-difference in the history page but has failed to correct the differences on watchlists and recentchange lists.

Of course, the above example (being made prior to this evening's rollout) gives no indication of whether wmf4 is now successfully preventing further zero-edits!

(In reply to comment #21)

My guess is that the 1.20wmf4 rollout in the past few hours may have corrected
the byte-difference in the history page but has failed to correct the
differences on watchlists and recentchange lists.

Of course, the above example (being made prior to this evening's rollout) gives
no indication of whether wmf4 is now successfully preventing further
zero-edits!

1.20wmf4 was rolled out to the English Wikipedia on the 4th. Today 1.20wmf5 was rolled out to that wiki. See https://www.mediawiki.org/wiki/MediaWiki_1.20/Roadmap#Schedule_for_the_deployments for more information.

richardg_uk wrote:

(In reply to comment #22)

1.20wmf4 was rolled out to the English Wikipedia on the 4th. Today 1.20wmf5 was
rolled out to that wiki. ...

Sorry, "wmf4" was my typo (in both places), should read "wmf5".

To clarify: the Belfast null-edit was made prior to today's
wmf5 rollout, but the changelist byte discrepancies are as shown now (after the rollout).

So, whether or not wmf5 has prevented future null-edit saves, it is not listing the byte differences correctly on change lists (other than on the history page).

The watchlist gets the length diff from the recentchanges table, so it will show the same values (correct or invalid).
The values stored in recentchanges were copied at edit time. The release of 1.20wmf5 is expected to fix the bug, but it won't fix the old wrong data, so this isn't evidence of 1.20wmf5 wrongdoing.

richardg_uk wrote:

I can now confirm that the wmf5 rollout on enwiki has NOT prevented new duplicate entries (saved null edits).

There are 4 consecutive identical edits to [[Croatia national football team]] by [[User:86.182.161.144]], at 2012-06-18T20:45:02‎, 20:45:48‎, 20:46:38 and 20:47:14‎ (UTC), e.g.:
http://en.wikipedia.org/w/index.php?title=Croatia_national_football_team&diff=next&oldid=498236646

The first edit is correctly listed as +11 bytes. The subsequent 3 edits (identical in content to the first) are incorrectly listed at [[Special:RecentChanges]], [[Special:RecentChangesLinked]] and [[Special:Watchlist]] as each adding the entire length of the article (+112,385 bytes, listed 3 times).

The byte changes are correctly listed on the history tab (as +11, 0, 0, 0, respectively).

But the underlying problem is that null changes should not appear at all, so the fix did not work.

From the logging, I suspect that there is a parallel bug somewhere that also causes this problem. Only one cause has been fixed.

richardg_uk wrote:

Even freakier, enwiki:VPT allowed a NON-null edit conflict 6 hours ago.

The history tab gives the following details:

First, an archive bot correctly removed several sections:

  • 2012-06-19T06:40:10‎ (UTC) MiszaBot_II (198,571 bytes) (-10,318):

http://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&diff=498298525&oldid=498279415

Then, an editor's new comment was saved, but with the archived sections reinserted:

  • 2012-06-19T06:40:16‎ Hoo_man (209,033 bytes) (+10,462):

http://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&diff=498298534&oldid=498298525

Proof that this is a technical inconsistency, not just a manually-overridden edit-conflict:

  • In Special:Watchlist and Special:RecentChangesLinked (and presumably also Special:RecentChanges), the second edit is recorded as having a delta of +209,033 bytes; in other words, it compares the new edit with zero instead of with the byte count of the previous edit.

Fortunately, the RecentChanges byte-delta bug proves that this was caused by MediaWiki and not by the editor! So the minor bug is at least evidencing the major one.

What are some examples of duplicate edits on 6/19/12 or later?

supertoff4516 wrote:

(In reply to comment #28)

What are some examples of duplicate edits on 6/19/12 or later?

That ?

http://fr.wikipedia.org/w/index.php?title=New_Edmonton_Arena&action=history

(In reply to comment #29)

(In reply to comment #28)

What are some examples of duplicate edits on 6/19/12 or later?

That ?

http://fr.wikipedia.org/w/index.php?title=New_Edmonton_Arena&action=history

fr wasn't upgraded until today, en was upgraded Monday.

supertoff4516 wrote:

(In reply to comment #30)

(In reply to comment #29)

(In reply to comment #28)

What are some examples of duplicate edits on 6/19/12 or later?

That ?

http://fr.wikipedia.org/w/index.php?title=New_Edmonton_Arena&action=history

fr wasn't upgraded until today, en was upgraded Monday.

ok

richardg_uk wrote:

A current example from enwiki:

Duplicate edits of [[Wilhering Abbey]] by [[User:193.171.131.230]]:

  • 2012-06-21T12:51:53 UTC (+9 bytes)
  • 2012-06-21T12:51:55 UTC (+19,493 bytes) = a saved null edit

http://en.wikipedia.org/w/index.php?title=Wilhering_Abbey&curid=4739545&diff=498658206&oldid=498658203

Using ddsh, I can see that all servers are using the same version of WikiPage.php, so it can't be servers with stale code.

(In reply to comment #33)

Using ddsh, I can see that all servers are using the same version of
WikiPage.php, so it can't be servers with stale code.

Gah, spoke too soon. Servers 190, 206, 266, 281 are not reachable. They are all still in the apache pybal group, and thus are receiving requests. It's possible they are running stale code. Srv281 may also be causing trouble with a swift config change for the same reason.

(In reply to comment #34)

Gah, spoke too soon. Servers 190, 206, 266, 281 are not reachable. They are all
still in the apache pybal group, and thus are receiving requests. It's possible
they are running stale code. Srv281 may also be causing trouble with a swift
config change for the same reason.

These have actually already been disabled and don't seem to be taking traffic from LVS according to Tim. So that looks ruled out as a cause after all.

Logging confirms a second bug that can cause this. In doEdit(), "$changed = ( strcmp( $text, $oldtext ) != 0 );" doesn't account for the rtrim() call on the text in the Revision constructor. If the pre-trimmed text is used for comparison but the trimmed text is inserted, you could have duplicate edits. However, none of the logged instances resulted in duplicated edits so far (since the new text had changes other than whitespace in the edits):
2012-06-22 12:39:18 mw19 trwiki: Rev text trimmed for page 1469915
2012-06-22 13:06:08 mw35 trwiki: Rev text trimmed for page 970567
2012-06-22 13:45:02 srv288 enwiki: Rev text trimmed for page 26204397
2012-06-22 15:11:55 mw24 enwiki: Rev text trimmed for page 26204397
2012-06-22 15:40:52 mw24 enwiki: Rev text trimmed for page 26204397
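The trim mismatch can be illustrated with a few lines of Python (rtrim here stands in for PHP's rtrim(); the example strings are made up):

```python
def rtrim(s):
    # Mimics PHP rtrim() with default charlist: strip trailing whitespace.
    return s.rstrip()

stored_text = "some wikitext"        # what the Revision constructor saved
incoming_text = "some wikitext \n"   # same content plus trailing whitespace

# doEdit() compared the *untrimmed* incoming text...
changed = (incoming_text != stored_text)     # True -> a new revision is made

# ...but the Revision constructor stores the *trimmed* text, so the
# two revisions end up byte-for-byte identical: a duplicate edit.
assert rtrim(incoming_text) == stored_text
assert changed
```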

*** Bug 37829 has been marked as a duplicate of this bug. ***

(In reply to comment #32)

A current example from enwiki:

Duplicate edits of [[Wilhering Abbey]] by [[User:193.171.131.230]]:

  • 2012-06-21T12:51:53 UTC (+9 bytes)
  • 2012-06-21T12:51:55 UTC (+19,493 bytes) = a saved null edit

http://en.wikipedia.org/w/index.php?title=Wilhering_Abbey&curid=4739545&diff=498658206&oldid=498658203

Thanks. Do we have a list of instances that came up (like a user subpage or something)? I added some more logging and would like to see if it corresponds to new problems.

richardg_uk wrote:

(In reply to comment #38)

Thanks. Do we have a list of instances that came up (like a user subpage or
something)? I added some more logging and would like to see if it corresponds
to new problems.

Intermittent reports continue at http://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Diff.2Fwatchlist_difficulties but I think most editors assume that you are already well aware of the problem so aren't bothering to itemise the many occurrences!

More current examples from [[Special:RecentChanges]]:

  • [[2012 Arab Nations Cup]] 2012-06-22T18:40:58 and 18:41:06 UTC (null edit listed as +14,713bytes):

http://en.wikipedia.org/w/index.php?title=2012_Arab_Nations_Cup&curid=27527671&diff=498870742&oldid=498870712

  • [[Manchester United F.C.]] 2012-06-22T18:44:22 and 18:44:52 UTC (non-null edit listed as +91,463bytes):

http://en.wikipedia.org/w/index.php?title=Manchester_United_F.C.&diff=498871316&oldid=498871227

The second example is unusual because the editor revised his/her own previous edit with a NON-null edit, so it does not superficially appear problematic. But the recentchanges log incorrectly shows a delta equal to the article size, even though the article size was unchanged, so this is presumably the same underlying bug.

Have you identified any software changes that coincided with the initial flurry of reports on 2012-05-30?

None of those examples match the logging, so there is clearly a third independent cause of this bug. So the first was fixed already, the second is awaiting review in Gerrit (https://gerrit.wikimedia.org/r/#/c/12613/), and the third (and hopefully last) cause is still elusive.

This is starting to feel like turning over a rock...

*** Bug 37734 has been marked as a duplicate of this bug. ***
*** Bug 37443 has been marked as a duplicate of this bug. ***

(In reply to comment #39)

Have you identified any software changes that coincided with the initial flurry
of reports on 2012-05-30?

You can check http://wikitech.wikimedia.org/view/Server_admin_log, I didn't see anything too interesting there.

richardg_uk wrote:

I've no idea which server log items might be relevant, but we could narrow down the timeframe.

If you have not already done so, could you extract the relevant recentchanges entries (or a large sample) spanning the days around 2012-05-30?

The table is purged after 30 days (at least on enwiki), so a database query now would
be useful to preserve a comprehensive record for future analysis, even if your current focus is on logging within the current code.

For example, the rough SQL below should identify most duplicate or edit-conflicting rows in recentchanges:

  • including edit/create entries which have a prior byte length of zero or null
  • but excluding possible page creates (where there is no recent prior edit/create)
  • and excluding possible unblanked pages (where there exists any recent edit with a new byte length of zero, or there is no recent prior edit).

SELECT trc.*
FROM recentchanges trc,
  ( SELECT rc_cur_id, MIN(rc_id) AS rc_id_min, MIN(COALESCE(rc_new_len,0))
    FROM recentchanges
    WHERE rc_new_len>0 AND rc_type<=1
    GROUP BY 1
    HAVING MIN(COALESCE(rc_new_len,0))>0 ) trcmin
WHERE (trc.rc_old_len=0 OR trc.rc_old_len IS NULL)
  AND trc.rc_type<=1
  AND trc.rc_cur_id=trcmin_rc_cur_id
  AND trc.rc_id>trcmin.rc_id_min

For simplicity, the above query ignores edits which do not have a valid older entry on recentchanges; and ignores all pages which have recently been blanked.

Hope that's helpful; apologies for being presumptuous if not! With the 30-day purge window imminent, I thought it best to make a specific suggestion.

*** Bug 37919 has been marked as a duplicate of this bug. ***

Results of

http://toolserver.org/~saper/bug37225/query.sh

from plwiki_p are here:

http://toolserver.org/~saper/bug37225/20120624.txt

I suspect bug 37919 describes one specific case of this bug (maybe we need a new title here?)

(In reply to comment #44)

SELECT trc.* from recentchanges trc, ( SELECT rc_cur_id, MIN(rc_id) AS
rc_id_min, MIN(COALESCE(rc_new_len,0)) from recentchanges WHERE rc_new_len>0
AND rc_type<=1 GROUP BY 1 HAVING MIN(COALESCE(rc_new_len,0))>0 ) trcmin WHERE
(trc.rc_old_len=0 OR trc.rc_old_len IS NULL) AND trc.rc_type<=1 AND
trc.rc_cur_id=trcmin_rc_cur_id AND trc.rc_id>trcmin.rc_id_min

This query had a minor typo in it ("trcmin_rc_cur_id" --> "trcmin.rc_cur_id"). Here are the results from enwiki_p (warning: 8004 results): http://p.defau.lt/?Z52AerNoGQ3hKM77IYNa5g.

richardg_uk wrote:

Thanks for running those queries and making the data available.

Apologies for the typo. Also, I shouldn't have included "rc_new_len>0" within the WHERE condition of trcmin (it defeats the HAVING condition which should have excluded previously-blanked pages), so the above results contain "background noise" of a few spurious unblanking edits.

For reference, the following SQL should exclude these cases, and is reformatted to exclude some irrelevant columns and revdeleted rows (simplifying where null handling is not needed):

SELECT trc.rc_timestamp, trc.rc_id, trc.rc_user, trc.rc_user_text, trc.rc_namespace,
  trc.rc_title, trc.rc_minor, trc.rc_bot, trc.rc_cur_id, trc.rc_this_oldid,
  trc.rc_last_oldid, trc.rc_new_len, trc.rc_deleted, trc.rc_logid, trc.rc_log_type,
  trc.rc_comment
FROM recentchanges trc,
  ( SELECT rc_cur_id, MIN(rc_id) AS rc_id_min, MIN(rc_new_len)
    FROM recentchanges
    WHERE rc_type<=1
    GROUP BY 1
    HAVING MIN(rc_new_len)>0 ) trcmin
WHERE trc.rc_old_len=0
  AND trc.rc_type<=1
  AND trc.rc_cur_id=trcmin.rc_cur_id
  AND trc.rc_id>trcmin.rc_id_min
  AND trc.rc_deleted=0
ORDER BY 1

But it's probably not necessary to re-run a query, because the onset timing is clear if we filter out all non-mainspace edits and obvious mainspace blanking reverts.

Oddly, the rc_last_oldid values seem to be correct, even though rc_old_len is always incorrect.

The earliest enwiki mainspace delta discrepancy is at 2012-05-29T08:50:30 UTC:
http://en.wikipedia.org/w/index.php?title=Impeachment_of_Renato_Corona&diff=494919562&oldid=494919551

The first 15 enwiki mainspace discrepancies are:

  • 2012-05-29T08:50:30 Impeachment_of_Renato_Corona [non-null delta-discrepancy edit]
  • 2012-05-29T08:50:39 Johannes_Brahms
  • 2012-05-29T08:50:41 Impeachment_of_Renato_Corona
  • 2012-05-29T08:50:43 Johannes_Brahms
  • 2012-05-29T08:50:54 Impeachment_of_Renato_Corona
  • 2012-05-29T08:51:05 Impeachment_of_Renato_Corona
  • 2012-05-29T09:41:51 Uzi
  • 2012-05-29T09:41:53 Tuba
  • 2012-05-29T09:41:57 Tuba
  • 2012-05-29T09:53:09 UEFA_Euro_2012_squads
  • 2012-05-29T10:08:57 Sydney_Opera_House
  • 2012-05-29T10:17:34 Nicole_Kidman
  • 2012-05-29T10:20:04 Confederation_of_African_Football
  • 2012-05-29T10:26:51 Diamond_Jubilee_of_Elizabeth_II
  • 2012-05-29T10:27:19 Diamond_Jubilee_of_Elizabeth_II

The error rate seems to have stabilised at around 15 an hour, other than a significantly higher rate on 2012-06-20 between approximately 21:00 and 22:00 UTC.

The earliest plwiki mainspace discrepancy is at 2012-05-30T06:35:26 UTC:
http://pl.wikipedia.org/w/index.php?title=Ben_10:_Tajemnica_Omnitrixa&diff=31467920&oldid=31467918

The plwiki discrepancies recur at approximately 3-hour intervals during waking hours so there may be some significance to the 22 hour difference between enwiki onset and plwiki onset.

Cross-checking against the server admin log, could the 2012-05-29 disk space error be relevant? (Perhaps it was not logged until a few hours after data corruption: "16:10: hashar: srv187 and srv188 are out of disk space".)

Hope that's helpful - I don't know much about the innards of MediaWiki so I'm guessing what might be relevant!

1.20wmf4 was deployed on 2012-05-28, but for testwiki and mediawikiwiki only. enwiki waited until June 4.

srv187 and srv188 were apaches, so they wouldn't affect page revisions.

Perhaps the sync of php-1.20wmf4 the day before broke/uncached something? :/

Bumping down priority to "high". This problem is annoying, but it's not so important as to justify stopping all other dev work to fix this. Aaron continues to investigate this among his other tasks.

*** Bug 37965 has been marked as a duplicate of this bug. ***
*** Bug 38086 has been marked as a duplicate of this bug. ***

Getting more and more dupes for the same problem is very annoying. Please fix this as soon as possible. Thanks.

lexein-w wrote:

(In reply to comment #50)

Bumping down priority to "high". This problem is annoying,

This bug is more than merely annoying; it's needlessly alarming to editors. In bug 38086, a reported size change of "(+110,766)" that should be "(0)" seemingly indicates a massive article change (blanking/vandalism/experiments). These usually require immediate action such as protection or blocking.

If this bug persists, editors will begin to see all high deltas as false alarms, and begin ignoring them. This will allow real vandalism to persist longer.

(In reply to comment #53)

Getting more and more dupes for the same problem is very annoying.

More "dupes" are appearing because the title of this bug is misleading and does not appear when searching for "watchlist wrong size change" or "wrong delta".

lexein-w wrote:

(In reply to comment #54)

... the title of this bug is misleading ...

I see the title was nicely expanded. Thx.

I did some more live debug logging.

I see things like:
2012-07-06 22:05:50 mw69 frwiki: Could not load rev 80557024 for Bargen (Berne) [208044].
2012-07-06 22:02:57 mw68 itwiki: Could not load rev 50844198 for Grandecourt [195339].
2012-07-06 21:28:18 mw39 enwiki: Could not load last rev for Dallas Mavericks [72880].
...which usually refer to duplicate edits.

So the "$revision = Revision::newFromPageId( $this->getId(), $latest );" call in loadLastEdit() seems to return false sometimes, causing getRawText() to return false, causing the strcmp( $text, $oldtext ) comparison to detect a change in text. This can lead to a duplicate revision, and an RC row with a change size of the whole text (since $oldsize is 0).

I don't know why that happens, or why newFromPageId() would return false there. It falls back to the master if the revision is not found, so in theory it shouldn't be slave lag. But looking at the above logs (which give rev ID, page title, and page ID) and inspecting the pages, those revs clearly exist with those parameters, so there is no reason they shouldn't have been found. Normally, something like this points to a bug around slave lag.

In any case, I may be able to stop this problem by just throwing exceptions when getRawText() returns false but getLatest() is not zero.
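A sketch of this failure mode in Python (PHP's strcmp() coerces false to the empty string, which the model below makes explicit; the function name is invented for illustration):

```python
def detect_change(new_text, old_text):
    """Mirror of doEdit()'s $changed = ( strcmp( $text, $oldtext ) != 0 )."""
    # PHP coerces false to "" inside strcmp(), so a failed text load
    # looks exactly like an empty page.
    if old_text is False:
        old_text = ""
    return new_text != old_text

new_text = "the full article text"
old_text = False   # getRawText() failed even though a revision exists

assert detect_change(new_text, old_text)     # spurious "change" -> duplicate edit
old_len = 0 if old_text is False else len(old_text)
assert old_len == 0                          # RC delta becomes the whole page size
```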

I also see strange things like:
2012-07-06 21:49:15 srv216 enwiki: Could not find text with ID 0.
...that sounds like its own bug to look at. Sounds like something may be using Revision::newFromRow() without rev_text_id in the field list and then calling getText().

(In reply to comment #56)

I did some more live debug logging.

But why is doEdit() called more than once? Per previous comments, this also happens to bot edits.

What happens when the master can't be reached (in Revision::newFromConds())? Any chance we could get a null back instead of an exception?

(In reply to comment #58)

What happens when the master can't be reached (in Revision::newFromConds())?
Any chance we could get a null back instead of an exception?

Normally no, unless the ignoreErrors() function was called on the DB. Running the sha1 script, though, I noticed that sometimes the DB class forgets that the connection was dropped, and I get raw mysql_query warnings. I think this happens when DB errors are caught and the DB is then used further down the line.

After some playing around with xmldumps-test, I am able to reliably
reproduce the problem automatically. It's a race condition that can
be triggered in the following way:

(1) Start with a page, whose last revision is from PersonA.
(2) PersonB starts uploading a new revision of the page.
(3) PersonC starts uploading a new revision of the page.
(4) Request from PersonC finishes
(5) Request from PersonB finishes

Between (2) and (5), PersonB's connection to the master database executes the SQL sequence

[...]
SELECT /* User::loadGroups */ ug_group FROM user_groups WHERE ug_user = '1'
SELECT /* Block::newLoad Root */ ipb_id,ipb_address,ipb_by,ipb_by_text,ipb_reason,ipb_timestamp,ipb_auto,ipb_anon_only,ipb_create_account,ipb_enable_autoblock,ipb_expiry,ipb_deleted,ipb_block_email,ipb_allow_usertalk,ipb_parent_block_id FROM ipblocks WHERE ipb_address = 'Root'
SELECT /* WikiPage::pageData Root */ page_id,page_namespace,page_title,page_restrictions,page_counter,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len FROM page WHERE page_namespace = '0' AND page_title = 'PageZ' LIMIT 1 FOR UPDATE
[...]

If PersonC's request is handled /between/ the given first and third
SELECT, both revisions (PersonB's and PersonC's) end up in the
database: the oldest revision is PersonA's, followed by PersonC's, and
finally PersonB's. In "Recent Changes", PersonB's revision is listed
with its full size as the + value.

No database error, connection loss, network problem, etc need to
occur.
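The interleaving can be replayed deterministically as a toy, single-process model (no threads; every name is invented, and the "steps" are executed by hand in the problematic order):

```python
# Toy "database": a page row and a revision table.
page = {'latest': 5}                 # page row: page_latest
revisions = {5: 'PersonA'}           # revision table: rev_id -> author

def begin_edit():
    # Steps (2)/(3): the request snapshots the page early, and the revision
    # text it read stays in a process cache from here on.
    return {'base': page['latest']}

def finish_edit(ctx, author, new_rev_id):
    # Later, loadPageData('forupdate') re-reads page_latest FOR UPDATE and
    # sees the *committed* value, so the page_latest == $oldid guard passes,
    # while $changed was computed against the stale cached text.
    ctx['base'] = page['latest']
    revisions[new_rev_id] = author
    page['latest'] = new_rev_id

b = begin_edit()               # (2) PersonB starts uploading
c = begin_edit()               # (3) PersonC starts uploading
finish_edit(c, 'PersonC', 6)   # (4) PersonC's request finishes first
finish_edit(b, 'PersonB', 7)   # (5) PersonB's duplicate lands on top

assert page['latest'] == 7
assert list(revisions) == [5, 6, 7]   # A's, then C's, then B's revision
```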

Yes, it's only a rough description. But as I will not find time to
investigate the issue further in the next few days, I hope this
description and SQL sequence nevertheless helps others to quickly
reproduce the race condition as well and hunt it down. As the
setup is tedious, I attached some log files. ;-)

Created attachment 10836
SQL logs taken while the bug occurs


Note that an exception is now thrown per comment 56. This should not be occurring anymore. A more ideal solution is still in order.

Seems Aaron provided a hotfix while I prepared the logs for uploading
in the previous comments :D

But regarding his call for a "more ideal solution" ...
it seems the culprit is the use of "FOR UPDATE" in (only) some of
the relevant queries. So some queries see values committed by other
transactions, while other queries do not.

Assume we want to edit a page PageZ that is at revision 5. B and
C are each about to submit a new revision.

B submits first, and opens a master (say BM) and a slave (say BS)
connection to the database. Both connections are on a transaction.
For both BM and BS, the page's page_latest is obviously 5, and the
highest rev_id in the revision table for the page is obviously 5 as
well.

/Before/ B's loadPageData executes the SQL of EditPage.php:1206

$this->mArticle->loadPageData( 'forupdate' );

C starts submitting her revision. So C also opens a master (say CM)
and a slave (say CS) connection to the database. Both connections are
on a transaction as well.
For both CM and CS, the page's page_latest is obviously 5, and the
highest rev_id in the revision table for the page is obviously 5 as
well.

Now while B's request is handled slowly, C's request zooms right
through. C's revision gets added (using rev_id 6) on CM for the page.

For BM, BS, and CS, the page's page_latest is still 5, while for CM it
is 6. The same holds for the page's highest rev_id.

C's request is done. The transactions CM and CS are committed.

Now B's request reaches the SQL for EditPage.php:1206

$this->mArticle->loadPageData( 'forupdate' );

The resulting SQL statement (for connection BM) is

SELECT /* WikiPage::pageData Root */
page_id,page_namespace,page_title,page_restrictions,page_counter,
page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len
FROM `page` WHERE page_namespace = '0' AND page_title = 'PageZ'
LIMIT 1 FOR UPDATE

Although the page's page_latest has been 5 before for BM, this query
now yields a page_latest of 6, due to the "FOR UPDATE" suffix.

But recall that for both BM and BS the page's page_latest is still 5
without "FOR UPDATE", just as their highest rev_id in the revision
table for the page is also still at 5.

So the subsequent Revision::fetchFromConds on BS (with "rev_id = '6'") fails.
Also the subsequent Revision::fetchFromConds on BM (with "rev_id = '6'") fails.
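The snapshot behaviour described above can be modelled without MySQL. In this toy model (all class and key names are hypothetical), a transaction takes a snapshot at its start; plain SELECTs read that snapshot, while a FOR UPDATE read bypasses it and sees the latest committed state. This reproduces the mismatch that makes B's fetch of rev_id 6 fail:

```python
import copy

class ToyDB:
    """Minimal model of InnoDB REPEATABLE READ: plain reads come from a
    per-transaction snapshot, locking reads (FOR UPDATE) from committed state."""
    def __init__(self):
        self.committed = {"page_latest": 5, "revs": {1, 2, 3, 4, 5}}

    def begin(self):
        return Txn(self)

class Txn:
    def __init__(self, db):
        self.db = db
        self.snapshot = copy.deepcopy(db.committed)  # taken at txn start

    def select(self, key):               # plain SELECT: consistent read
        return self.snapshot[key]

    def select_for_update(self, key):    # SELECT ... FOR UPDATE: locking read
        return self.db.committed[key]

    def commit(self, **changes):
        self.db.committed.update(changes)

db = ToyDB()
b = db.begin()                 # PersonB's master connection BM
c = db.begin()                 # PersonC's master connection CM

# C's request zooms through: inserts rev 6, bumps page_latest, commits.
c.commit(page_latest=6, revs=c.select("revs") | {6})

# B's loadPageData('forupdate') now sees the committed page_latest ...
latest = b.select_for_update("page_latest")
print(latest)                          # 6

# ... but B's plain reads still use its old snapshot, so rev 6 is missing.
print(6 in b.select("revs"))           # False -> Revision::fetchFromConds fails
```

The point of the model: the FOR UPDATE read and the plain reads disagree inside one and the same transaction, which is exactly the mix of "sees committed values" and "does not" blamed above.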

Should we:

  • just stick with the hotfix of exception throwing,
  • restart our transaction to be able to fetch revs that have been committed after our transaction started,
  • add more "FOR UPDATE"s to be able to fetch revs that have been committed after our transaction started,
  • get rid of the "FOR UPDATE"s and detect an edit conflict, or
  • something completely different?

So first, my apologies for having caused this bug with Gerrit changeset #6079. I already mentioned MySQL's "consistent read" feature there, but I did not foresee this problem. I chose to use "FOR UPDATE" in EditPage.php to catch edit conflicts before executing WikiPage::doEdit(), but it seems that this is causing more harm than good.

An easy solution is to remove the "FOR UPDATE" and use the already-present conflict detection code in WikiPage::updateRevisionOn(), since it puts a condition on the page_latest field, using the previously fetched value, when updating the database row. A value mismatch will then trigger a transaction rollback and an edit-conflict form.
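That conflict-detection pattern is essentially a compare-and-set: the UPDATE carries the page_latest value that was read earlier, and an affected-row count of zero signals a concurrent edit. A minimal sketch of the pattern (table and column names borrowed from the discussion; the function is a simplification, not WikiPage::updateRevisionOn() itself):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_latest INTEGER)")
conn.execute("INSERT INTO page VALUES (1, 5)")

def update_revision_on(conn, page_id, expected_latest, new_latest):
    """Compare-and-set: only move page_latest forward if nobody else did.
    Returns False (edit conflict) when the row was changed concurrently."""
    cur = conn.execute(
        "UPDATE page SET page_latest = ? WHERE page_id = ? AND page_latest = ?",
        (new_latest, page_id, expected_latest),
    )
    return cur.rowcount == 1

# C wins the race: 5 -> 6 succeeds.
print(update_revision_on(conn, 1, 5, 6))   # True
# B still believes page_latest is 5: its UPDATE matches no row -> conflict.
print(update_revision_on(conn, 1, 5, 7))   # False
```

The design choice here is that no locking read is needed at all: the stale reader loses at write time and can then be shown the edit-conflict form.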

It seems the isolation between different edits (or other actions that create revisions) has been removed, and this introduces another (maybe more severe) bug: if two such requests are sent simultaneously, request B may fail with "<?xml version="1.0"?><api servedby="mw71"><error code="internal_api_error_MWException" info="Exception Caught: Could not find text for current revision 21906884." xml:space="preserve" /></api>", where 21906884 is the rev_id of another edit by request A that is still incomplete when request B is processed.

That's the exception added by Aaron. It's better to throw such an exception than to produce this bug.

I've fixed DATA_FOR_UPDATE loading in https://gerrit.wikimedia.org/r/#/c/16228/.

I've replaced the 'forupdate' loading with 'fromdbmaster' in EditPage in https://gerrit.wikimedia.org/r/#/c/16229/.

Did we apply a fix retrospectively in the end?

Possibly related (may cause premature commits, though I don't see how it could trigger this exact problem): I80faf2ed

To avoid the issue of premature commits and general transaction confusion, I suggest merging I8c0426e1 soon (along with the fixes I mentioned in the comments there).