Page MenuHomePhabricator

nlwiki logged in users shown wrong version of page, possible db corruption
Closed, ResolvedPublic

Description

Author: romaine.wiki

Description:
In the past days several users, including myself, have experienced that visiting a voting page (where users can request becoming admin) they do not see the most recent version of that page. The problems seem to occur only with this page (so far we know): https://nl.wikipedia.org/wiki/Wikipedia:Aanmelding_moderatoren

  • there are moments users get the most recent page, it works fine
  • there are moments users get an older version, but with clicking edit they edit the current page
  • there are moments users get an older version of the page, and click edit and also edit in an older version (with loss of votes)

This also happens when a user has visited the most recent version of the page and later goes back, and then still sometimes get an older version of the page.

Purging often helps for the moment, but not always. But the problem keeps occurring also after having purged and later returning to that page.


Version: wmf-deployment
Severity: critical

Details

Reference
bz52853

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 1:45 AM
bzimport set Reference to bz52853.

romaine.wiki wrote:

At the bottom of the page it shows:
Deze pagina is het laatst bewerkt op 10 aug 2013 om 19:30.

(Translation: This page is last edited on 10 Aug 2013 at 19:30.)

This even while the most recent version is of 14 aug 2013 16:37‎.

Different servers seem to give different results. Refreshing a few times helps to see both versions.

10 aug 2013 at 19:30: mw1058, mw1166, mw1219
14 aug 2013 om 16:37: mw1088, mw1176, mw1020, mw1041

romaine.wiki wrote:

Also this bug has effect on the WhatLinksHere pages like:
https://nl.wikipedia.org/wiki/Speciaal:VerwijzingenNaarHier/Wikipedia:Mededelingen/Actueel

There the Wikipedia:Aanmelding_moderatoren sometimes is missing and sometimes is linked, while nobody changed that page in the mean while and that page is full time included in Wikipedia:Aanmelding_moderatoren.

When I refresh or access this page 20 times, I get randomly 8 times the version of 10 Aug 2013 19:30 and 12 times the most recent revision of 14 Aug 2013 16:37.

The versions that show today the revision of 10 Aug 2013 19:30 have various of:
<!-- Served by mw1058 in 1.962 secs. -->
<!-- Served by mw1090 in 2.596 secs. -->
<!-- Served by mw1102 in 1.918 secs. -->
<!-- Served by mw1219 in 1.505 secs. -->
<!-- Served by mw1176 in 1.492 secs. -->

Nothing suspicious listed on https://wikitech.wikimedia.org/wiki/Server_admin_log

I just tried 20 times with

wget -S https://nl.wikipedia.org/wiki/Wikipedia:Aanmelding_moderatoren

but always getting

Last-Modified: Wed, 14 Aug 2013 15:02:40 GMT

and I'm also based in Europe.

My 4th try used mw1219 that you listed above, but date is August 14th too:

$:andre\> grep "Served by" Wikipedia\:Aanmelding_moderatoren*
Wikipedia:Aanmelding_moderatoren:<!-- Served by mw1186 in 0.172 secs. -->
Wikipedia:Aanmelding_moderatoren.1:<!-- Served by mw1177 in 0.165 secs. -->
Wikipedia:Aanmelding_moderatoren.10:<!-- Served by mw1093 in 0.201 secs. -->
Wikipedia:Aanmelding_moderatoren.11:<!-- Served by mw1093 in 0.230 secs. -->
Wikipedia:Aanmelding_moderatoren.12:<!-- Served by mw1093 in 0.230 secs. -->
Wikipedia:Aanmelding_moderatoren.13:<!-- Served by mw1113 in 0.235 secs. -->
Wikipedia:Aanmelding_moderatoren.14:<!-- Served by mw1048 in 0.214 secs. -->
Wikipedia:Aanmelding_moderatoren.15:<!-- Served by mw1059 in 0.215 secs. -->
Wikipedia:Aanmelding_moderatoren.16:<!-- Served by mw1020 in 0.141 secs. -->
Wikipedia:Aanmelding_moderatoren.17:<!-- Served by mw1211 in 0.163 secs. -->
Wikipedia:Aanmelding_moderatoren.18:<!-- Served by mw1079 in 0.215 secs. -->
Wikipedia:Aanmelding_moderatoren.19:<!-- Served by mw1080 in 0.144 secs. -->
Wikipedia:Aanmelding_moderatoren.2:<!-- Served by mw1054 in 0.313 secs. -->
Wikipedia:Aanmelding_moderatoren.20:<!-- Served by mw1033 in 0.218 secs. -->
Wikipedia:Aanmelding_moderatoren.3:<!-- Served by mw1171 in 0.164 secs. -->
Wikipedia:Aanmelding_moderatoren.4:<!-- Served by mw1219 in 0.165 secs. -->
Wikipedia:Aanmelding_moderatoren.5:<!-- Served by mw1081 in 0.214 secs. -->
Wikipedia:Aanmelding_moderatoren.6:<!-- Served by mw1075 in 0.265 secs. -->
Wikipedia:Aanmelding_moderatoren.7:<!-- Served by mw1094 in 0.208 secs. -->
Wikipedia:Aanmelding_moderatoren.8:<!-- Served by mw1093 in 0.151 secs. -->
Wikipedia:Aanmelding_moderatoren.9:<!-- Served by mw1093 in 0.235 secs. -->

Strangely enough, in a request with these response headers:

Cache-Control:private, must-revalidate, max-age=0
Connection:keep-alive
Content-Encoding:gzip
Content-Language:nl
Content-Length:44598
Content-Type:text/html; charset=UTF-8

Date:Wed, 14 Aug 2013 15:58:41 GMT
Expires:Thu, 01 Jan 1970 00:00:00 GMT
Last-Modified:Wed, 14 Aug 2013 15:37:57 GMT

Server:nginx/1.1.19
Vary:Accept-Encoding,Cookie
Via:1.1 sq72.wikimedia.org:3128 (squid/2.7.STABLE9), 1.0 amssq42.esams.wikimedia.org:3128 (squid/2.7.STABLE9), 1.0 amssq37.esams.wikimedia.org:80 (squid/2.7.STABLE9)
X-Cache:MISS from amssq37.esams.wikimedia.org
X-Cache:MISS from amssq42.esams.wikimedia.org
X-Cache:MISS from sq72.wikimedia.org
X-Cache-Lookup:MISS from amssq37.esams.wikimedia.org:80
X-Cache-Lookup:MISS from amssq42.esams.wikimedia.org:3128
X-Cache-Lookup:MISS from sq72.wikimedia.org:3128

I actually get a page with, in the HTML, 'laatst bewerkt op 10 aug 2013 om 19:30' (mw1113 this time).

I cannot reproduce the issue with wget either - maybe there's a difference between being logged-in or not?

Note that this *also* happens on edit pages:
https://nl.wikipedia.org/w/index.php?title=Wikipedia:Aanmelding_moderatoren&action=edit

actually shows the *old* wikitext for some servers (mw1024).

The specific revision returned by the 'old servers' is this one:
https://nl.wikipedia.org/w/index.php?title=Wikipedia:Aanmelding_moderatoren&oldid=38674166

Based on the above and some other pastebins posted on IRC, these aren't frontend caching (Squid/Varnish) related at all (logged in users = no caching, MISS in X-Cache etc.). From the output that we see, Last-Modified is correct, but reportedly content body is not. Maybe a parser cache issue?

romaine.wiki wrote:

One users reports that in the history of this particular page he notes something strange. ( http://nl.wikipedia.org/w/index.php?title=Wikipedia:Aanmelding_moderatoren&offset=&limit=500&action=history )

He says he sees that one of rows has a strange high number of added bytes: +59.018. While in this particular edit not much has changed:

(huidig | vorige) 10 aug 2013 19:44‎ Glatisant (Overleg | bijdragen)‎ . . (59.018 bytes) (+59.018)‎ . . (→‎Voor moderatorschap Dqfn13) (ongedaan maken)

When I looked in the history, I saw this:
(huidig | vorige) 10 aug 2013 19:44‎ Glatisant (Overleg | bijdragen)‎ . . (59.018 bytes) (+124)‎ . . (→‎Voor moderatorschap Dqfn13) (ongedaan maken)

The exact same edit, with a different value of bytes added. I tried to reproduce the +59.018 by pressing F5 and after some version I got this same +59.018 showing up in some loads of the history.

the edit of 10 aug 2013 19:44 is the first edit after the version that keeps appearing: 10 aug 2013 19:30.

(In reply to comment #7)

One users reports that in the history of this particular page he notes
something strange. (
http://nl.wikipedia.org/w/index.php?title=Wikipedia:
Aanmelding_moderatoren&offset=&limit=500&action=history
)

He says he sees that one of rows has a strange high number of added bytes:
+59.018. While in this particular edit not much has changed:

(huidig | vorige) 10 aug 2013 19:44‎ Glatisant (Overleg | bijdragen)‎ . .
(59.018 bytes) (+59.018)‎ . . (→‎Voor moderatorschap Dqfn13) (ongedaan maken)

When I looked in the history, I saw this:
(huidig | vorige) 10 aug 2013 19:44‎ Glatisant (Overleg | bijdragen)‎ . .
(59.018 bytes) (+124)‎ . . (→‎Voor moderatorschap Dqfn13) (ongedaan maken)

The exact same edit, with a different value of bytes added. I tried to
reproduce the +59.018 by pressing F5 and after some version I got this same
+59.018 showing up in some loads of the history.

the edit of 10 aug 2013 19:44 is the first edit after the version that keeps
appearing: 10 aug 2013 19:30.

Confirmed (Timestamps are different by 2 hours for me (17:44). Presumably you have timezone set in your preference).

After refreshing multiple times, I was able to reproduce this history thing. (Where edit by Glatisant [revision 38674247] shows a diff size the same as its total size) The bad history page I got was served by mw1063 (Although seems more likely for something to be messed up on the db side then on the apache side).

Other weird things: https://nl.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Wikipedia:Aanmelding_moderatoren&rvlimit=5&rvprop=ids|flags|timestamp|user|userid|size|sha1|contentmodel|comment|parsedcomment|tags&rvstart=20130810174421&rvlimit=3&format=xml&rvdir=older Sometimes it shows 38674177 by Dqfn13, other times it skips that revision.

My theory would be that one of the database slaves has corruption related to that revision.

cc'ing Sean and Asher based on Brian's db slave corruption theory. This is a pretty unfun looking issue.

Also, I can confirm the flipping between showing Dqfn13 and not on the last url in comment 8.

select * from revision where rev_page = 8838 order by rev_id desc limit 140;

on db1002, the last five revision ids are, in order:
38674247 38674177 38674166 38674128 38674117

on db1009, the last five revision ids are:
38674247 38674166 38674128 38674117 38674097

and that is because:
mysql:wikiadmin@db1009 [nlwiki]> select * from revision where rev_id = 38674177;
Empty set (0.01 sec)

whereas on all other slaves (that show up in the noc dbtree listing) the record is present.

Not sure how to investigate the cause of the corruption though.

Note that according to https://wikitech.wikimedia.org/wiki/Server_Admin_Log we had

13:59 paravoid: powercycling db1009, down for 20h, kernel prints "BUG: soft lockup"

on Aug 11, maybe something didn't quite recover right from that. (Timestamp of the revision in question is 20130810173157)

Change 79178 had a related patch set uploaded by Springle:
remove db1009 from action while investigating bug 52853

https://gerrit.wikimedia.org/r/79178

Change 79178 merged by Springle:
remove db1009 from action while investigating bug 52853

https://gerrit.wikimedia.org/r/79178

(Back to new, Patch is just for removing a db-box from the rotation)

db1009 is depooled for investigation and, eventually, resync. Please shout if anyone still sees the issue.

db1009 revisions is missing rev_id's 38674172 -> 38674177. Several other tables checksummed (recentchanges, loging, page) also show small numbers of missing rows and/or missing updates.

Therefore db1009 is now doing a full checksum using the percona tools.

Based on missing row timestamps, the powercycle event seems the probable cause.

However, don't know yet how replication replication managed to *avoid* dying unless some funky disk syncing happened. Unfortunately relay logs don't go back that far (preciesly because the SQL thread thought all was fine and deleted them after processing).

(In reply to comment #16)

db1009 revisions is missing rev_id's 38674172 -> 38674177. Several other
tables checksummed (recentchanges, loging, page) also show small numbers of
missing rows and/or missing updates.

Therefore db1009 is now doing a full checksum using the percona tools.

If page_latest for the vote page is wrong for that slave, it would explain the other symptom people are experiancing ( wrong version coming up)

Can confirm some db1009 page records have wrong page_latest.

db1009 has been synced with the s2 master and returned to the pool.

The specific errors identified here (the missing rev ids, page_latest, etc) have been spot-checked and are intact. If anyone spots a recurrence, please shout.

One small observation: At least some of the original page_latest differences I observed were due to a difference in FLOAT/DOUBLE precision which threw off the checksum utility. Rounding to a fixed precision reduced the mismatches considerably, but did not eliminate them. Therefore I can't say definitively whether the symptom in comment 17 (wrong version) is due to this bug; just keep it in mind...

(In reply to comment #19)

One small observation: At least some of the original page_latest differences
I
observed were due to a difference in FLOAT/DOUBLE precision which threw off
the
checksum utility. Rounding to a fixed precision reduced the mismatches
considerably, but did not eliminate them. Therefore I can't say definitively
whether the symptom in comment 17 (wrong version) is due to this bug; just
keep
it in mind...

What? Page_latest should be an integer. Its a foreign key coresponding to rev_id.

page_random was different in some of the rows; I guess that's the floating vs double mentioned above. separately, page_latest was different in some rows as well, although not nearly as many.

page_random is only set on page insertion.

I also find strange that the tool noticed differences due to float/double. I expect it to consistently read the same data (no matter if float or double) in both systems...

It seems like the cause of this issue has been identified (comment 11) and the issue has been resolved (no further reports of the issue and the relevant DB has been re-synced and re-pooled).

I'm not sure there's much else to be done here. Can this bug report be marked resolved/fixed?