
Replicated database of dewiki is corrupted
Closed, Resolved, Public

Description

max(rc_timestamp) stuck at 20131126202307.


Version: unspecified
Severity: critical

Details

Reference
bz57642

Event Timeline

bzimport raised the priority of this task to Unbreak Now! (Nov 22 2014, 2:23 AM).
bzimport added a project: Toolforge.
bzimport set Reference to bz57642.

Replication of wikidatawiki and dewiki started again yesterday, but then stopped again.

max(rc_timestamp) is 20131127003303 for both wikidatawiki and dewiki.

Bug 57645 has been marked as a duplicate of this bug.

Replication for dewiki seems to be working now, but two thirds of all revisions are still missing.

SELECT count(*) FROM revision;
-> 41,249,221 (Special:Statistics: 130,634,662)

SELECT count(*) FROM page WHERE page_namespace=0;
-> 2,777,194 (~correct)

SELECT count(*) FROM page, revision WHERE page_latest=rev_id AND page_namespace=0;
-> 374,962 (!)
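The gap between the last two counts only says how many latest revisions are absent, not which ones. A minimal sketch of an anti-join that lists the affected pages follows. It runs against an in-memory SQLite toy schema rather than the real replica; the three sample rows are invented for illustration, and only the column names (page_id, page_namespace, page_title, page_latest, rev_id, rev_page) are taken from the queries in this thread.

```python
import sqlite3

# Toy stand-in for the replica: only the columns used in the thread's queries.
# The rows below are invented sample data, not real dewiki content.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (
    page_id INTEGER PRIMARY KEY,
    page_namespace INTEGER,
    page_title TEXT,
    page_latest INTEGER
);
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER);
INSERT INTO page VALUES (1, 0, 'A', 101), (2, 0, 'B', 102), (3, 1, 'Talk_A', 103);
INSERT INTO revision VALUES (101, 1), (103, 3);  -- revision 102 is absent
""")

# Anti-join: keep article pages whose page_latest has no matching revision row.
MISSING_LATEST = """
SELECT page_id, page_title, page_latest
FROM page
LEFT JOIN revision ON rev_id = page_latest
WHERE page_namespace = 0 AND rev_id IS NULL
ORDER BY page_id
"""

def pages_missing_latest(conn):
    """Return (page_id, page_title, page_latest) for pages lacking their latest revision."""
    return conn.execute(MISSING_LATEST).fetchall()

print(pages_missing_latest(conn))  # [(2, 'B', 102)]
```

On the real replica the same SELECT would probably need a LIMIT, since it scans every article page.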

(In reply to comment #4)

Replication for dewiki seems to be working now, but two thirds of all revisions are still missing.

wikidatawiki seems not affected.

MariaDB [wikidatawiki_p]> SELECT count(*) FROM page WHERE page_namespace=0;
+----------+
| count(*) |
+----------+
| 14117708 |
+----------+
1 row in set (3.12 sec)

MariaDB [wikidatawiki_p]> SELECT count(*) FROM page, revision WHERE page_latest=rev_id AND page_namespace=0;
+----------+
| count(*) |
+----------+
| 14117708 |
+----------+
1 row in set (6 min 46.84 sec)

wikipedia wrote:

Many tools on Tool Labs are still broken due to this bug. Please fix it as soon as possible.

See also discussion in de.wikipedia:
https://de.wikipedia.org/wiki/Wikipedia_Diskussion:Kurier#Toolserver.2FLabs-Probleme

I changed the priority back to "Highest", as a fix within "one to six months" is far too slow. The bug currently makes Tool Labs unusable for German tool users (and coders).

Hi Andre, when you notice database issues, please CC Sean Pringle for investigation (doing so now).

The issue is on labsdb1002 and is a flow-on effect from this incident http://lists.wikimedia.org/pipermail/labs-l/2013-November/001883.html . The labsdb1002:3308 dewiki.revision table is still being synced from upstream by pt-table-sync and it is affecting replication. Context:

  • Originally replication was stopped completely and a full dump/restore from upstream dewiki was done, however labsdb1002:3308 mysqld crashed in the process (see below). The revision table was only partially restored.
  • To avoid blatting labs user data with a full rebuild affecting all wikis, I switched to using pt-table-sync with replication on the weekend to bring revision back up to full row count.

However labsdb1002 has since crashed again with the kernel OOM killer sniping mysqld:3308. The sync process is batched and low footprint (where the dump method was not) but other labsdb txns must still be slowed down enough to add up to an infrequent mem usage spike.

Therefore yesterday I reduced the InnoDB buffer pool size for all three labsdb1002 mysqld instances by 25%. OOM killer has not struck since and based on row counts the dewiki.revision sync process should resolve within the next 12h.

Will you also repair the externallinks table, which is incomplete as well?

metatron wrote:

Current water line:

MariaDB [dewiki_p]> SELECT count(*) FROM revision union SELECT count(*) FROM archive;
+-----------+
| count(*)  |
+-----------+
| 114264012 |
|  10414140 |
+-----------+
2 rows in set (26.56 sec)

So sum is: 124,678,152
API site stats report: 130,742,732
Difference: −6,064,580
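The arithmetic behind these three figures can be re-checked with a few lines. This is only a sketch; the variable names are made up here, and the numbers are the ones quoted in this comment.

```python
# Row counts quoted in this comment (dewiki_p on Tool Labs).
revision_rows = 114_264_012
archive_rows = 10_414_140
# Total reported by the API site stats, as quoted above.
site_stats_total = 130_742_732

replica_total = revision_rows + archive_rows
gap = replica_total - site_stats_total

print(f"{replica_total:,} {gap:,}")  # 124,678,152 -6,064,580
```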

Has replication already finished? If yes, how can this difference be explained?

Toolserver doesn't remove some lines that it should, which are in fact removed on Tool Labs; mostly having to do with revision deletion and suppression (so the data wouldn't be available anyways).

There are still revisions missing for dewiki (e.g. 125050920, 125087630, 125137961...).

flaggedpages has errors. Example:

MariaDB [dewiki_p]> SELECT * FROM flaggedpages WHERE fp_page_id=8507;
+------------+-------------+-----------+------------+------------------+
| fp_page_id | fp_reviewed | fp_stable | fp_quality | fp_pending_since |
+------------+-------------+-----------+------------+------------------+
|       8507 |           1 | 125054895 |          0 | NULL             |
+------------+-------------+-----------+------------+------------------+
1 row in set (0.03 sec)

But 125054895 isn't the current stable version, it should be 125144014, see http://de.wikipedia.org/w/index.php?title=Reformation&action=history

Replication stopped again for dewiki:

MariaDB [dewiki_p]> SELECT max(rc_timestamp) FROM recentchanges;
+-------------------+
| max(rc_timestamp) |
+-------------------+
| 20131207135755    |
+-------------------+
1 row in set (0.04 sec)
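For tool authors who want to detect a stall like this automatically, here is a rough sketch that turns max(rc_timestamp) into a lag value. It relies on rc_timestamp being a 14-digit UTC string (YYYYMMDDHHMMSS). The function names and the `now` value are hypothetical; a real check would pass the current time and pick its own alert threshold.

```python
from datetime import datetime, timedelta, timezone

def parse_mw_timestamp(ts: str) -> datetime:
    """Parse a 14-digit MediaWiki timestamp (YYYYMMDDHHMMSS, UTC)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)

def replication_lag(max_rc_timestamp: str, now: datetime) -> timedelta:
    """Time elapsed between the newest replicated change and `now`."""
    return now - parse_mw_timestamp(max_rc_timestamp)

# Hypothetical moment at which the check runs.
now = datetime(2013, 12, 8, 0, 0, 0, tzinfo=timezone.utc)
print(replication_lag("20131207135755", now))  # 10:02:05
```

The value is only an upper bound on the real lag, since it also includes the time since the last edit on the wiki itself. On a busy wiki such as dewiki that is usually a matter of seconds.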

Replication for dewiki works again, flaggedpages seems to be correct, too.

But there are still missing revisions in the revision table: I don't know if these are the six million stated above (SELECT (SELECT count(*) FROM revision)+(SELECT count(*) FROM archive); vs. Special:Statistics), but there are missing revisions. Examples:

125050920 - https://de.wikipedia.org/?oldid=125050920

MariaDB [dewiki_p]> SELECT * FROM revision WHERE rev_id=125050920;
Empty set (0.04 sec)

Same for 125087630 and 125137961. These are all revisions from December 2nd or December 6th.

MariaDB [dewiki_p]> SELECT count(*) FROM page WHERE page_namespace=0;
+----------+
| count(*) |
+----------+
|  2782237 |
+----------+
1 row in set (0.73 sec)

MariaDB [dewiki_p]> SELECT count(*) FROM page, revision WHERE
    -> page_latest=rev_id AND page_namespace=0;
+----------+
| count(*) |
+----------+
|  2781856 |
+----------+
1 row in set (22.13 sec)

These two numbers should be the same.

The revisions from Comment 14 are back, but there are still several issues with the dewiki database. Maybe it's possible to do a "full comparison" or something like that?

Three examples:

The Talk page of "Clear_Cola" has the page_id 8005309 and page_latest is 67643465. This revision exists in the revision table, but rev_page for this revision is 4934436, which should be 8005309. page_id 4934436 does not exist.

The article "Morrill_Gesetz" (page_id 8004783) was deleted three days ago. The revisions are gone, but the article is still in the page table.

page_latest for the article "Boris_Zemelman" (page_id 8005384) is 125330034, this revision is missing from the revision table.
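The three kinds of inconsistency described above could be searched for in one pass, as a rough stand-in for a "full comparison". The sketch below runs on an in-memory SQLite toy schema with invented rows, not on the real replica. The column names come from this thread; the problem labels and sample data are made up for illustration.

```python
import sqlite3

# Toy schema with one invented row per failure mode plus one healthy page.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT, page_latest INTEGER);
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER);
INSERT INTO page VALUES
    (10, 'Wrong_owner', 500),
    (11, 'No_revisions', 501),
    (12, 'Latest_missing', 502),
    (13, 'Healthy', 503);
INSERT INTO revision VALUES
    (500, 99),   -- latest revision of page 10, but rev_page names another page
    (490, 12),   -- an older revision of page 12; its latest (502) is absent
    (503, 13);
""")

# Classify every page whose latest revision is absent or belongs elsewhere.
CLASSIFY = """
SELECT p.page_id,
       CASE
           WHEN r.rev_id IS NULL AND NOT EXISTS
                (SELECT 1 FROM revision x WHERE x.rev_page = p.page_id)
               THEN 'no revisions at all'
           WHEN r.rev_id IS NULL THEN 'page_latest missing'
           ELSE 'rev_page mismatch'
       END AS problem
FROM page p
LEFT JOIN revision r ON r.rev_id = p.page_latest
WHERE r.rev_id IS NULL OR r.rev_page <> p.page_id
ORDER BY p.page_id
"""

def find_inconsistencies(conn):
    """Return (page_id, problem) pairs for pages with a broken page_latest link."""
    return conn.execute(CLASSIFY).fetchall()

for row in find_inconsistencies(conn):
    print(row)
```

On tables the size of dewiki's, this would be an expensive query and would likely have to be run in batches by page_id range.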

Could we please have an update on this?

The dewiki database on labs was dumped and reloaded with the buffer pool still reduced in size as per comment 8 -- the earlier resync process was too slow.

Things should be back to normal, at least to the point of consistency with the upstream sanitarium after data redaction.

Looks good now. I can't see any of the above mentioned errors anymore. Marking this as resolved.