
Some revisions seem to have disappeared on it.wikipedia
Closed, Resolved (Public)

Description

Details

Reference
bz56577

Event Timeline

bzimport raised the priority of this task to Unbreak Now!. Nov 22 2014, 2:37 AM
bzimport set Reference to bz56577.
bzimport added a subscriber: Unknown Object (MLST).

The archive rows appear on the master but not on the slaves:

mysql:root@db1027 [frwiki]> select count(*) from archive;
+----------+
| count(*) |
+----------+
|  4476558 |
+----------+

mysql:root@db1040 [frwiki]> select count(*) from archive;
+----------+
| count(*) |
+----------+
|  4469921 |
+----------+

The difference is 6637 missing archive rows.

Sorry, the above comment should have gone on bug 56589 (which this will probably be duped to anyway)

The same calculation shows 4371 missing archive rows on itwiki.

We have some MariaDB replication slave drift due to our use of statement-based
binlogs (necessary for a further 6+ months, for operational reasons, before a
switch to row-based).

This is an ongoing problem in general, but in this case it's also been
exacerbated by recent schema changes on the archive table.

No data is actually missing and the affected slaves are being resynchronized.

SBR is what we've always used, how can it cause "slave drift"? Is there a particular bug with it in MariaDB?

How did you do the ar_id schema change? I remember when we added log_id, it wasn't possible to just do an ALTER TABLE on each slave, because the resulting IDs wouldn't be synchronised. We had to recreate the table and insert all rows into it. Is that a fixed bug, or did you work around that somehow?

Slave drift is not a particular bug with MariaDB, but a series of possible problems mostly related to statements that are non-deterministic plus a few fringe cases.

http://dev.mysql.com/doc/refman/5.5/en/replication-rbr-safe-unsafe.html
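
An illustrative (not MediaWiki-specific) example of the kind of statement that manual flags; the table and column names here are hypothetical:

```sql
-- Unsafe under statement-based replication: without an ORDER BY, the
-- 10 rows chosen can differ between master and slave
UPDATE page_flags SET checked = 1 LIMIT 10;

-- Deterministic variant: pin the row selection to a unique column
UPDATE page_flags SET checked = 1
 ORDER BY id          -- id assumed unique
 LIMIT 10;
```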

Since we upgraded to MariaDB 5.5 our mysqld error logs have been steadily growing with warnings related to unsafe binary log statements that could cause a Slave to drift out of sync with its master. An example (not specifically related to this bug):

131028 5:58:22 [Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. INSERT... ON DUPLICATE KEY UPDATE

These warning messages didn't appear as frequently in MySQL 5.1 error logs, not because it was safer then, but because they're additional warnings added to MariaDB 5.5 to identify possible problems exposed in bug reports filed *since* 5.1 went GA. They indicate that our slave datasets have been slowly drifting for some time.
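
The flagged INSERT ... ON DUPLICATE KEY UPDATE form is unsafe when a table carries more than one unique index. A minimal illustration (table and values are hypothetical, not from this incident):

```sql
CREATE TABLE counters (
  id   INT PRIMARY KEY,
  name VARCHAR(16) UNIQUE,
  hits INT NOT NULL
);

-- If rows (1,'a',...) and (2,'b',...) both exist, this statement conflicts
-- on BOTH unique indexes; which existing row receives the UPDATE depends on
-- which conflict the server hits first, so master and slave can diverge.
INSERT INTO counters (id, name, hits) VALUES (1, 'b', 0)
ON DUPLICATE KEY UPDATE hits = hits + 1;
```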

I've been investigating and correcting the problem using the pt-table-sync tool [1]. Unfortunately that tool requires a primary or unique key on each table to operate efficiently, which archive did not have, hence it may have been further out of sync on slaves than other tables. This is also the reason I wanted to find a way to do the schema change on the master rather than rotating a slave.

In this case archive schema was modified by:

  • Delaying a slave as fallback
  • Creating a copy of archive as _archive_new
  • Altering _archive_new to add ar_id PK
  • Altering _archive_new to add ar_hash (an md5 hash of the row, for sanity checks)
  • Adding INSERT/UPDATE/DELETE triggers to archive to keep _archive_new in sync
  • Batch inserting data from archive to _archive_new
  • Cross checking the two tables for differences
  • Removing ar_hash
  • Switching the two tables
  • Keeping old table around as archive_save for a time
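
The trigger step above can be sketched as follows; the column list is abridged (the real archive table has many more columns) and the trigger names are hypothetical:

```sql
-- Sketch of the sync triggers on archive (column list abridged)
CREATE TRIGGER archive_sync_ins AFTER INSERT ON archive FOR EACH ROW
  REPLACE INTO _archive_new (ar_rev_id, ar_namespace, ar_title)
  VALUES (NEW.ar_rev_id, NEW.ar_namespace, NEW.ar_title);

CREATE TRIGGER archive_sync_del AFTER DELETE ON archive FOR EACH ROW
  DELETE FROM _archive_new
   WHERE ar_namespace = OLD.ar_namespace
     AND ar_title = OLD.ar_title
     AND ar_rev_id <=> OLD.ar_rev_id;  -- NULL-safe: ar_rev_id can be NULL on old rows
```

An UPDATE trigger would follow the same pattern.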

The batch transfer step was done with pt-online-schema-change using the temporary ar_hash as key (normally it would use the table primary key, but of course archive didn't have one).
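
The temporary ar_hash column might have looked like the following sketch; the exact concatenation and key definition are assumptions (only its role as a per-row checksum usable as a chunking key comes from the comments above):

```sql
-- Hypothetical sketch: a per-row MD5, unique so the copy tool can chunk on it
-- (assumes no fully duplicate rows; the real column/expression may differ)
ALTER TABLE _archive_new
  ADD COLUMN ar_hash CHAR(32) NOT NULL DEFAULT '',
  ADD UNIQUE KEY (ar_hash);

UPDATE _archive_new
   SET ar_hash = MD5(CONCAT_WS('#', ar_namespace, ar_title, ar_rev_id, ar_timestamp));
```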

[1] As there may yet be more unforeseen bugs with SBR, the only true fix is to switch to RBR.

Oops forgot to add:

The batch transfer process used a REPLACE INTO ... SELECT query walking the archive.ar_rev_id index in fixed sizes. This statement in some forms is one that gets listed as unsafe for SBR.
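
A single batch of that transfer might look like the following sketch (the column list is abridged and the batch boundaries are illustrative):

```sql
-- One illustrative batch, walking the ar_rev_id index in fixed-size ranges
REPLACE INTO _archive_new (ar_rev_id, ar_namespace, ar_title)
SELECT ar_rev_id, ar_namespace, ar_title
  FROM archive
 WHERE ar_rev_id >  1000000
   AND ar_rev_id <= 1001000;
```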

My initial checks on slaves showed it should have been ok there, but the focus was always to ensure master accuracy first and continue to check slave sync regularly while SBR is in effect.

This bug occurring around the time of the schema change isn't likely to be a coincidence, hence my placing SBR in the spotlight.

Thanks for that.

If MediaWiki is generating any queries which are actually causing slave drift, you should file bugs for that. We may be able to work around it on the application side.
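
A typical application-side workaround is to read the target rows first and then write by explicit unique values, which replicates deterministically (illustrative, not an actual MediaWiki query; assumes the new ar_id primary key):

```sql
-- 1) Pick the rows on the master
SELECT ar_id FROM archive
 WHERE ar_namespace = 0 AND ar_title = 'Example'
 LIMIT 100;

-- 2) Write by primary key, which selects the same rows on every slave
DELETE FROM archive WHERE ar_id IN (101, 102, 103);  -- ids from step 1
```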

(In reply to comment #8)

> Thanks for that.
>
> If MediaWiki is generating any queries which are actually causing slave drift, you should file bugs for that. We may be able to work around it on the application side.

At least for these two pages it seems to work again.