
Need a way to simulate replication lag to test replag issues
Open, Stalled, Low, Public, Feature

Description

We really need an easy place/way to test for possible replication lag issues in core and extensions. Currently it is just rolling the dice with a (relatively) slow deployment cycle.


Version: unspecified
Severity: enhancement
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=65394

Details

Reference
bz38945

Event Timeline

bzimport raised the priority of this task from to Low. Nov 22 2014, 12:50 AM
bzimport set Reference to bz38945.
bzimport added a subscriber: Unknown Object (MLST).

You mean like a labs instance with MediaWiki and two instances of MySQL on it, a master and an artificially delayed slave?

Something like that, yes. I don't know how to implement such a thing technically, but I read once that some MySQL versions (probably not ours) have a configuration option to add replication delay.

The features I am looking for are:

  • artificially increased delay to make it easier to catch the issues
  • easy access for developers - should not be necessary to have someone else approve and deploy your commits while testing possible fixes

Sounds like something for Wikimedia to me, but not necessarily for the MediaWiki codebase.

What kind of bugs would we catch with this?

Having the environment is one thing, but what kind of tests do you have in mind (and how would they be run)?

Filed under continuous integration for now.

Depending on the kind of issues you want to test for and how it is implemented, it may be more suitable to have QA test this from the outside instead of with PHPUnit from Jenkins.

I was more concerned about actually reproducing the issues reliably, being able to debug them easily to understand the causes, and being able to come up with and test fixes without going through gerrit. I'm doubtful that you can do that with QA tests.

Right, so if I understand you correctly, you're looking for an environment where you can work on fixing bugs and testing bugfixes related to replication lag.

In other words, a wiki (say, lagged.wikipedia.beta.wmflabs.org) to do things with (as a human being).

Not a build step for continuous integration environment. Not a test suite for MediaWiki core.

If so, let's move this to labs as a feature request: set up a wiki there that is artificially lagged.

Adding some information from an email by Sean:

Nothing built into MariaDB 5.5, but Percona Toolkit has a decent tool:

http://www.percona.com/doc/percona-toolkit/2.2/pt-slave-delay.html

However, it depends on how accurate the delay needs to be in order to be useful. The tool starts and stops the replication SQL thread predictably, but the minimum time granularity is one transaction, whose duration fluctuates, obviously.

Essentially a delay in the order of minutes is easy to maintain. Seconds... sort of.

Oracle's MySQL 5.6 has slave delay built in, using CHANGE MASTER TO MASTER_DELAY = <seconds>. The next MariaDB major release may get that port (I haven't checked), but that doesn't help us today.

  • the DBs are Ubuntu Lucid instances with MySQL installed manually (i.e. no puppet class applied)

Ubuntu has percona toolkit packages in our repos. At least coredb have them installed by default. Only depends on perl.

Labs would be nice, but something that allows debugging and tweaking of the code would be even nicer. I wonder if it would be possible to do this with MediaWiki-Vagrant.

I doubt it but I know little about Vagrant; adding Ori to the loop.

When we migrated the beta cluster from pmtpa to eqiad, Sean Pringle added a master / slave setup on beta. Apparently the slave is usually laggy.

It seems to be possible to make it always lagged. Someone can reach out to Sean to figure out how to make it happen.

According to my latest knowledge the replication delay setting only exists in recent MySQL [1] and not in MariaDB - unless the feature has been added recently.

[1] I created a three server setup with replication manually. Unfortunately it did not survive an upgrade so it was broken before we got the chance to use it.

That might depend on the percona toolkit + some custom setup. I think production has slaves which have a 24-hour delay.

Someone should talk about it with Sean Pringle.

Production has slaves delayed by 24h using the MariaDB event scheduler [1] to start/stop the replication threads. This is fine for coarse lag values of a few minutes or more, but inaccurate for anything less.

The MySQL 5.6 CHANGE MASTER TO MASTER_DELAY = N; (seconds) can be more accurate, to within roughly 10s, but is still highly dependent on the traffic generating the replicated events. I have also not seen it in action on our traffic, so... pinch of salt.

A series of 10+ second writes such as our periodic bot update/delete traffic on recentchanges or links can confuse both methods for short delays, with lag cycling between 0 and N*2.

It might be possible to achieve finer granularity on the beta slave by interleaving something like FLUSH TABLES WITH READ LOCK on another thread (or another event) to ensure the slave thread does not catch up so easily.

So that is done by passing the option 'fakeSlaveLag' in $wgDBservers. The feature was introduced in MediaWiki core back in 2008 when Tim introduced LBFactory (fbfb509df557ca9eef812f6645459c483149f186).

The code in includes/db/LoadBalancer.php:

$db->setLBInfo( $server );
if ( isset( $server['fakeSlaveLag'] ) ) {
    $db->setFakeSlaveLag( $server['fakeSlaveLag'] );
}

It is only supported for MySQL (includes/db/DatabaseMysqlBase.php). It makes getLag() return the given value, fakes getSlavePos(), and causes masterPosWait() to sleep.
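
For illustration, a LocalSettings.php fragment using that option might look like the following. This is only a sketch, assuming a MediaWiki version that still ships 'fakeSlaveLag'. The host names, credentials and the 10-second value are made up, and the unit is assumed to be seconds because that is what getLag() reports.

```php
<?php
// LocalSettings.php fragment (sketch only; hosts and credentials are hypothetical).
// The first entry in $wgDBservers is treated as the master.
$wgDBservers = array(
    array(
        'host'     => 'db-master.example',   // hypothetical host
        'dbname'   => 'wiki',
        'user'     => 'wikiuser',
        'password' => 'secret',
        'type'     => 'mysql',
        'load'     => 0,                     // send no ordinary reads to the master
    ),
    array(
        'host'     => 'db-replica.example',  // hypothetical host
        'dbname'   => 'wiki',
        'user'     => 'wikiuser',
        'password' => 'secret',
        'type'     => 'mysql',
        'load'     => 1,
        // Option quoted from LoadBalancer.php above; assumed to be in seconds.
        'fakeSlaveLag' => 10,
    ),
);
```

Note that this only fakes the lag as seen by MediaWiki; the replica's data is not actually behind unless real replication happens to be delayed as well.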

That is quite nice.

@Nikerabbit can you take the lead on this and figure out with the rest of the devs / Beta Cluster users whether it is a feature we should enable? I am all for it since it matches your use case, but I am wondering what other people will think about it. If not, just unassign yourself, but I guess we will then decline the task since it would lack a champion.

T59583 is a dupe?

That's a proposed solution, AFAIK.

hashar changed the task status from Open to Stalled. Nov 9 2015, 1:11 PM
hashar removed Nikerabbit as the assignee of this task.
Marostegui subscribed.

Since 10.2 (we are migrating away from 10.1 to 10.4) we have this integrated in MariaDB: https://mariadb.com/kb/en/delayed-replication/
From the doc:
CHANGE MASTER TO master_delay=3600;

Krinkle added subscribers: RhinosF1, Krenair, Petrb and 2 others.

To do this by default [in Beta Cluster] seems incompatible with T87220. However, having an easy and documented way to induce lag seems useful indeed. Perhaps something one can [temporarily] cause by running a CLI command from a beta host. Or perhaps by having a depooled replica that is always lagged and that can be selected via a WikimediaDebug option.

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:14 AM