Page MenuHomePhabricator

Flow: don't perform reads on DB_MASTER
Closed, DeclinedPublic

Description

There's a couple of places where we perform read queries against DB_MASTER. We shouldn't be doing that, it kind of defeats the purpose of having slave DBs anyway, if we keep directing reads to the same master server, and it may eventually get overloaded.

Breaking down the problem:

  • we have to make sure no (potentially) lagged slave data is cached.

We only cache DB-hits, not DB-misses. That's good, this means that if a lagged DB says some data is not there, that won't be saved to cache where it could persist for 3 more days. Instead, as soon as slave DBs have caught up, the most recent changes will appear.

However, there may be cases where we've *updated* existing DB data. In that case, it may still be possible for a lagged DB server to respond with (outdated) data that will be stored in cache until it expires.

  • we have to make sure chronology is respected.

Core has ChronologyProtector in place to make sure that every consecutive request is always served by a DB at least as recent as the previous request. This already works on external DBs too (https://gerrit.wikimedia.org/r/#/c/58691/)

  • we have to make sure same-request queries don't query lagged slaves after data was written to master.

ChronologyProtector only kicks in at the beginning of a request, not for every query. So when we're performing writes to DB_MASTER, we should make sure that any follow-up read query to DB_SLAVE is to a slave that is at least in the same state DB_MASTER is in.

In AFTv5, I've fixed this by checking in the own getDB() wrapper. If $db is DB_MASTER, I
mark some variable as true, and make sure that any follow-up DB_SLAVE call is converted to DB_MASTER if that variable was set to true. This auto-direct all (and only those) calls to DB_SLAVE if we've just written new data to master.
That code is at: https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FArticleFeedbackv5/60ea34d65207b5b71a6a617ff0a2129326ac9442/data%2FDataModelBackend.LBFactory.php#L36

We should address #3, and make sure that #1 can never happen.

Anything else I missed?


Version: unspecified
Severity: normal

Details

Reference
bz62528

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 3:00 AM
bzimport set Reference to bz62528.
bzimport added a subscriber: Unknown Object (MLST).

And, obviously, after those issues have been fixed, we should actually change all getDB( DB_MASTER ) calls for read-purposes to DB_SLAVE ;)

I think we've got this pretty much covered ATM. Let's close this and file new bugs should specific issues arise later.