Page MenuHomePhabricator

Broken failover for DB slave connection errors
Closed, ResolvedPublic

Description

Author: ctwoo

Description:
Failure from reading from replicated database could bring down the rest of the site.

When the application tries to read a 'failed' db, it should catch the exception (error 2003?) and redirects it to another db. Instead, today, the application hangs, resulting in a site failure.


Version: unspecified
Severity: major

Details

Reference
bz29233

Event Timeline

bzimport raised the priority of this task from to High.Nov 21 2014, 11:36 PM
bzimport set Reference to bz29233.

Clarification: "DB" in this report seems to mean database server, so this is a load balancing thing.

More details from talking to CT (and my own surprise that this is not a "solved" problem):

  1. Apache's end up using a single slave DB for their DB access.
  2. When the DB goes down, client apps aren't smart enough to fail-over to the next available DB -- they sit there waiting until they timeout and fail.
  3. Apparently, because Apache/PHP is stupid and doesn't realize that it shouldn't try again, step 2 keeps repeating until the DB is fixed.

Apache/PHP doesn't select databases; MediaWiki's LoadBalancer class does. It should already be failing over to the next available server in the case of a connection error (MySQL error 2003 is "can't connect").

  • what's "failed db" mean?
  • what's "hangs" mean?
  • what's "site failure" mean?

Is this a situation where there *are* no other working slave servers to fail over to? What exactly is it doing?

CT, could you provide more details (per Brion's comments) so that Tim has a clue what is actually happening?

afeldman wrote:

During an 8 minute period on May 29 where mysql was down on db32 after being killed by the kernel due to an OOM condition but the server was otherwise up, en.wikipedia.org was observed to be down hard and 1209654 messages of the following type were logged:

Sun May 29 2:05:32 UTC 2011 srv230 enwiki Error connecting to 10.0.6.42: Lost connection to MySQL server at 'reading initial communication packet', system error: 111
Sun May 29 2:05:32 UTC 2011 srv170 enwiki Error connecting to 10.0.6.42: Can't connect to MySQL server on '10.0.6.42' (115)
Sun May 29 2:05:32 UTC 2011 srv175 enwiki Error connecting to 10.0.6.42: Lost connection to MySQL server at 'reading initial communication packet', system error: 111 (10.0.6.42)

From the message formatting, it appears that all of these messages were logged within DatabaseMysql::open().

The connection failures should have occurred within milliseconds and it does appear that the LoadBalancer class should handle such an occurrence with minimal impact. However, LoadBalancer::reportConnectionError only calls wfLogDBError() in the following two ways:

wfLogDBError( "LB failure with no last connection\n" );

wfLogDBError( "Connection error: {$this->mLastError} ({$server})\n" );

Neither of these messages appear in the dberror log during the enwiki / db32 outage.

(In reply to comment #3)

Apache/PHP doesn't select databases; MediaWiki's LoadBalancer class does. It
should already be failing over to the next available server in the case of a
connection error (MySQL error 2003 is "can't connect").

Yes, LoadBalancer has failover code. During the downtime on May 24, the appropriate sort of connection error was logged:

Tue May 24 13:41:01 UTC 2011 srv191 rowiki Connection error: No working slave server: No working slave server: Unknown error ()

This error indicates that the fallback sequence was exhausted, so whatever is going on, it's clear that we're not just letting connection error exceptions leak out of LoadBalancer. There were 4775 instances on that day.

If we get an error in Database::query(), then we don't close the connection and switch over to another database. I'm not sure if that's what CT is asking for. We don't appear to properly log "MySQL server has gone away" or "Lost connection to MySQL server during query" errors. They are dealt with by automatically reconnecting, and then if the reconnection fails, a DBConnectionError would probably be thrown.

There were 162,973 instances of "LB failure with no last connection" on May 24, which is somewhat concerning. But it's hard to know if that's the problem of interest.

What we really need to know is: when exactly was this "site failure", and what were the observed symptoms of it?

ctwoo wrote:

On that day, I noticed the EN site went down and was serving the error page when I tried to access en.wikipedia.org site. About 7 minutes later,when the db server came back up and the site came up as well.
Asher's earlier comment has more detail of the failure.
Tim - noticed you were referring to a problem on May 24. That problem happened on May 29.

afeldman wrote:

The errors around this outage begin at Sun May 29 2:05:21 UTC 2011 and end at Sun May 29 2:19:21 UTC 2011, see dberror.log-20110529.gz.

The fix in r90423 is now deployed.