memcached-serious log flooded with TIMED RETRY errors
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	ori
	Nov 11 2013, 5:28 AM

Description

memcached-serious.log on fluorine receives about half a million of these every hour:

2013-11-11 05:23:25 mw36 enwiki: Memcached error for key "enwiki:resourceloader:filter:minify-css:7:5fa71091f95043d13fcefbea197f4233" on server "127.0.0.1:11211": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY

It has been like that for several months now. It could be https://bugs.php.net/bug.php?id=60049: "Calling get with an invalid key and after that getMulti with a valid key does not work". This would explain why the keys that we're seeing in the log look sane; they are not the ones that trigger the failure. Could also be a twemproxy thing; it would be good to correlate the rate of errors with twemproxy deployment.

Version: wmf-deployment
Severity: normal

Details

Reference: bz56882

Event Timeline

• bzimport raised the priority of this task from to High.Nov 22 2014, 2:27 AM

• bzimport added projects: WMF-General-or-Unknown, platformeng.

• bzimport set Reference to bz56882.

ori created this task.Nov 11 2013, 5:28 AM

Possibly related to https://bugs.launchpad.net/libmemcached/+bug/928696

The keys "enwiki:newtalk:ip:10.0.0.14" and "enwiki:newtalk:ip:10.0.0.13" are preponderant in the logs and odd to behold. Possibly a different bug that is getting exposed by this one.

(In reply to comment #2)

The keys "enwiki:newtalk:ip:10.0.0.14" and "enwiki:newtalk:ip:10.0.0.13" are
preponderant in the logs and odd to behold. Possibly a different bug that is
getting exposed by this one.

Those are LVS servers. Pybal is regularly requesting http://en.wikipedia.org/wiki/Main_Page on all backends, which is essentially the only MW traffic to the pmtpa apaches.

The error message indicates that a libmemcached memcached_connect() call gave a result of MEMCACHED_SERVER_TEMPORARILY_DISABLED. This can happen if memcached_mark_server_for_timeout() was called on the server, which can happen if memcached_quit_server is called with io_death=true, which can happen in all sorts of different cases in io.cc and in one case in response.cc. More investigation is needed to determine exactly which case is causing it.

Unfortunately, MEMCACHED_BEHAVIOR_RETRY_TIMEOUT (i.e. retry_timeout in our config) has a minimum of one whole second. In the case of PHP talking to twemproxy, immediate reconnection would probably be a better policy. The one-second timeout is why we see floods of messages in bursts that last approximately one second each.

The fact that pmtpa apaches responding to pybal monitoring requests are heavily represented in the logs may be caused by transient packet loss on the pmtpa to eqiad link, which would increase the rate of connection timeouts.

Here's an aggregation of error messages in the current memcached-serious.log:

1 ITEM TOO BIG
2081 CONNECTION FAILURE
8199483 SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY
9 SERVER ERROR
101 A TIMEOUT OCCURRED
3 A BAD KEY WAS PROVIDED/CHARACTERS OUT OF RANGE

If you exclude TIMED RETRY messages, you get only 27 different appservers, whereas there are 387 appservers in the log overall. CONNECTION FAILURE and A TIMEOUT OCCURRED can lead to a flood of TIMED RETRY messages afterwards, but obviously they can't account for most of the TIMED RETRY messages. In fact, all 2081 CONNECTION FAILURE messages in this log came from snapshot3. So we need to look for unlogged causes of servers being marked for timed retry.

[Tim: Your browser seems to reset the Priority field.]

A potential workaround is to reduce the retry timeout to zero. I have written a libmemcached patch which allows this. Faidon will deploy it. The patch has been submitted upstream at https://bugs.launchpad.net/libmemcached/+bug/1251482

(In reply to comment #6)

[Tim: Your browser seems to reset the Priority field.]

Maybe priority changes don't trigger the "mid-air collision" page.

I set the timeout to -1 on production...which is kind of hacky looking at the C code, but should work for now. Tim's upstream patch seems to be merged as well.

Change 113387 had a related patch set uploaded by Aaron Schulz:
Set retry_timeout to -1 for memcached in eqiad only

https://gerrit.wikimedia.org/r/113387

Change 113387 merged by jenkins-bot:
Set retry_timeout to -1 for memcached in eqiad only

https://gerrit.wikimedia.org/r/113387

(In reply to Gerrit Notification Bot from comment #10)

Change 113387 had a related patch set uploaded by Aaron Schulz:
Set retry_timeout to -1 for memcached in eqiad only

https://gerrit.wikimedia.org/r/113387

This doesn't work for tampa, so most errors will remain. I guess they can probably be ignored.

Patch merged for Eqiad; somebody needs to decide if it's worth to do something about Tampa or not (and close the ticket accordingly).

All patches merged; resetting ticket status

(In reply to Andre Klapper from comment #13)

Patch merged for Eqiad; somebody needs to decide if it's worth to do
something about Tampa or not (and close the ticket accordingly).

Tampa is deaded

memcached-serious log flooded with TIMED RETRY errorsClosed, ResolvedPublicActions

Description

Details

Event Timeline

memcached-serious log flooded with TIMED RETRY errors
Closed, ResolvedPublic
Actions