
Random 503 Service Temporarily Unavailable errors from tools-webproxy
Closed, Resolved, Public

Description

Author: metatron

Description:
Since yesterday, some new errors randomly occur:

503 Service Temporarily Unavailable
followed by
The connection was reset

After reloading the page 2-4x, everything is back to normal.

So these are none of the old friends (404/OOM, 500 errors from lighttpd, which is working) but new ones. It looks like something in tools-webproxy.


Version: unspecified
Severity: major
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=65272

Details

Reference
bz65179

Event Timeline

bzimport raised the priority of this task to Needs Triage (Nov 22 2014, 3:18 AM).
bzimport added a project: Toolforge.
bzimport set Reference to bz65179.

I think I'm hitting the same issue:

$ curl -I "http://tools.wmflabs.org/mzmcbride/"
curl: (52) Empty reply from server

$ curl -I "http://tools.wmflabs.org/mzmcbride/"
HTTP/1.1 503 Service Temporarily Unavailable
Server: nginx/1.7.0
Date: Sun, 11 May 2014 08:36:09 GMT
Content-Type: text/html
Connection: keep-alive
X-Powered-By: PHP/5.3.10-1ubuntu3.10+wmf1
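
Because the failure is intermittent, a single curl call can miss it. The following is a minimal Python sketch for tallying outcomes over repeated HEAD requests. The function names `head_status` and `tally` and the default of 20 attempts are illustrative assumptions, not anything used in this report.

```python
import http.client
import urllib.error
import urllib.request
from collections import Counter


def head_status(url, timeout=10):
    """Return the HTTP status of a HEAD request as a string, or an error label."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return str(response.status)
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 503.
        return str(err.code)
    except (OSError, http.client.HTTPException) as err:
        # No usable answer at all: empty reply, connection reset, timeout, ...
        return type(err).__name__


def tally(url, attempts=20, fetch=head_status):
    """Request the URL several times and count how often each outcome occurs."""
    return Counter(fetch(url) for _ in range(attempts))
```

Calling `tally("http://tools.wmflabs.org/mzmcbride/")` returns a `Counter` keyed by status code or exception name, which makes it easier to see how often the 503 and the empty reply occur relative to successful responses.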

metatron wrote:

+ a new one
Error code: ERR_SSL_PROTOCOL_ERROR

It also disappears after a few reloads.

metatron wrote:

Since there's a coincidence with the gzip modification yesterday, how about removing that patch?

We also upgraded nginx yesterday, so that might also be the reason.

There are also issues with Pywikibot's nightlies stopping transfer after ~50kB. Might be related, but there are no 500's involved.

https://bugzilla.wikimedia.org/65272

Is this still happening? I rolled back the nginx change right after making that comment (and mentioned it on IRC, but didn't get time to respond here - sorry about that), so if it is 'gone' it is just an issue with the newer nginx version.

It's definitely happening right now:

503 Service Temporarily Unavailable

nginx/1.5.0

metatron wrote:

Some feedback:
After the rollback everything went back to the usual state (not normal). Right now there are 503s on all channels.

warnckew wrote:

Same problem, and I see no entries in the access.log or error.log. lighttpd appears to be running, thus it seems HTTP requests don't make it past the proxy.

*** Bug 65272 has been marked as a duplicate of this bug. ***

Change 133172 had a related patch set uploaded by Yuvipanda:
dynamicproxy: Use redis connection pooling

https://gerrit.wikimedia.org/r/133172

Yuvi currently has no power for his laptop, but he commented on IRC:

<yuvipanda_> mutante: and then I looked at the logs and the problem was that
there were just too many connections hanging around, since redis
is single threaded but nginx has multiple workers and I had set a
1s connection timeout but not set a connection pool
<yuvipanda_> mutante: so now I've a connection pool with 32s timeouts for
purging from the pool plus a 128 max connections limit, which
should work [22:46]
<yuvipanda_> scfc_de: can you comment on the bug saying this was the problem
and the solution is to restart redis on tools-webproxy, for now
at least? I don't have my primary machine with me now [22:47]
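
The pooling idea described in the IRC quote can be illustrated with a minimal Python sketch. This is not the code from change 133172: the `IdlePool` and `FakeConnection` names are hypothetical, and the sketch assumes that the 128 limit caps the number of idle connections kept for reuse, which the quote does not spell out. Connections that sit unused for 32 seconds are closed, and returned connections are dropped once the pool is full.

```python
import time
from collections import deque


class FakeConnection:
    """Stand-in for a real Redis connection (hypothetical, for illustration)."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class IdlePool:
    """Keep up to max_size idle connections, each for at most idle_timeout seconds."""

    def __init__(self, factory, max_size=128, idle_timeout=32.0, clock=time.monotonic):
        self._factory = factory            # creates a new connection on demand
        self._max_size = max_size          # assumed to cap idle connections
        self._idle_timeout = idle_timeout  # seconds before an idle connection is purged
        self._clock = clock
        self._idle = deque()               # (connection, release time), oldest first

    def _purge_expired(self):
        now = self._clock()
        while self._idle and now - self._idle[0][1] >= self._idle_timeout:
            conn, _ = self._idle.popleft()
            conn.close()

    def acquire(self):
        """Reuse the most recently released connection, or open a new one."""
        self._purge_expired()
        if self._idle:
            conn, _ = self._idle.pop()
            return conn
        return self._factory()

    def release(self, conn):
        """Return a connection to the pool, closing it if the pool is full."""
        self._purge_expired()
        if len(self._idle) >= self._max_size:
            conn.close()
            return
        self._idle.append((conn, self._clock()))

    def idle_count(self):
        self._purge_expired()
        return len(self._idle)
```

Without a pool, every request opens and closes its own connection, which matches the "too many connections hanging around" symptom described above; with a pool, a worker that calls `acquire()` and later `release()` reuses an existing connection whenever one is idle.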

Change 133172 merged by Andrew Bogott:
dynamicproxy: Use redis connection pooling

https://gerrit.wikimedia.org/r/133172