
beta.wmflabs.org unreachable (503 error) after migration to eqiad
Closed, ResolvedPublic

Description

http://en.m.wikipedia.beta.wmflabs.org/ currently gives the following response:

Request: GET http://en.wikipedia.beta.wmflabs.org/, from 127.0.0.1 via deployment-cache-mobile03 deployment-cache-mobile03 ([127.0.0.1]:3128), Varnish XID 723150036
Forwarded for: 216.38.130.164, 127.0.0.1
Error: 503, Service Unavailable at Mon, 31 Mar 2014 18:09:57 GMT
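A quick, hedged sketch of how error pages like the one above can be triaged from the command line: pull out the status code and the Varnish XID so repeated reports are easy to compare. The sample text is the error pasted above; the `sed` patterns are an assumption about the error-page layout, not part of any Wikimedia tooling.

```shell
# Sample Varnish error page, copied from this report.
err='Request: GET http://en.wikipedia.beta.wmflabs.org/, from 127.0.0.1 via deployment-cache-mobile03 deployment-cache-mobile03 ([127.0.0.1]:3128), Varnish XID 723150036
Forwarded for: 216.38.130.164, 127.0.0.1
Error: 503, Service Unavailable at Mon, 31 Mar 2014 18:09:57 GMT'

# Extract the HTTP status from the "Error: NNN, ..." line.
status=$(printf '%s\n' "$err" | sed -n 's/^Error: \([0-9]*\),.*/\1/p')
# Extract the Varnish transaction ID, useful for matching against varnishlog.
xid=$(printf '%s\n' "$err" | sed -n 's/.*Varnish XID \([0-9]*\).*/\1/p')
echo "status=$status xid=$xid"
```

The XID is what lets someone on the cache host correlate this exact request with the backend fetch that failed.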


Version: unspecified
Severity: blocker

Details

Reference
bz63315

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 3:00 AM
bzimport set Reference to bz63315.
bzimport added a subscriber: Unknown Object (MLST).

Request: GET http://de.wikipedia.beta.wmflabs.org/wiki/, from 127.0.0.1 via deployment-cache-text02 deployment-cache-text02 ([127.0.0.1]:3128), Varnish XID 105897343
Forwarded for: 78.94.xxx.xxx, 127.0.0.1
Error: 503, Service Unavailable at Mon, 31 Mar 2014 18:17:47 GMT

Request: GET http://bits.beta.wmflabs.org/images/wikimedia-button.png, from 78.94.153.111 via deployment-cache-bits01 deployment-cache-bits01 ([10.68.16.12]:80), Varnish XID 90403984
Forwarded for: 78.94.xxx.xxx
Error: 503, Service Unavailable at Mon, 31 Mar 2014 18:40:55 GMT

Pages requested while logged out (no cookies) are mostly served, presumably from cache(?), but bits does not work either.
Sometimes the connection also times out.

Change 122436 had a related patch set uploaded by Hashar:
beta: lower # of procs on jobrunner

https://gerrit.wikimedia.org/r/122436

The CirrusSearch update job kicked in and started parsing the whole of simplewiki, which is a bit too large for the beta cluster. Because our jobrunner (deployment-jobrunner01) is configured like production (launching a lot of jobs), the jobs were starving the application servers by querying /w/api.php ...

I lowered the number of job runners with https://gerrit.wikimedia.org/r/#/c/122436/

There might be some other issue.

I tried restarting both apaches, without much success. Eventually I killed the parsoid daemon, which was spamming the application servers as well.

The root cause is definitely Parsoid issuing a lot of queries against the API service.

So Parsoid was attempting to parse all of simplewiki. I stopped the daemon and restarted it. Monitoring /var/log/parsoid/parsoid.log, it is all quiet on that front now, so the API application servers are no longer being hammered.
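A minimal sketch of the kind of log monitoring described above: tally API-bound requests per minute to spot the burst that starved the API servers. The log lines here are illustrative placeholders, not copied from deployment-prep, and the timestamp format is an assumption; in practice one would `tail -f /var/log/parsoid/parsoid.log` and pipe through the same filter.

```shell
# Illustrative log excerpt (hypothetical format: ISO timestamp, method, path).
log='2014-03-31T18:09:01 GET /w/api.php?action=query&titles=Foo
2014-03-31T18:09:02 GET /w/api.php?action=query&titles=Bar
2014-03-31T18:10:15 GET /w/api.php?action=query&titles=Baz'

# Split on "T" and ":" so $2/$3 become hour/minute, then count api.php
# hits per minute; sort for stable output.
counts=$(printf '%s\n' "$log" \
  | awk -F'[T:]' '/api.php/ {m[$2":"$3]++} END {for (k in m) print k, m[k]}' \
  | sort)
echo "$counts"
```

A sudden jump in the per-minute count is the signal that the daemon has gone back to hammering the API.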


Also bits might be fully loaded by now.

I think the issue is solved now. The root cause was Parsoid attempting to fetch a bunch of page info from the API server for some reason. Restarting Parsoid apparently stopped the spam.

(In reply to Daniel Zahn from comment #9)

would that make https://gerrit.wikimedia.org/r/#/c/122436/ obsolete or not?

That one lowers the number of jobs run in parallel on the jobrunner01 instance. It is unrelated, but still a good thing to have, since the instance is less powerful than our prod servers.

Change 122436 merged by Alexandros Kosiaris:
beta: lower # of procs on jobrunner

https://gerrit.wikimedia.org/r/122436