
HHVM fcgi restart during scap runs cause 503s (and failed tests)
Closed, ResolvedPublic

Details

Reference
bz72366

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 3:57 AM
bzimport set Reference to bz72366.
bzimport added a subscriber: Unknown Object (MLST).

adding Ori in case this might be a consequence of HHVM

Reedy/Antoine: Can you two diagnose this one, please?

Chris: Having more than two data points would be helpful (there might be a pattern, e.g. tests running while code/db updates happen).

Greg: it happens so often that I hadn't tried to track it closely.

503s could very well be caused by https://gerrit.wikimedia.org/r/#/c/163078/ which restarts the hhvm server as part of each scap. While the hhvm fcgi process is shutting down and starting back up, Apache is likely to return a 503. I actually commented on this outcome in code review:

In beta I'm seeing errors that line up with hhvm restarts:
[proxy_fcgi:error] [pid 10116] [client 10.68.16.12:13867] AH01079: failed to make connection to backend: 127.0.0.1
In beta we don't use pybal so there is no way at all to depool/repool around a
restart. In production I don't think that we have any automatable means for
depooling/repooling at this time. How worried should I be about a 503 storm if we put
this into production? Without depooling it is going to happen.

That is a really strong culprit....

Greg: it happens so often that I hadn't tried to track it closely.

How often do we run scap?

(In reply to Ori Livneh from comment #8)

Greg: it happens so often that I hadn't tried to track it closely.

How often do we run scap?

On every commit to core or an extension.

163078 was merged on 14 October; the timing looks good for this being the culprit.

It seems to occur with increasing frequency, but that could be my bias.

(In reply to Chris McMahon from comment #10)

163078 was merged on 14 October; the timing looks good for this being the
culprit.

It seems to occur with increasing frequency, but that could be my bias.

If restarting hhvm is the cause, the frequency would depend on the level of gerrit +2 activity, the number of tests running, and the time needed for the hhvm fcgi process to restart. That is mostly impossible to calculate beforehand. Without a method to de-pool and drain an hhvm fcgi service before restarting it, it is quite likely that some requests will fail.
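
As a rough back-of-envelope illustration of why this bites long-running browser tests (every number below is an assumption for illustration, not a measurement):

  # Back-of-envelope estimate only; all numbers are illustrative assumptions.
  restart_downtime_s = 7.0      # assumed gap while the hhvm fcgi process restarts
  scap_interval_s = 10 * 60     # beta-scap-eqiad is kicked off every 10 minutes
  test_duration_s = 12 * 60     # an assumed browser test suite that runs > 10 min

  # Fraction of wall-clock time during which Apache would answer 503.
  downtime_fraction = restart_downtime_s / scap_interval_s

  # Probability that a single test run overlaps at least one restart window.
  overlap = min(1.0, (test_duration_s + restart_downtime_s) / scap_interval_s)

  print("backend unreachable ~{0:.1%} of the time".format(downtime_fraction))
  print("chance a test run spans a restart: {0:.0%}".format(overlap))

With those assumed numbers the backend is only down about 1% of the time, but any test suite that runs longer than the 10-minute scap interval is essentially guaranteed to cross a restart window.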

(In reply to Bryan Davis from comment #9)

(In reply to Ori Livneh from comment #8)

Greg: it happens so often that I hadn't tried to track it closely.

How often do we run scap?

On every commit to core or an extension.

Pedantic:
https://git.wikimedia.org/blob/integration%2Fconfig/HEAD/jjb%2Fbeta.yaml#L196

every 10 minutes we kick off a run

(In reply to Bryan Davis from comment #11)

(In reply to Chris McMahon from comment #10)

163078 was merged on 14 October; the timing looks good for this being the
culprit.

It seems to occur with increasing frequency, but that could be my bias.

If restarting hhvm is the cause, the frequency would depend on the level of
gerrit +2 activity, the number of tests running, and the time needed for the
hhvm fcgi process to restart. That is mostly impossible to calculate beforehand.
Without a method to de-pool and drain an hhvm fcgi service before restarting it,
it is quite likely that some requests will fail.

Is this (where "this" == "method to de-pool and drain an hhvm fcgi service before restarting") something that is planned to be added to scap for production? If so, when can we expect that to be in place?

(In reply to Greg Grossmeier from comment #13)

(In reply to Bryan Davis from comment #11)

If restarting hhvm is the cause, the frequency would depend on the level of
gerrit +2 activity, the number of tests running, and the time needed for the
hhvm fcgi process to restart. That is mostly impossible to calculate beforehand.
Without a method to de-pool and drain an hhvm fcgi service before restarting it,
it is quite likely that some requests will fail.

Is this (where "this" == "method to de-pool and drain an hhvm fcgi service
before restarting") something that is planned to be added to scap for
production? If so, when can we expect that to be in place?

Giuseppe is interested in adding etcd support to pybal that could make this possible but I don't think there is a timeline for it. Also beta does not use pybal so that is more work as well.

I'm a bit concerned about the possibility of 503 storms in prod as well.

(In reply to Bryan Davis from comment #14)

Giuseppe is interested in adding etcd support to pybal that could make this
possible but I don't think there is a timeline for it. Also beta does not
use pybal so that is more work as well.

/me nods

I'm a bit concerned about the possibility of 503 storms in prod as well.

Ditto.

Also, I hate to say it, but if this is causing Chris and team (and other teams) a ton of pain (it seems it is, and it could potentially be the cause of all the "OMG! unstable Beta Cluster!!!!11" flak we're getting lately, though that's not clear yet)... I don't know what to do.

I think someone should actually do the legwork to investigate this issue by cross-referencing 503s against the log files before we get carried away thinking about how we're going to solve the possible manifestation of this possible problem in production.

(In reply to Ori Livneh from comment #17)

I think someone should actually do the legwork to investigate this issue by
cross-referencing 503s against the log files before we get carried away
thinking about how we're going to solve the possible manifestation of this
possible problem in production.

From the above comment:

https://integration.wikimedia.org/ci/job/beta-scap-eqiad/27596/
corresponds to
https://saucelabs.com/jobs/eac7a841b8ff4b909419f7418a6b2194

Before I do more... that's a losing battle (clocks aren't synced between SauceLabs and Jenkins/Beta, browser tests aren't the way to tell if 503s are occurring, etc.). We need monitoring of the response codes for the Beta Cluster. If the fcgi restart is causing the monitoring to see 503s often, then we know that's the issue.
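
A minimal sketch of what such monitoring could look like, assuming the standard Apache combined log format and a hypothetical log path (this is not an existing tool; spikes that line up with beta-scap-eqiad runs would point at the hhvm restart):

  #!/usr/bin/env python
  """Count 503 responses per minute in an Apache access log (sketch only)."""
  import collections
  import re

  ACCESS_LOG = "/var/log/apache2/access.log"  # hypothetical path on a beta appserver

  # combined format: ... [22/Nov/2014:03:57:00 +0000] "GET /wiki/Foo HTTP/1.1" 503 ...
  LINE_RE = re.compile(r'\[(\d+/\w+/\d+:\d+:\d+):\d+ [^\]]+\] "[^"]*" (\d{3}) ')

  per_minute = collections.Counter()
  with open(ACCESS_LOG) as log:
      for line in log:
          match = LINE_RE.search(line)
          if match and match.group(2) == "503":
              per_minute[match.group(1)] += 1  # keyed by timestamp truncated to the minute

  for minute, count in sorted(per_minute.items()):
      print("{0} {1}".format(minute, count))

Graphing those counts next to the beta-scap-eqiad job history would confirm (or rule out) the correlation without relying on SauceLabs clocks at all.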

Can we agree that doing a scap every 10 minutes shouldn't cause 503s to randomly hit users and browser tests? We have browser tests that take longer than 10 minutes.

When the change "Make scap restart HHVM" ( https://gerrit.wikimedia.org/r/#/c/163078/ ) was deployed on the beta cluster, Bryan Davis immediately noticed the Apache timeout issue. Quoting him:

In beta I'm seeing errors that line up with hhvm restarts:

[proxy_fcgi:error] [pid 10116] [client 10.68.16.12:13867] AH01079:
failed to make connection to backend: 127.0.0.1

In beta we don't use pybal so there is no way at all to depool/repool around
a restart. In production I don't think that we have any automatable means for
depooling/repooling at this time. How worried should I be about a 503 storm
if we put this into production? Without depooling it is going to happen.

And I followed up with:

A few random ideas:

More proxy: assuming the HHVM restart is reasonably fast, could we set a timeout
in Apache? That would mean adding mod_proxy in between Apache and hhvm
fcgi, which will come with a bunch of potential failures of its own :/

Add in pybal some RPC to remotely depool/repool a node

Does hhvm support a graceful stop? I.e. stop accepting new connections (which
would make pybal depool the node) but keep processing ongoing requests?

Seems we need to make HHVM smarter ala "apache graceful-stop".

On the beta cluster, load balancing is handled by the backend Varnishes, which have the web servers as backends. There might be a way to ask the Varnishes to stop sending new connections to a web server that is about to be restarted.

Finally, the patch was made in order to clear HHVM's JIT and PCRE caches; both might need a feature request filed with HHVM upstream so they can be reclaimed automatically or cleared on demand from time to time.

When we talked with Emir from Facebook this spring he outlined how they handle restarts for each node:

  • Load balancer polls /status.php on each server to determine if it should receive requests
  • /status.php checks for a /hhvm/var/stopping file and if it exists returns a status to tell the LB not to send more requests

Stop

  • touch /hhvm/var/stopping to trigger depool response to LB
  • wait up to 30s for requests to drain
    • poll the built-in admin server to find current request load (/check-load)
    • if >0: sleep 5s
  • send stop signal to server via built-in admin server's /stop command
  • send SIGTERM to hhvm processes via pkill
  • poll /status.php once a second for up to 30s looking for a connection failure
  • send SIGKILL to hhvm processes via pkill
  • remove /hhvm/var/stopping trigger file

Start

  • remove /hhvm/var/stopping trigger file
  • warm OS cache of hhvm binary and hhbc cache via cat >/dev/null
  • start hhvm process
  • poll admin server once per second for up to 10s to see if it started
    • exit 1 if not seen
  • poll /status.php
    • exit 1 if you don't get a valid response

Restart

  • stop
  • start

See also https://github.com/emiraga/hhvm-deploy-ideas/blob/master/2restart/2restart.sh
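
To make the outline concrete, here is a rough Python transcription of the stop/start sequence, for discussion only. The /hhvm/var/stopping path and the admin server's /check-load and /stop endpoints come from the description above; the admin-server port, the /status.php URL and the use of the hhvm init script are assumptions, not our actual configuration.

  #!/usr/bin/env python
  """Sketch of the graceful stop/start sequence described above (not a drop-in script)."""
  import os
  import subprocess
  import time
  import urllib2

  STOPPING_FILE = "/hhvm/var/stopping"        # trigger file checked by /status.php
  ADMIN_URL = "http://127.0.0.1:9002"         # assumed port for hhvm's built-in admin server
  STATUS_URL = "http://127.0.0.1/status.php"  # assumed LB health-check URL

  def admin(cmd):
      """Call an endpoint on hhvm's built-in admin server."""
      return urllib2.urlopen("%s/%s" % (ADMIN_URL, cmd), timeout=5).read()

  def stop():
      open(STOPPING_FILE, "a").close()              # make /status.php tell the LB to depool us
      for _ in range(6):                            # wait up to 30s for requests to drain
          if int(admin("check-load")) == 0:
              break
          time.sleep(5)
      try:
          admin("stop")                             # ask the server to stop...
      except urllib2.URLError:
          pass                                      # ...it may already be going down
      subprocess.call(["pkill", "-TERM", "hhvm"])   # then SIGTERM
      for _ in range(30):                           # up to 30s for /status.php to stop answering
          try:
              urllib2.urlopen(STATUS_URL, timeout=1)
          except urllib2.URLError:
              break
          time.sleep(1)
      subprocess.call(["pkill", "-KILL", "hhvm"])   # last resort
      if os.path.exists(STOPPING_FILE):
          os.unlink(STOPPING_FILE)

  def start():
      if os.path.exists(STOPPING_FILE):
          os.unlink(STOPPING_FILE)
      # (the outline also warms the OS cache of the hhvm binary and the hhbc
      #  cache with cat >/dev/null before starting; omitted here)
      subprocess.call(["service", "hhvm", "start"])
      for _ in range(10):                           # up to 10s for the admin server to come up
          try:
              admin("check-load")
              break
          except urllib2.URLError:
              time.sleep(1)
      else:
          raise SystemExit(1)
      urllib2.urlopen(STATUS_URL, timeout=5)        # fails (non-zero exit) if still unhealthy

  def restart():
      stop()
      start()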

Thank you for the high-level synopsis; that seems to be a good base for adding graceful/graceful-stop support to the init/upstart script :-)

Pybal ( operations/debs/pybal ) has different monitors under pybal/monitors. The proxyfetch one uses twisted to do some HTTP queries. It seems similar.

Bug 73010 has been marked as a duplicate of this bug.

gerritadmin wrote:

Change 171277 had a related patch set uploaded by BryanDavis:
Revert "Make scap restart HHVM"

https://gerrit.wikimedia.org/r/171277

gerritadmin wrote:

Change 171277 merged by jenkins-bot:
Revert "Make scap restart HHVM"

https://gerrit.wikimedia.org/r/171277

I have removed the hhvm restart step from scap for now. I'm pretty sure that we will want to put it back in some form in the long term, but for now it is causing more pain than good. Let's see if this makes the mystery 503s go away.

Assuming not restarting HHVM every 10 minutes solved the issue :)