
Jobrunner will fail to spawn jobs on HHVM
Closed, Resolved, Public

Description

Simple script that reproduces the issue in production.

When running on HHVM, the jobrunner service (configured to use fcgi I suppose) fails to spawn curl requests with the following errors:

[Tue Aug 12 09:25:36 2014] [hphp] [12782:7f0b86813700:0:033896] []
Warning: fork failed - Cannot allocate memory in /srv/deployment/jobrunner/jobrunner/redisJobRunnerService on line 933
[Tue Aug 12 09:25:36 2014] [hphp] [12782:7f0b86813700:0:033897] []
Notice: Undefined index: 1 in /srv/deployment/jobrunner/jobrunner/redisJobRunnerService on line 935
[Tue Aug 12 09:25:36 2014] [hphp] [12782:7f0b86813700:0:033898] []
Notice: Undefined index: 2 in /srv/deployment/jobrunner/jobrunner/redisJobRunnerService on line 936
[Tue Aug 12 09:25:36 2014] [hphp] [12782:7f0b86813700:0:033899] []
Notice: Undefined index: 0 in /srv/deployment/jobrunner/jobrunner/redisJobRunnerService on line 938
[Tue Aug 12 09:25:36 2014] [hphp] [12782:7f0b86813700:0:033900] []
Warning: Not a valid stream resource in /srv/deployment/jobrunner/jobrunner/redisJobRunnerService on line 938
2014-08-12T09:25:36+0000: Could not spawn process in loop 0: curl -XPOST -s -a 'http://127.0.0.1:9002/rpc/RunJobs.php?wiki=glkwiki&type=ChangeNotification&maxtime=60&maxmem=300M'
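
The notices on lines 935 to 938 appear to follow directly from the failed fork on line 933: once the spawn fails there are no pipes to index or read. The service itself is PHP, so the following is only a Python analogue of the same pattern, not the jobrunner code; the names `spawn_with_pipes` and `run_once` are hypothetical. It shows a spawn that is checked before any pipe is touched, so a fork failure is reported once instead of cascading.

```python
import subprocess
import sys


def spawn_with_pipes(argv):
    """Start argv with stdin/stdout/stderr pipes.

    Returns the Popen object, or None if the process could not be started
    (for example when fork fails with ENOMEM or the binary is missing).
    """
    try:
        return subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
    except OSError as exc:
        # Report the failure once and stop; no pipes exist at this point.
        print(f"Could not spawn process: {exc}", file=sys.stderr)
        return None


def run_once(argv):
    """Run argv to completion; return (exit_code, stdout, stderr) or None."""
    proc = spawn_with_pipes(argv)
    if proc is None:
        return None  # never touch the pipes of a process that was not created
    out, err = proc.communicate()
    return proc.returncode, out, err
```

A caller would pass the curl command from the log as an argument list and treat a `None` result as "could not spawn", without going on to read from the pipes.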

I tried various tweaks (like raising the memory limit both in the jobrunner script and in HHVM), but none of them worked around the problem.

This seems to be a general problem with HHVM as we configure it, by the way: I wrote a small script that just uses proc_open to fork a curl request for the enwiki main page, and it produces the same error (see attachment).


Version: unspecified
Severity: blocker

Attached:

Details

Reference
bz69428

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 3:40 AM
bzimport added a project: WMF-JobQueue.
bzimport set Reference to bz69428.
bzimport added a subscriber: Unknown Object (MLST).

This happens with our packages, of course.

Not seeing this with:

sudo -u apache /usr/bin/php /srv/deployment/jobrunner/jobrunner/redisJobRunnerService --config-file=/etc/jobrunner/jobrunner.conf --verbose

Also, running some of the curl commands it issues gives normal, expected JSON replies.
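
To check the endpoint independently of the process-spawning path, the same request can be issued without forking curl at all. This is a minimal sketch, assuming the URL shape from the "Could not spawn process" log line and an empty POST body; the helper names `build_runjobs_url` and `post_runjobs` are hypothetical, and the structure of the JSON reply is not specified in this report.

```python
import json
import urllib.parse
import urllib.request


def build_runjobs_url(wiki, job_type, maxtime=60, maxmem="300M",
                      base="http://127.0.0.1:9002/rpc/RunJobs.php"):
    """Build the RunJobs.php URL with the query parameters seen in the log."""
    query = urllib.parse.urlencode(
        {"wiki": wiki, "type": job_type, "maxtime": maxtime, "maxmem": maxmem}
    )
    return f"{base}?{query}"


def post_runjobs(url, timeout=70):
    """POST to url with an empty body and decode the JSON reply."""
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `post_runjobs(build_runjobs_url("glkwiki", "ChangeNotification"))` on a jobrunner host would send the request from the log; the 70-second timeout is an arbitrary choice slightly above `maxtime`.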

22:40 <godog> btw from the issue above there we got the core dumped on mw1053:/tmp via the usual script
22:42 <godog> it looks like this too http://ganglia.wikimedia.org/latest/?r=day&cs=8%2F14%2F2014+5%3A41&ce=8%2F14%2F2014+21%3A13&c=Jobrunners+eqiad&h=mw1053.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=ALLGROUPS

If we don't get any specific clue from the core file about what was going on, we could try disabling some job types and see what that does.