Dying workers are not always restarted
Closed, Resolved (Public)

Description

In production it seems that dying workers (due to exceptions) are not always restarted. In some cases there are no 'restarting' messages at all in nohup.out despite most workers having disappeared.

Production is running node 0.8.2 and latest node_modules as of today.


Version: unspecified
Severity: normal

Details

Reference
bz49599

Event Timeline

bzimport raised the priority of this task to High. (Nov 22 2014, 1:51 AM)
bzimport added a project: Parsoid-Web-API.
bzimport set Reference to bz49599.

Command to get an overview of the number of node processes on each host in the parsoid group:
dsh -g parsoid 'echo -n "$(hostname) "; ps aux | grep "[n]ode" | wc -l'

I'm going to tackle this one today, first by trying to determine whether Unix signals, OOM, or stack crashers can reproduce this problem. gwicke indicates that simple exceptions aren't enough to reproduce it.

We are currently registering for a 'death' event, but that event is no longer available in the cluster module of node 0.8.17 (http://nodejs.org/dist/v0.8.17/docs/api/cluster.html), nor in 0.10. So it seems that we need to register for 'disconnect' and/or 'exit' instead.
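
For illustration, a minimal sketch of restarting workers from the 'exit' event rather than 'death'. This is not the code from the Gerrit change. The helper name superviseWorkers and its parameters are hypothetical, and it assumes the cluster object is passed in so that the restart logic can be exercised without forking real processes.

```javascript
// Hypothetical helper, not taken from the Parsoid source: forks `count`
// workers and forks a replacement whenever one of them exits.
// `cluster` is Node's cluster module, or any object that offers the same
// fork() and on('exit', ...) interface.
function superviseWorkers(cluster, count, log) {
  log = log || console.log;

  // The cluster module emits 'exit' in the master process when a worker
  // process ends; the listener receives the worker, its exit code, and
  // the signal (if any) that terminated it.
  cluster.on('exit', function (worker, code, signal) {
    log('worker ' + worker.id + ' exited (' + (signal || code) + '), restarting');
    cluster.fork();
  });

  // Start the initial pool of workers.
  for (var i = 0; i < count; i++) {
    cluster.fork();
  }
}
```

In the master process this would be called as superviseWorkers(require('cluster'), 4). The sketch restarts unconditionally. A real service would probably also want to rate-limit restarts so that a worker crashing at startup does not cause a fork loop.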

Related URL: https://gerrit.wikimedia.org/r/69151 (Gerrit Change I2b7119c928ed27e26181c67c6d300f526cd53801)

Just deployed this patch to production. Will monitor the number of Parsoid workers and close this bug if that number remains constant.

Things look good so far, so closing as fixed.