
Stashbot needs a restart loop
Closed, Declined (Public)

Description

Stashbot regularly dies, because of netsplits and whatnot. For the simplest cases, a restart loop should be (re)instated.

This is how it used to work, as documented by Tim: «It is started automatically from /etc/rc.local so it's not usually necessary to run this shell script directly. Sometimes the python client gets stuck. So kill it and let it restart itself [...]»
https://wikitech.wikimedia.org/w/index.php?title=Morebots&oldid=46101
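For illustration, the kind of restart loop described above could look roughly like the following. This is a minimal sketch, not the original shell script: `run_forever`, its parameters, and the 5-second delay are all hypothetical names and values chosen for the example. Unlike a continuous job that is only restarted on failure, it restarts the child whatever its exit code.

```python
import subprocess
import sys
import time


def run_forever(cmd, delay=5.0, max_restarts=None, sleep=time.sleep):
    """Run cmd and start it again every time it exits, whatever the exit code.

    max_restarts (hypothetical) only exists so the loop can be bounded in
    tests; None means keep restarting until the supervisor itself is killed.
    Returns the list of exit codes seen (only reachable when bounded).
    """
    exit_codes = []
    while max_restarts is None or len(exit_codes) <= max_restarts:
        # subprocess.call blocks until the child exits and returns its code.
        exit_codes.append(subprocess.call(cmd))
        # Pause before restarting so a crash loop does not hammer the server.
        sleep(delay)
    return exit_codes
```

A bounded call such as `run_forever([sys.executable, "bot.py"], max_restarts=2)` would run the (hypothetical) `bot.py` three times in total: the initial start plus two restarts. Note that a loop like this only helps when the process actually exits; it does nothing for a client that is stuck but still running.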


See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=50485

Details

Reference
bz59696

Event Timeline

bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 2:20 AM
bzimport set Reference to bz59696.
bzimport added a subscriber: Unknown Object (MLST).

(In reply to comment #1)

As it is running as

I meant to continue "a grid job" and write about continuous jobs on Tools that are all peachy, before I noticed that continuous jobs are only restarted when they exit with failure, which may or may not be the case for morebots.

Aklapper triaged this task as Medium priority. Apr 9 2015, 1:13 PM

Stashbot runs as a Kubernetes deployment, which is similar to a continuous grid job but more likely to restart if the process dies completely. It also implements a ping/pong check for IRC connection health, which should prevent most hung IRC connection issues (rLTSTe49aed30c052: Ping every 5 minutes and disconnect if pongs aren't received).
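The ping/pong idea can be sketched independently of any IRC library. The following is not the code from rLTSTe49aed30c052; `PongWatchdog` and its method names are hypothetical, and the 10-minute pong timeout is an assumption (only the 5-minute ping interval comes from the commit summary). The bot's event loop would be expected to send a PING when `should_ping()` is true, call `on_pong()` when a PONG arrives, and disconnect when `is_stale()` is true.

```python
import time


class PongWatchdog:
    """Track PING/PONG timing to decide when a connection looks hung."""

    def __init__(self, ping_interval=300.0, pong_timeout=600.0,
                 clock=time.monotonic):
        self.ping_interval = ping_interval  # seconds between PINGs
        self.pong_timeout = pong_timeout    # assumed: max silence before giving up
        self._clock = clock                 # injectable so tests can fake time
        now = clock()
        self._last_ping = now
        self._last_pong = now

    def should_ping(self):
        """True once ping_interval has passed since the last PING was sent."""
        return self._clock() - self._last_ping >= self.ping_interval

    def ping_sent(self):
        """Record that a PING was just sent."""
        self._last_ping = self._clock()

    def on_pong(self):
        """Record that the server answered with a PONG."""
        self._last_pong = self._clock()

    def is_stale(self):
        """True when no PONG has arrived within pong_timeout seconds."""
        return self._clock() - self._last_pong > self.pong_timeout
```

Disconnecting on a stale connection lets the normal reconnect (or pod restart) path take over, rather than leaving the bot attached to a dead socket.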

It is still possible for the Python run loop to hang somehow without the Kubernetes pod dying. This could be mitigated by adding some logic in the bot itself to touch a heartbeat file periodically and adding a matching liveness probe to the Kubernetes manifest.
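A minimal sketch of that heartbeat idea, assuming the bot and the probe agree on a file path. None of this exists in Stashbot as far as this task shows: `touch_heartbeat`, `heartbeat_is_fresh`, `probe_exit_code`, and the 120-second maximum age are hypothetical. The bot would call `touch_heartbeat()` from its run loop, and an exec-style liveness probe would run a small script that exits with `probe_exit_code(path)`.

```python
import os
import time
from pathlib import Path


def touch_heartbeat(path):
    """Create the heartbeat file, or update its mtime if it already exists."""
    Path(path).touch()


def heartbeat_is_fresh(path, max_age=120.0, now=None):
    """Return True if the file at path was touched within max_age seconds.

    A missing file counts as not fresh. now can be passed in for testing;
    it defaults to the current wall-clock time.
    """
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return False
    if now is None:
        now = time.time()
    return now - mtime <= max_age


def probe_exit_code(path, max_age=120.0):
    """Exit code for an exec probe: 0 means healthy, 1 means unhealthy."""
    return 0 if heartbeat_is_fresh(path, max_age) else 1
```

The maximum age would need to be comfortably longer than the interval at which the run loop touches the file, otherwise a healthy but briefly busy bot would be restarted.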

Luke081515 renamed this task from morebots needs a restart loop to Stashbot needs a restart loop. Sep 27 2017, 2:31 PM
Luke081515 updated the task description.
Luke081515 removed a subscriber: wikibugs-l-list.
Luke081515 subscribed.

Declined per T61696#3003892 and the general stability of the service since adding ping/pong connection verification.