morebots (adminbot) doesn't reliably detect disconnects
Closed, Resolved · Public

Description

On Jun 29 at 02:13 UTC adminbot logged a LocalisationUpdate. Five hours later, at 07:18 UTC, it disconnected from IRC with a server-generated "ping timeout" quit message. On Jul 1 Tim noticed that it was absent from the channel and checked the process state. The process still appeared to be connected, calling select() at regular intervals.

lsof showed:
adminlogb 20258 adminbot 4u IPv4 8395416 0t0 TCP wikitech-static:57198->HUBBARD.CLUB.CC.CMU.EDU:afs3-fileserver (ESTABLISHED)

strace showed:
1372650153.070033 select(5, [4], [], [], {0, 51423}) = 0 (Timeout)
1372650153.122075 gettimeofday({1372650153, 122173}, NULL) = 0
1372650153.122379 select(5, [4], [], [], {0, 100000}) = 0 (Timeout)
1372650153.222975 gettimeofday({1372650153, 223084}, NULL) = 0

According to http://poe.perl.org/?POE_Cookbook/IRC_Bot_Reconnecting, a good disconnection detection algorithm should periodically ping the server to check that the connection is still alive. morebots does not.

The IRC library that morebots uses, irclib, does have a 'set_keepalive' method on the ServerConnection object, which causes it to ping the server at regular intervals. morebots should use it. We should also add an explicit check that a ping reply has been received in a timely fashion, and recycle the connection otherwise.
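As a rough illustration of the "explicit check" proposed above, here is a minimal sketch of the bookkeeping involved. It deliberately does not touch irclib: `PingMonitor`, its method names, and the 60 s / 30 s defaults are all hypothetical and chosen only for the example. The bot would send a PING when `ping_due()` is true, call `pong_received()` from whatever handler sees the server's PONG, and recycle the connection once `connection_stale()` returns true.

```python
import time


class PingMonitor:
    """Tracks one outstanding keepalive ping (hypothetical helper, not part of irclib)."""

    def __init__(self, interval=60.0, timeout=30.0, clock=time.monotonic):
        self.interval = interval      # seconds between pings (assumed value)
        self.timeout = timeout        # max seconds to wait for a reply (assumed value)
        self.clock = clock            # injectable so the logic can be tested
        self.last_ping_sent = None    # send time of the unanswered ping, if any
        self.next_ping_due = clock() + interval

    def ping_due(self):
        """True when no ping is outstanding and the interval has elapsed."""
        return self.last_ping_sent is None and self.clock() >= self.next_ping_due

    def ping_sent(self):
        """Record that a PING was just written to the server."""
        self.last_ping_sent = self.clock()

    def pong_received(self):
        """Record the reply and schedule the next ping."""
        self.last_ping_sent = None
        self.next_ping_due = self.clock() + self.interval

    def connection_stale(self):
        """True when a ping has gone unanswered for longer than the timeout."""
        return (self.last_ping_sent is not None
                and self.clock() - self.last_ping_sent > self.timeout)
```

Keeping the timing logic separate from the IRC library means it does not depend on which irclib version is packaged.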


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=59696

Details

Reference
bz50485

Event Timeline

bzimport raised the priority of this task to High. (Nov 22 2014, 1:55 AM)
bzimport set Reference to bz50485.
bzimport added a subscriber: Unknown Object (MLST).

Happened again. Increasing priority.

See also:
https://bitbucket.org/jaraco/irc/issue/16/irc-client-ping-timeout-
https://bitbucket.org/jaraco/irc/issue/1/library-does-not-detect-that-connection-is

Additional notes:

  • Tends to happen during the weekend.
  • logmsgbot uses the same library, does not set a keepalive, and remains reliably connected.
  • morebots is hosted on wikitech-static, which is hosted on Rackspace.

This supports the theory that this is caused by an aggressive TCP idle timeout that the library is not sufficiently robust to handle.

The upstream package maintainer doesn't seem especially interested in chasing this down or in documenting his changes properly, so I don't think it'd be worth the effort to update the Debian package to pull in his latest changes. I think we should implement something very crude but effective, like having the bot keep a 5-minute timer that resets whenever any data at all is read from the socket. If the timer reaches 0, the bot should kill itself and have upstart or init respawn it.
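The crude timer described above could look roughly like the following sketch. It assumes the bot owns its own select() loop on a plain socket, which is not how irclib is structured, so treat `IdleWatchdog` and `run` as hypothetical names showing the idea rather than a patch for morebots. Any bytes read reset the timer; if nothing arrives within the limit, the loop returns and the caller exits so that upstart or init can respawn the process.

```python
import select
import socket
import sys
import time


class IdleWatchdog:
    """Counts down from `limit` seconds; reset() restarts the countdown."""

    def __init__(self, limit=300.0, clock=time.monotonic):
        self.limit = limit            # 5 minutes, as suggested in the comment above
        self.clock = clock
        self.last_activity = clock()

    def reset(self):
        self.last_activity = self.clock()

    def remaining(self):
        return max(0.0, self.limit - (self.clock() - self.last_activity))

    def expired(self):
        return self.remaining() == 0.0


def run(sock: socket.socket, watchdog: IdleWatchdog, handle_data) -> str:
    """Read from sock until it closes or stays silent past the watchdog limit."""
    while not watchdog.expired():
        # Never sleep longer than the time left on the watchdog.
        timeout = min(1.0, watchdog.remaining())
        readable, _, _ = select.select([sock], [], [], timeout)
        if readable:
            data = sock.recv(4096)
            if not data:              # peer closed the connection cleanly
                return "closed"
            watchdog.reset()          # any data at all counts as activity
            handle_data(data)
    return "idle"


# Hypothetical caller: exit non-zero so the supervisor restarts the bot.
# if run(irc_socket, IdleWatchdog(), process_line) == "idle":
#     sys.exit(1)
```

For this to fire only on dead connections, the server has to send something at least once per limit; IRC servers normally ping idle clients periodically, but the interval is server configuration and should be checked before picking the limit.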

We can also probably move it to tools.

*** Bug 51777 has been marked as a duplicate of this bug. ***

(In reply to comment #4)
> We can also probably move it to tools.

Filed as bug 52069.

morebots has gone missing from #wikimedia-operations again.

Aklapper lowered the priority of this task from High to Medium. (Apr 9 2015, 1:13 PM)