On Jun 29 02:13 UTC adminbot logged a LocalisationUpdate. Five hours later, at 07:18 UTC, it disconnected from IRC with a server-generated ping timeout quit message. On Jul 1 Tim noticed that it was absent from the channel and checked the process state. The process still appeared to consider itself connected: the TCP socket was in the ESTABLISHED state and it was calling select() at regular intervals.
lsof showed:
adminlogb 20258 adminbot 4u IPv4 8395416 0t0 TCP wikitech-static:57198->HUBBARD.CLUB.CC.CMU.EDU:afs3-fileserver (ESTABLISHED)
strace showed:
1372650153.070033 select(5, [4], [], [], {0, 51423}) = 0 (Timeout)
1372650153.122075 gettimeofday({1372650153, 122173}, NULL) = 0
1372650153.122379 select(5, [4], [], [], {0, 100000}) = 0 (Timeout)
1372650153.222975 gettimeofday({1372650153, 223084}, NULL) = 0
According to http://poe.perl.org/?POE_Cookbook/IRC_Bot_Reconnecting, a good disconnection detection algorithm should periodically ping the server to check that the connection is still alive. morebots does not.
The IRC library that morebots uses, irclib, does have a 'set_keepalive' method on the ServerConnection object, which causes it to ping the server at regular intervals. morebots should use it. We should also add an explicit check that a ping reply has been received in a timely fashion, and recycle the connection otherwise.
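The explicit ping-reply check could look roughly like the sketch below. It is deliberately independent of irclib: the class name PingWatchdog, its methods, and the interval and timeout values are all hypothetical, and wiring it into morebots (sending the actual PING, handling the pong event, reconnecting) is left to the caller.

```python
import time


class PingWatchdog:
    """Tracks keepalive pings and decides when a connection looks dead.

    Hypothetical helper, not part of irclib or morebots. The caller sends
    the real PING and reports when a PONG arrives.
    """

    def __init__(self, interval=60.0, timeout=30.0, clock=time.monotonic):
        self.interval = interval  # seconds between pings (assumed value)
        self.timeout = timeout    # seconds to wait for a reply (assumed value)
        self.clock = clock        # injectable so the logic can be tested
        self.last_ping = None     # send time of the unanswered ping, if any
        self.next_ping = clock() + interval

    def should_ping(self):
        """True when a ping is due and no earlier ping is still unanswered."""
        return self.last_ping is None and self.clock() >= self.next_ping

    def ping_sent(self):
        """Record that the caller has just sent a PING."""
        self.last_ping = self.clock()

    def pong_received(self):
        """Record a reply and schedule the next ping."""
        self.last_ping = None
        self.next_ping = self.clock() + self.interval

    def is_stale(self):
        """True when a ping has gone unanswered for longer than the timeout."""
        return (self.last_ping is not None
                and self.clock() - self.last_ping > self.timeout)
```

The bot's main loop already wakes up roughly every 100 ms (see the select() timeouts above), so it could call should_ping() and is_stale() on each pass, and disconnect and reconnect when is_stale() returns true.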
Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=59696