
monitor dispatch stats
Closed, Resolved (Public)

Description

We need a job that monitors dispatch stats on Wikidata and notifies us when the lags are too high.


Version: unspecified
Severity: normal
Whiteboard: u=dev c=infrastructure p=13 s=2014-06-17

Details

Reference
bz65291

Related Objects

Status | Subtype | Assigned | Task
Resolved | | None |
Resolved | | Dzahn |

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 3:09 AM
bzimport set Reference to bz65291.

This should be doable with something similar to the job queue monitor in ganglia that reports to IRC

This is a Perl script for Nagios that can retrieve dispatch values from the API and output a short message.

https://github.com/ChristopherHJohnson/check_dispatch

This should be reviewed on Gerrit somewhere and tested with Nagios. Nagios should be able to report alerts to IRC. Threshold for critical average lag should be established on production.

We need to add a warning threshold at a median lag of 2 minutes.
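A minimal sketch of how such a threshold check could map a median lag to Nagios/Icinga plugin states. It is not the check_dispatch script linked above. The function name, the input (a plain list of per-client lags in seconds), and the critical threshold of 600 seconds are assumptions made for illustration; only the 2-minute warning threshold comes from this task.

```python
from statistics import median

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_dispatch_lag(lags_seconds, warn_after=120, crit_after=600):
    """Return (exit_code, message) for a list of dispatch lags in seconds.

    warn_after=120 is the 2-minute median threshold from this task.
    crit_after=600 is a placeholder; the real value still has to be
    established on production.
    """
    if not lags_seconds:
        return UNKNOWN, "DISPATCH UNKNOWN - no lag values available"

    med = median(lags_seconds)
    # Performance data in the usual plugin format: label=value;warn;crit
    perfdata = f"median_lag={med:.0f}s;{warn_after};{crit_after}"

    if med > crit_after:
        return CRITICAL, f"DISPATCH CRITICAL - median lag {med:.0f}s | {perfdata}"
    if med > warn_after:
        return WARNING, f"DISPATCH WARNING - median lag {med:.0f}s | {perfdata}"
    return OK, f"DISPATCH OK - median lag {med:.0f}s | {perfdata}"


if __name__ == "__main__":
    import sys

    # Hypothetical sample values; a real check would read them from the API.
    code, message = check_dispatch_lag([45, 130, 200])
    print(message)
    sys.exit(code)
```

A real plugin would fetch the lag values from the API first; that part is left out here because the exact response format is not shown in this task.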

added an Icinga contact for aude to the private puppet repo.

icinga contact name "aude" can be used in contactgroups now, which are in the public puppet repo

i tried a couple of ways to escape this already to get around the error "command .. does not exit".. but that didn't work yet.

unfortunately see stuff like http://support.nagios.com/forum/viewtopic.php?t=10596&p=54166

I don't think it is related to that forum post: the regexp itself should not need escaping, since it worked when run on the command line.
Meanwhile this didn't help either: https://gerrit.wikimedia.org/r/158081
Will try to reproduce the problem on labs.

it is definitely escaping, i tried manually to change the arguments to something without special characters, that made it work. and that forum post discusses problems with escaping. i wonder how to reproduce in labs without a labs icinga instance or even a class that could be applied to an instance :(
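For illustration, a sketch of the two escapes that Nagios-style object configuration documents for check_command arguments: "!" separates arguments, so a literal one is written as "\!", and "$" delimits macros, so a literal one is written as "$$". Whether either of these is the actual cause of the problem in this task is not established here. The helper names and the command name check_http_regex are hypothetical.

```python
def escape_check_command_arg(arg: str) -> str:
    """Escape one argument for use in a Nagios/Icinga check_command line.

    "$" is doubled so it is not read as a macro delimiter, and "!" is
    backslash-escaped so it is not read as an argument separator.
    """
    return arg.replace("$", "$$").replace("!", "\\!")


def build_check_command(command_name: str, args: list[str]) -> str:
    """Join a command name and its escaped arguments with "!"."""
    return "!".join([command_name] + [escape_check_command_arg(a) for a in args])


if __name__ == "__main__":
    # Hypothetical command name and regexp, only to show the escaping.
    print(build_check_command("check_http_regex", ["www.wikidata.org", "lag$!"]))
```

Running the same arguments directly on the command line skips this config-level parsing, which would be consistent with a check that works by hand but fails from the config.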

Sorry, you are right: that post is actually about escaping the regexp for the nagios/icinga config file syntax.
I'm trying unsuccessfully to apply icinga::monitor to a puppetmaster-self.
Meanwhile another try to fix the problem: https://gerrit.wikimedia.org/r/#/c/158119/

That try didn't work either. Will try further in labs.

20:29 <+icinga-wm> RECOVERY - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1243 bytes in 0.699 second response time

work around in
https://gerrit.wikimedia.org/r/#/c/158319/

yea, uhm, i worked around this annoying issue as shown above.. that fixed it for now. we can turn this into a template and pass parameters if we care...

it was in "draft" status. i created the needed contacts in private puppet repo, then just published the draft and merged it. checked on neon. contacts have been created..

contactgroups.cfg: contactgroup_name wikidata
contactgroups.cfg: members wikidata-monitoring,aude,jzerebecki
contacts.cfg: contact_name wikidata-monitoring
contacts.cfg: email wikidata-monitoring..

etc..

Change 158492 had a related patch set uploaded by Dzahn:
icinga-wm - configure to also serve Wikidata

https://gerrit.wikimedia.org/r/158492

Change 158495 had a related patch set uploaded by Dzahn:
add irc-wikidata contact to wikidata services

https://gerrit.wikimedia.org/r/158495

Change 158492 merged by Dzahn:
icinga-wm - configure to also serve Wikidata

https://gerrit.wikimedia.org/r/158492

Change 158495 merged by Dzahn:
add irc-wikidata contact to wikidata services

https://gerrit.wikimedia.org/r/158495

via the last couple changes you now have an IRC bot (icinga-wm) in Wikidata

and it will output only stuff for the services it is a contact for .. :)

13:44 -!- icinga-wm [~icinga-wm@neon.wikimedia.org] has joined Wikidata

13:50 < icinga-wm> CUSTOM - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1248 bytes in 0.907 second response time

13:50 < mutante> weee, it works


root@neon:/var/log/icinga# cat irc-wikidata.log
CUSTOM - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1248 bytes in 0.907 second response time