We need a job that monitors dispatch stats on Wikidata and notifies us when the lags are too high.
Version: unspecified
Severity: normal
Whiteboard: u=dev c=infrastructure p=13 s=2014-06-17
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | None | T68070 Provide config for dispatch-lag-monitoring script
Resolved | | Dzahn | T67291 monitor dispatch stats
This should be doable with something similar to the job queue monitor in ganglia that reports to IRC
This is a Perl script for Nagios that can retrieve dispatch values from the API and output a short message.
https://github.com/ChristopherHJohnson/check_dispatch
This should be reviewed on Gerrit somewhere and tested with Nagios. Nagios should be able to report alerts to IRC. Threshold for critical average lag should be established on production.
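For illustration only, the threshold logic such a check needs can be sketched in Python. This is not the Perl script linked above. The function names, the message wording and the 60 second warning threshold are assumptions; the 120 second critical threshold is taken from the "higher than 2 minutes" check name used later in this task. The exit codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). Fetching the lag value from the API is deliberately left out.

```python
import sys

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_lag(lag_seconds, warn=60.0, crit=120.0):
    """Map a dispatch lag in seconds to (exit_code, one-line message).

    warn is an assumed value; crit mirrors the 2 minute check name.
    """
    if lag_seconds is None:
        return UNKNOWN, "DISPATCH UNKNOWN - no lag value available"
    if lag_seconds >= crit:
        return CRITICAL, f"DISPATCH CRITICAL - lag {lag_seconds:.0f}s >= {crit:.0f}s"
    if lag_seconds >= warn:
        return WARNING, f"DISPATCH WARNING - lag {lag_seconds:.0f}s >= {warn:.0f}s"
    return OK, f"DISPATCH OK - lag {lag_seconds:.0f}s"


def main(argv):
    """Read the lag from the first argument, print the status, return the exit code."""
    try:
        lag = float(argv[1])
    except (IndexError, ValueError):
        # A missing or non-numeric value is reported as UNKNOWN rather than OK.
        lag = None
    code, message = evaluate_lag(lag)
    print(message)
    return code
```

When used as a standalone script, the last line would be `sys.exit(main(sys.argv))`, so that Nagios sees the exit code. The real threshold would still have to be established on production, as noted above.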
added an Icinga contact for aude to the private puppet repo.
icinga contact name "aude" can be used in contactgroups now, which are in the public puppet repo
i tried a couple of ways to escape this already to get around the error "command .. does not exit", but that didn't work yet.
unfortunately see stuff like http://support.nagios.com/forum/viewtopic.php?t=10596&p=54166
I don't think it is related to that forum post: the regexp itself does not need escaping, since it worked when run on the command line.
Meanwhile this didn't help either: https://gerrit.wikimedia.org/r/158081
Will try to reproduce the problem on labs.
it is definitely escaping: i manually changed the arguments to something without special characters, and that made it work. and that forum post discusses problems with escaping. i wonder how to reproduce this in labs without a labs icinga instance or even a class that could be applied to an instance :(
Sorry, you are right: that post is actually about escaping the regexp from nagios/icinga config file syntax.
I'm trying unsuccessfully to apply icinga::monitor to a puppetmaster-self.
Meanwhile another try to fix the problem: https://gerrit.wikimedia.org/r/#/c/158119/
20:29 <+icinga-wm> RECOVERY - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1243 bytes in 0.699 second response time
workaround in https://gerrit.wikimedia.org/r/#/c/158319/
yea, uhm, i worked around this annoying issue as shown above.. that fixed it for now. we can turn this into a template and pass parameters if we care...
it was in "draft" status. i created the needed contacts in private puppet repo, then just published the draft and merged it. checked on neon. contacts have been created..
contactgroups.cfg: contactgroup_name wikidata
contactgroups.cfg: members wikidata-monitoring,aude,jzerebecki
contacts.cfg: contact_name wikidata-monitoring
contacts.cfg: email wikidata-monitoring..
etc..
Change 158492 had a related patch set uploaded by Dzahn:
icinga-wm - configure to also serve Wikidata
Change 158495 had a related patch set uploaded by Dzahn:
add irc-wikidata contact to wikidata services
via the last couple changes you now have an IRC bot (icinga-wm) in Wikidata
and it will output only stuff for the services it is a contact for .. :)
13:44 -!- icinga-wm [~icinga-wm@neon.wikimedia.org] has joined Wikidata
13:50 < icinga-wm> CUSTOM - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1248 bytes in 0.907 second response time
13:50 < mutante> weee, it works
root@neon:/var/log/icinga# cat irc-wikidata.log
CUSTOM - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1248 bytes in 0.907 second response time