
Yell loudly of failed puppet runs on Beta Cluster instances
Closed, Resolved (Public)

Description

Sometimes puppet breaks, it happens, but we need to know when it happens in Beta Cluster.

Something similar to the production icinga puppet freshness check, or even tying it in with Zuul and reporting it if a puppet run fails.


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=67349

Details

Reference
bz67333

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 3:39 AM
bzimport set Reference to bz67333.
bzimport added a subscriber: Unknown Object (MLST).

Sadly we can't really use icinga properly on labs (so I'm told, due to the way resource collection works with puppet). Also, the prod icinga setup is not really replicable on labs either, with prod-specific config hard-coded everywhere (it would need to be converted to a module first).

A different solution needs to be thought of.

One thing we could do in beta would be to send puppet logs to the beta logstash instance and then announce failures to Cloud-Services and/or #wikimedia-qa via a logstash rule. See bug 60690 for the general request to get puppet logs into logstash.

https://gerrit.wikimedia.org/r/#/c/143193/ will log puppetagent metrics into labs graphite, including last run time.
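Once the last run time is available as a metric, a freshness check along the lines of the production one could be sketched roughly as below. This is only an illustration: the two-hour threshold, the function names, and the shape of the input (a mapping of instance name to last-run Unix timestamp) are all assumptions, not anything defined by the Gerrit change or by graphite.

```python
import time

# Hypothetical threshold: treat a puppet run older than two hours as stale.
STALE_AFTER_SECONDS = 2 * 60 * 60


def is_puppet_stale(last_run_epoch, now=None, max_age=STALE_AFTER_SECONDS):
    """Return True if the last puppet run is older than max_age seconds."""
    if now is None:
        now = time.time()
    return (now - last_run_epoch) > max_age


def stale_instances(last_runs, now=None, max_age=STALE_AFTER_SECONDS):
    """Given {instance_name: last_run_epoch}, return the stale names, sorted."""
    if now is None:
        now = time.time()
    return sorted(
        name for name, ts in last_runs.items()
        if is_puppet_stale(ts, now=now, max_age=max_age)
    )
```

How the timestamps are fetched from graphite is left out on purpose, since that depends on how the metrics end up being named.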

(In reply to Yuvi Panda from comment #1)

Sadly we can't really use icinga properly on labs (so I'm told, due to the
way resource collection works with puppet). Also, the prod icinga setup is
not really replicable on labs either, with prod-specific config hard-coded
everywhere (it would need to be converted to a module first).

I wasn't thinking of necessarily copy/pasting the icinga config from prod, but we have a beta labs icinga (http://icinga.wmflabs.org/icinga/) which could theoretically be used for this, no?

(In reply to Greg Grossmeier from comment #4)

I wasn't thinking of necessarily copy/pasting the icinga config from prod,
but we have a beta labs icinga (http://icinga.wmflabs.org/icinga/) which
could theoretically be used for this, no?

Theoretically, yeah :) But that effort was largely undocumented and unpuppetized, and I don't know of anyone who has actually touched that in forever.

(In reply to Yuvi Panda from comment #5)

Theoretically, yeah :) But that effort was largely undocumented and
unpuppetized, and I don't know of anyone who has actually touched that in
forever.

Sad :(

We now have notifications on the IRC channel #wikimedia-qa, and a few people receive an hourly email until all instances pass puppet.
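The "keep notifying until everything passes" behaviour can be pictured with a small sketch like the one below. It is not the actual notifier: the function name and message format are hypothetical, and it only shows the idea that an alert is produced while any instance is failing and nothing is sent once all of them pass.

```python
def build_alert(failed_instances):
    """Return a one-line alert for failing instances, or None when all pass."""
    # De-duplicate and sort so the message is stable from hour to hour.
    failed = sorted(set(failed_instances))
    if not failed:
        return None  # Everything passes: nothing to announce.
    noun = "instance" if len(failed) == 1 else "instances"
    return "Puppet failure on {} {}: {}".format(
        len(failed), noun, ", ".join(failed)
    )
```

A periodic job could call this once an hour and only post to IRC or send mail when the result is not None.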

The only thing left to do is to have more people receive the notifications.

Blocks Bug 51497 - Setup monitoring for Beta cluster

yuvipanda claimed this task.

Yelling loudly now :)