Determine first pass list of icinga-alerting data from graphite.wmflabs
Closed, Resolved (Public)

Description

Let's get some icinga alerts so we know when things are going sideways in Beta Cluster.


Version: unspecified
Severity: normal

Details

Reference
bz70141

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:34 AM
bzimport set Reference to bz70141.
  • No puppet run for more than 1h
  • Presence of any puppet failures

What else?

My first-pass list (puppet failures on important VMs):

  • deployment-prep.deployment-bastion.puppetagent.failed_events.value > 0
  • deployment-prep.deployment-mediawiki01.puppetagent.failed_events.value > 0
  • deployment-prep.deployment-mediawiki02.puppetagent.failed_events.value > 0
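A rough sketch of how a check like the ones listed above could be evaluated against Graphite's render API (`/render?target=...&format=json`). This is not the actual check script used for this task; the function names, the `https` scheme, the 10-minute window, and the sample data are all assumptions made for illustration.

```python
import json
from urllib.parse import urlencode

# Host name comes from this task; the https scheme is an assumption.
GRAPHITE_BASE = "https://graphite.wmflabs.org"


def render_url(target, window="-10min"):
    """Build a Graphite render API URL that returns JSON for one target."""
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{GRAPHITE_BASE}/render?{query}"


def latest_value(series):
    """Return the most recent non-null value in one Graphite JSON series."""
    for value, _timestamp in reversed(series["datapoints"]):
        if value is not None:
            return value
    return None


def puppet_failed(render_json):
    """True if any returned series has a latest failed_events value above 0."""
    return any((latest_value(s) or 0) > 0 for s in json.loads(render_json))


# Made-up response in the shape Graphite returns: [value, timestamp] pairs.
sample = json.dumps([{
    "target": "deployment-prep.deployment-bastion.puppetagent.failed_events.value",
    "datapoints": [[0.0, 1410000000], [2.0, 1410000060], [None, 1410000120]],
}])
```

Trailing nulls are skipped because Graphite commonly returns a null for the newest, not yet filled, interval.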

I just realized that you can't hit Labs URLs from prod, so we can't actually do this right now :(

Two options:

  1. File an RT ticket to allow access to graphite.wmflabs.org from labmon1001.
  2. Wait for labmon1001 to be set up.

Unsure if Ops would be ok with (1), and (2) is blocked on the network config.

  • deployment-prep.deployment-mediawiki01.diskspace.root.byte_free.value < 2 gigs
  • deployment-prep.deployment-mediawiki02.diskspace.root.byte_free.value < 2 gigs
  • deployment-prep.deployment-mediawiki01.diskspace._var.byte_free.value < 1 gig
  • deployment-prep.deployment-mediawiki02.diskspace._var.byte_free.value < 1 gig
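The disk space thresholds above could be written down as data, roughly as follows. This is only a sketch: it assumes "gig" means GiB (1024^3 bytes), and `low_space_alerts` and its input dictionary are hypothetical names, not part of any existing script.

```python
GIB = 1024 ** 3  # assumption: "gig" is read as GiB, not 10**9 bytes

# Metric suffix -> minimum free bytes, taken from the list in this comment.
THRESHOLDS = {
    "diskspace.root.byte_free.value": 2 * GIB,
    "diskspace._var.byte_free.value": 1 * GIB,
}
HOSTS = ["deployment-mediawiki01", "deployment-mediawiki02"]


def low_space_alerts(latest_bytes_free):
    """Return the metrics whose latest free-bytes value is below its threshold.

    latest_bytes_free maps a full metric name to its most recent value;
    metrics with no data are skipped rather than treated as failures.
    """
    alerts = []
    for host in HOSTS:
        for suffix, minimum in THRESHOLDS.items():
            metric = f"deployment-prep.{host}.{suffix}"
            value = latest_bytes_free.get(metric)
            if value is not None and value < minimum:
                alerts.append(metric)
    return alerts
```

Whether missing data should alert or stay silent is a policy choice; the sketch stays silent.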

(In reply to Yuvi Panda from comment #4)

(2) https://rt.wikimedia.org/Ticket/Display.html?id=8163

(In reply to Greg Grossmeier from comment #6)

(2) https://rt.wikimedia.org/Ticket/Display.html?id=8163

That RT is now done (thanks mark!). So now just waiting on labmon1001 to be set up, I presume.

12:38 < YuviPanda> greg-g: labmon is setup - labmon.wmflabs.org :) Am sending metrics on to it now
12:38 < YuviPanda> I'll rename it to graphite.wmflabs.org soon

There is now monitoring for puppet failures and disk space (https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=labmon). The puppet failure checks need to be tweaked further, since they currently do not catch the case where puppet fails with a syntax error or something like that.

Note that the alerts are for all the machines in betalabs, not just for the ones listed. I added more features to our check_graphite script to make this kind of monitoring easy / possible.
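One way to cover every machine at once is a Graphite wildcard target, which returns one series per matching host. The sketch below is not how check_graphite does it; the function name, the assumed `<project>.<host>.<metric...>` target layout, and the sample data are illustrative only.

```python
import json

# Graphite expands "*" to every matching path component, here every host.
WILDCARD_TARGET = "deployment-prep.*.puppetagent.failed_events.value"


def failing_hosts(render_json):
    """Return sorted host names whose latest non-null value is above 0."""
    hosts = set()
    for series in json.loads(render_json):
        # Assumed layout: <project>.<host>.puppetagent.failed_events.value
        host = series["target"].split(".")[1]
        values = [v for v, _ts in series["datapoints"] if v is not None]
        if values and values[-1] > 0:
            hosts.add(host)
    return sorted(hosts)


# Made-up response with one healthy and one failing host.
sample = json.dumps([
    {"target": "deployment-prep.deployment-bastion.puppetagent.failed_events.value",
     "datapoints": [[0.0, 1410000000], [0.0, 1410000060]]},
    {"target": "deployment-prep.deployment-mediawiki01.puppetagent.failed_events.value",
     "datapoints": [[0.0, 1410000000], [1.0, 1410000060]]},
])
```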

Change 159694 had a related patch set uploaded by Yuvipanda:
labmon: Add low space check for / on betalabs

https://gerrit.wikimedia.org/r/159694

Change 159701 had a related patch set uploaded by Yuvipanda:
labmon: Add puppet freshness check for betalabs

https://gerrit.wikimedia.org/r/159701

Also, who is responsible for fixing the errors that pop up? There are puppet failures on videoscaler-01 now, and I've no idea how to fix those.

(On that note, I'd also remove myself from the alert groups once the initial setup has stabilized.)

Change 159694 merged by Andrew Bogott:
labmon: Add low space check for / on betalabs

https://gerrit.wikimedia.org/r/159694

Change 159701 merged by Andrew Bogott:
labmon: Add puppet freshness check for betalabs

https://gerrit.wikimedia.org/r/159701

Yuvi: Thanks for the first pass work! Once you remove yourself from the list of people who get the alerts, feel free to close this bug (the "first pass" of this is done).

(In reply to Greg Grossmeier from comment #17)

Done waiting, closing for housekeeping reasons :)