Easy way to define alerts for ganglia data
Closed, Invalid (Public)

Description

We collect a lot of useful information in ganglia, but currently can't set up alerts based on it. This means that we often discover issues only after users report them, as manual checking does not scale.


Version: wmf-deployment
Severity: normal
See Also:
https://rt.wikimedia.org/Ticket/Display.html?id=6955

Details

Reference
bz57882

Event Timeline

bzimport raised the priority of this task from to High. Nov 22 2014, 2:39 AM
bzimport set Reference to bz57882.
bzimport added a subscriber: Unknown Object (MLST).

Giuseppe owns the corresponding RT ticket, but no news there.

We have check_graphite now, which is used heavily for labs. I'm improving that constantly, so I'm tempted to consider this done.
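
For illustration only, here is a minimal sketch of the kind of threshold check a Graphite-based alert performs. It is not the actual check_graphite implementation: the function name `evaluate_series`, its parameters, and the sample data are all hypothetical. It assumes datapoints arrive as `[value, timestamp]` pairs, the shape Graphite's render API returns with `format=json`, and it maps the result onto the standard Nagios plugin exit codes.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_series(datapoints, warning, critical, min_fraction=0.5):
    """Classify a series of [value, timestamp] pairs against two thresholds.

    Hypothetical helper: a state is raised when at least `min_fraction`
    of the non-null datapoints exceed the corresponding threshold.
    Returns a (exit_code, message) tuple.
    """
    # Graphite reports gaps in a series as null (None in Python); skip them.
    values = [value for value, _timestamp in datapoints if value is not None]
    if not values:
        return UNKNOWN, "UNKNOWN: no valid datapoints"

    # Check the more severe threshold first.
    for code, label, threshold in (
        (CRITICAL, "CRITICAL", critical),
        (WARNING, "WARNING", warning),
    ):
        over = sum(1 for value in values if value > threshold)
        if over / len(values) >= min_fraction:
            return code, f"{label}: {over}/{len(values)} datapoints above {threshold}"

    return OK, f"OK: {len(values)} datapoints checked"


if __name__ == "__main__":
    # Made-up sample: disk usage percentages with one gap in the series.
    sample = [[10.0, 1416620000], [95.0, 1416620060], [97.0, 1416620120], [None, 1416620180]]
    code, message = evaluate_series(sample, warning=80, critical=90)
    print(code, message)
```

Requiring a fraction of datapoints to breach the threshold, rather than a single one, is one way to avoid alerting on short spikes; the 50% default here is an arbitrary choice for the sketch.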

For ganglia, however, I'm not sure at all. Let me re-title this to just be about ganglia.

yuvipanda renamed this task from Easy way to define alerts for ganglia and graphite data to Easy way to define alerts for ganglia data. Nov 24 2014, 12:49 PM
yuvipanda set Security to None.

drive-by comment: long term we'd like to have checks based only on graphite and let ganglia be only for aggregates / long term views (or no ganglia at all even)

chasemp claimed this task.
chasemp added a project: Grafana.

check_ganglia exists, I believe, but is not really advisable; check_graphite is used extensively in labs, and somewhat in prod. I am going to close this as not actionable. No further effort is really being put into ganglia. I think the spirit could continue, but it is probably covered by Grafana or observability work.

@chasemp, are we going to switch the default node monitoring (cpu, memory, network, disk space, IO etc) to graphite any time soon? If not, then I'd propose to keep this ticket open until we have a reasonably easy way to set up alerts on such information.

Sure, I think that's the plan, though @fgiunchedi could provide more details. I have no problem with that.

chasemp lowered the priority of this task from High to Medium. Jan 6 2015, 11:37 PM

Just a note: I believe those things are monitored now, but they are not currently alerted on from graphite data. The data should exist, though.

@fgiunchedi: Is this still a valid task, 22 months after setting this to stalled status?

@Aklapper no, we're deprecating ganglia so this is invalid now