Set up (more) monitoring of the Beta Cluster and expose it through Ganglia/Icinga/etc. Similar monitoring to what production has, just not set to page all of Ops when it breaks (yet! ;) ).
Description
Details
- Reference
- bz51497
Event Timeline
A breakdown of the useful monitoring systems:
Icinga
The puppet manifests already define Icinga checks for a lot of services; that is done via the global define monitor_service. As an example, Varnish instances are blessed with:
monitor_service { "varnish http ${title}":
    description   => "Varnish HTTP ${title}",
    check_command => "check_http_generic!varnishcheck!${port}",
}
This adds the check on icinga.wikimedia.org.
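Under the hood, a check_command like check_http_generic ultimately boils down to an HTTP probe that exits with a standard Nagios/Icinga plugin state. A rough Python sketch of that logic (the classify policy and function names here are illustrative, not the actual plugin):

```python
import urllib.request
import urllib.error

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

def classify(status):
    """Map an HTTP status code to a plugin state (illustrative policy)."""
    if 200 <= status < 400:
        return OK
    if 400 <= status < 500:
        return WARNING
    return CRITICAL

def check_http(url, timeout=5):
    """Probe a URL and return (state, message), Nagios-plugin style."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as e:
        # HTTPError must be caught before OSError (it is a subclass).
        status = e.code
    except OSError as e:
        return CRITICAL, "connection failed: %s" % e
    return classify(status), "HTTP %d" % status
```

A real plugin would also enforce the timeout as a WARNING/CRITICAL threshold and print a perfdata line, but the state mapping above is the core of it.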
We could get ops involved in setting up the Labs instance for beta and do the configuration hack that would prevent paging and send emails/messages instead.
Ganglia
All labs instances are automatically added in a Ganglia instance:
http://ganglia.wmflabs.org/latest/?r=hour&s=by+name&c=deployment-prep&tab=m
That seems to cover our needs.
Graphite
That would be very nice to have, especially the profiling bits. That project does not have any documentation besides the puppet manifests, though. Probably lower priority compared to Icinga.
The way it is done in puppet is by collecting exported resources, which is disabled on Labs for security reasons.
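For reference, the exported-resources pattern in production looks roughly like this (a sketch, not the actual manifests; resource titles and parameters are illustrative):

```puppet
# On each monitored node: export a check, but do not realize it locally.
@@nagios_service { "check_http_${::hostname}":
    host_name     => $::fqdn,
    check_command => 'check_http',
}

# On the Icinga server: collect every exported nagios_service from the
# puppet master's stored configs / PuppetDB.
Nagios_service <<| |>>
```

The security concern above is exactly this collection step: the collector trusts whatever facts ($::fqdn etc.) the exporting host reported.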
The fatal/exception counts are now reported on the Labs Ganglia instance, on the deployment-fluoride.pmtpa.wmflabs node.
scott.leea wrote:
If this is still an issue, can I work on it? If so, please provide any additional details I need to get started.
I chatted yesterday with Yuvi a bit about monitoring and its challenges, and he reminded me that the main problem with applying the prod setup to Labs is that roots can fake Puppet facts by altering facter and thus control to some degree the exported resources (which in themselves are harmless as their template is reviewed by ops in operations/puppet). So the monitoring in Labs would require all monitoring resources to be audited with the assumption that all host data is hostile. Still, I don't like to let go of a working configuration that is tested every day :-).
So two things that crossed my mind this morning:
a) For root at Tools, I had to sign a contract where WMF promises to sue my ass off if I should do something funny. If we could limit the collection of monitoring resources to hosts in Labs projects with roots that are legally bound in a similar way (Tools, Beta, projects by WMF employees, etc.), we could assume that no hostile data is injected. That would solve the problem for the Beta cluster (and Tools ...), but not for all Labs hosts.
b) What is the worst thing that a bright hacker could achieve by being root on a Labs project, carefully faking facts and bringing Labs's Icinga or Ganglia under their control if the latter are hosts in a Labs project themselves? Nothing. He would have started as root in a Labs project and ended as one as well. All the data in Icinga and Ganglia is public.
There are now alerts for the following things for betalabs:
- Low space on /var
- Low space on /
- Puppet staleness (warn at 1h, crit at 12h)
- Puppet failure events
Note that puppet failure events are different from puppet failing outright: failure events mean puppet did run, but some events failed. There's no detection yet for puppet itself failing completely.
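The staleness check above can be pictured as a simple age comparison against the warn/crit thresholds. A minimal sketch (the function name and the idea of reading the last-run timestamp from puppet's last_run_summary.yaml are assumptions, not the actual check's code):

```python
import time

# Thresholds matching the alerts above: warn at 1 hour, crit at 12 hours.
WARN_AFTER = 1 * 3600   # seconds
CRIT_AFTER = 12 * 3600  # seconds

def puppet_staleness_state(last_run_ts, now=None):
    """Return 'OK', 'WARNING' or 'CRITICAL' given the epoch time of the
    last puppet run (e.g. as recorded in last_run_summary.yaml)."""
    now = time.time() if now is None else now
    age = now - last_run_ts
    if age >= CRIT_AFTER:
        return 'CRITICAL'
    if age >= WARN_AFTER:
        return 'WARNING'
    return 'OK'
```

Note this only detects puppet not having run recently; it cannot distinguish "agent disabled" from "agent crashing", which is the gap mentioned above.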
You can see those at https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=labmon
Thank you Yuvi for the monitoring! Do we have a way to tweak the body of email notifications? I find them hard to read :-D
These are in shinken.wmflabs.org now. Further things to be monitored should be filed as subtasks of this.
Mentioned in SAL (#wikimedia-releng) [2016-09-20T13:07:33Z] <godog> add deployment-prometheus01 instance T53497
Is this about adding beta cluster to the _production_ Icinga or to have a separate Beta Icinga in a Cloud VPS?
The previous comments don't explain what/who exactly this task is stalled on ("If a report is waiting for further input (e.g. from its reporter or a third party) and can currently not be acted on"). Hence resetting task status.
(Smallprint, as general orientation for task management: If you wanted to express that nobody is currently working on this task, then the assignee should be removed and/or priority could be lowered instead. If work on this task is blocked by another task, then that other task should be added via Edit Related Tasks... → Edit Subtasks. If this task is stalled on an upstream project, then the Upstream tag should be added. If this task requires info from the task reporter, then there should be instructions which info is needed. If this task is out of scope and nobody should ever work on this, then task status should have the "Declined" status.)