
Puppet failure on deployment-cache-bits01
Closed, Resolved (Public)

Description

Puppet has been failing for almost a month now and monitoring didn't catch it. Shinken just caught it...

Error: Failed to apply catalog: Could not find dependent Service[gmond] for Exec[replace varnish.pyconf] at /etc/puppet/modules/varnish/manifests/monitoring/ganglia.pp:26


Version: unspecified
Severity: normal

Details

Reference
bz73263

Event Timeline

bzimport raised the priority of this task to High. (Nov 22 2014, 3:48 AM)
bzimport set Reference to bz73263.
bzimport added a subscriber: Unknown Object (MLST).

Also on deployment-cache-text02:

Error: Failed to apply catalog: Could not find dependent Service[gmond] for File[/etc/ganglia/conf.d/vhtcpd.pyconf] at /etc/puppet/modules/varnish/manifests/monitoring/ganglia/vhtcpd.pp:15

Also on deployment-cache-mobile03

Error: Failed to apply catalog: Could not find dependency File[/usr/lib/ganglia/python_modules] for File[/usr/lib/ganglia/python_modules/varnish.py] at /etc/puppet/modules/varnish/manifests/monitoring/ganglia.pp:10

Looks like they're all related to labs no longer running gmond. That change was merged around the time this problem started happening.

We killed gmond because labs has no ganglia, and running it was just taking up time. So we either need to bring gmond back (boo?) or do some realm branching (boo!).
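
For illustration, realm branching could look roughly like the sketch below. This is a minimal sketch, not the actual varnish module code: the class name is hypothetical, the file content is a placeholder, and it assumes a global `$::realm` variable that is 'labs' on labs instances. The file path and the Service[gmond] relationship are taken from the errors above.

```puppet
# Hypothetical class name; not part of the real varnish module.
class varnish_ganglia_sketch {
    # Assumption: $::realm is 'labs' on labs instances, which no longer
    # have Service['gmond'] in their catalog.
    if $::realm != 'labs' {
        # Declare ganglia resources only where gmond is managed, so the
        # notify relationship always points at a resource that exists.
        file { '/etc/ganglia/conf.d/vhtcpd.pyconf':
            ensure  => present,
            content => "# placeholder content\n",
            notify  => Service['gmond'],
        }
    }
}
```

On labs the whole block is skipped, so the catalog never references Service[gmond] and the "Could not find dependent" error cannot be triggered by this resource.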

gerritadmin wrote:

Change 172776 had a related patch set uploaded by Yuvipanda:
cache: Don't setup ganglia monitoring on labs

https://gerrit.wikimedia.org/r/172776

Whelp, that patch only addressed the tip of the iceberg. Our varnish code is deeply entangled with our ganglia code, and labs instances don't run gmond. This causes failures and general puppy killing.

Will look at it with _joe_ tomorrow.

Yuvi: Mukunda can help out with this now that his time is starting to be less PHABRICATORPHABRICATORPHABRICATOR

Ah, indeed :) All help is welcome! :)

_joe_ and alex have also offered to help, since this involves ganglia a fair bit and alex is working on fixing our ganglia code.

@Mukunda: Can you co-ordinate with _joe_ and alex to get this fixed? I'd love to have this off my hands so I can continue the shinken work :)

I would think the ideal solution is to decouple the monitoring from the varnish module in puppet. But if it's heavily tangled then that might not be an easy task.

gerritadmin wrote:

Change 172776 abandoned by Yuvipanda:
cache: Don't setup ganglia monitoring on labs

Reason:
Superseded by https://gerrit.wikimedia.org/r/#/c/172974/1

https://gerrit.wikimedia.org/r/172776

Ok, the cache machines are ok now! \o/

Thanks _joe_!