
[OPS] Jenkins: puppet master fills /var on labs with yaml reports
Closed, ResolvedPublic

Description

integration-puppetmaster has a /var of only 1.9GB and most of it is filled up with:

110M /var/lib/git
1.2G /var/lib/puppet

Looking at the disk usage graphs, one can see it has close to 0 free space and is constantly going up and down every few hours.

https://tools.wmflabs.org/nagf/?project=integration#h_integration-puppetmaster_disk

The increases are puppet runs writing yaml report files to /var/lib/puppet/reports; the drops are the cleanup cronjob running.

https://github.com/wikimedia/operations-puppet/blob/fcf51231/modules/puppetmaster/manifests/scripts.pp#L35

I ran it manually once, deleting reports older than 2 hours instead of the usual 36 hours, and /var/lib/puppet shrank from 1.2G to 115M.
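
For reference, the cleanup linked above boils down to a cron resource wrapping a find -delete; a rough sketch of what it looks like (the actual resource in modules/puppetmaster/manifests/scripts.pp may differ in details):

    # Sketch only: prune yaml reports older than 2160 minutes (36 hours).
    # The schedule matches the crontab later shown on integration-puppetmaster.
    cron { 'removeoldreports':
        ensure  => present,
        user    => 'puppet',
        minute  => 27,
        hour    => [0, 8, 16],
        command => 'find /var/lib/puppet/reports -type f -mmin +2160 -delete',
    }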


Version: unspecified
Severity: normal

Details

Reference
bz73472

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 3:46 AM
bzimport set Reference to bz73472.

The /var on labs is indeed only 2GB. The puppetmaster reports take 600MB of disk right now.

modules/puppetmaster/manifests/scripts.pp has a cronjob 'removeoldreports' which removes the reports after 2160 minutes (36 hours). I am wondering whether we could use hiera() to set a lower retention time and run puppet less often. CCing Giuseppe and Yuvi.
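
Something along these lines would let each project tune the retention via Hiera; this is only a sketch of the idea (the patch Yuvi uploads below may well be structured differently):

    # Sketch: expose the retention as a class parameter so Hiera's
    # automatic lookup can override it per project, e.g.
    # "puppetmaster::scripts::keep_reports_minutes": 360
    class puppetmaster::scripts (
        $keep_reports_minutes = 2160,
    ) {
        cron { 'removeoldreports':
            ensure  => present,
            user    => 'puppet',
            minute  => 27,
            hour    => [0, 8, 16],
            command => "find /var/lib/puppet/reports -type f -mmin +${keep_reports_minutes} -delete",
        }
    }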

/var is /dev/vda2; I am wondering whether it can be extended somehow. CCing Andrew B and Marc-André.

(In reply to Antoine "hashar" Musso (WMF) from comment #1)

/var is /dev/vda2 , I am wondering whether it can be extended somehow.
CCing Andrew B and Marc-André.

The latest images, through some rather ugly trickery, have /var on a logical volume and thus are expandable at will. No such luck for the older images which have physical partitions.

I looked at the state of the beta cluster puppet master (deployment-salt).

There, /var/lib is a symlink to /srv/var-lib/, which gives more free space. The puppet master has reports sent to logstash, which explains why nothing is written to disk.

Change 174132 had a related patch set uploaded by Yuvipanda:
puppetmaster: Make time to keep old reports for configurable

https://gerrit.wikimedia.org/r/174132

(In reply to Antoine "hashar" Musso (WMF) from comment #3)

I looked at the state of the beta cluster puppet master (deployment-salt).

There, /var/lib is a symlink to /srv/var-lib/, which gives more free space.
The puppet master has reports sent to logstash, which explains why nothing is
written to disk.

On beta we have a patch to send reports to logstash which discards reporting on disk https://gerrit.wikimedia.org/r/#/c/143788/10/modules/puppetmaster/templates/30-logstash.conf.erb,unified
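
For context, which yaml gets written is governed by the puppet master's 'reports' setting: the default 'store' processor is what dumps the files under /var/lib/puppet/reports. Roughly the kind of puppet.conf fragment such a template would render; this is only an illustration, not the actual contents of 30-logstash.conf.erb:

    # Hypothetical rendering; the real template may set different or
    # additional options (e.g. where the logstash endpoint lives).
    [master]
        # replace the default 'store' report processor (which writes yaml
        # under /var/lib/puppet/reports) with a logstash one
        reports = logstash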

(In reply to Antoine "hashar" Musso (WMF) from comment #1)

The /var on labs is indeed only 2GB. puppetmaster reports takes 600MB of
disk right now.

Can we not just increase the size of the beta cluster instances' disk space? We've run into this issue many many many many times, and playing whack-a-mole with symlinks and cronjobs to move data around is not sustainable.

Greg --

For new instances /var/log is somewhat resizeable. For existing instances you can remount /var/log but that's very messy since every service expects to already have an open file and a directory in /var/log.

(In reply to Andrew Bogott from comment #7)

Greg --

For new instances /var/log is somewhat resizeable.

How much? Can we just change the default for new deployment-prep instances to be $large-enough-to-not-matter?

For existing instances
you can remount /var/log but that's very messy since every service expects
to already have an open file and a directory in /var/log.

Worst case scenario is creating a second instance of whatever with a larger disk, moving traffic to it, then shutting down the old one, right? Not saying we should do that soon, but... continued hacks like this are hurting the stability of Beta Cluster (as opposed to addressing the real underlying issue of too little space on the VMs we use for our integration environment which everyone depends on daily).

(In reply to Greg Grossmeier from comment #8)

(In reply to Andrew Bogott from comment #7)

Greg --

For new instances /var/log is somewhat resizeable.

How much? Can we just change the default for new deployment-prep instances
to be $large-enough-to-not-matter?

Resizeable up to the available space selected when the instance was originally created.

It should be possible to set up sizing of /var/log based on project. I'll have a look at that if that's the direction you want to go.

Worst case scenario is creating a second instance of whatever with a larger
disk, moving traffic to it, then shutting down the old one, right?

That's correct. In perfect-puppet-land, doing that should be trivial, but I've been led to understand that in the real world it's a big pain.

(as opposed to addressing the real underlying
issue of too little space on the VMs we use for our integration environment
which everyone depends on daily).

One might argue that the 'real problem' is unbounded log growth, and that beta just displays the symptoms sooner than production. But I don't know if the issue really is unbounded growth or if growth is bounded properly but just bounded outside the capacity of existing instances.

(In reply to Andrew Bogott from comment #9)

(In reply to Greg Grossmeier from comment #8)

(In reply to Andrew Bogott from comment #7)

Greg --

For new instances /var/log is somewhat resizeable.

How much? Can we just change the default for new deployment-prep instances
to be $large-enough-to-not-matter?

Resizeable up to the available space selected when the instance was
originally created.

It should be possible to set up sizing of /var/log based on project. I'll
have a look at that if that's the direction you want to go.

I guess we should weigh this ^ against the unbounded growth concern below.

Worst case scenario is creating a second instance of whatever with a larger
disk, moving traffic to it, then shutting down the old one, right?

That's correct. In perfect-puppet-land, doing that should be trivial, but
I've been led to understand that in the real world it's a big pain.

Sadly true, but that also points out other legitimate bugs :)

(as opposed to addressing the real underlying
issue of too little space on the VMs we use for our integration environment
which everyone depends on daily).

One might argue that the 'real problem' is unbounded log growth, and that
beta just displays the symptoms sooner than production. But I don't know if
the issue really is unbounded growth or if growth is bounded properly but
just bounded outside the capacity of existing instances.

Touché. But I'm still worried about all the differences between prod and beta that cause surprises :/

Just keeping the heat on this bug, we had an outage this morning (times in Eastern US):
07:49 < icinga-wm> PROBLEM - BetaLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: deployment-prep.deployment-mediawiki01.diskspace._var.byte_avail.value (33.33%)

That probably caused the outage (the only other thing around that time is bug 73567, which hasn't been fixed/reverted, yet beta is back up).

I *really really really* want to just throw hardware at the problem, but it's a pain given how OpenStack/Beta work, and I'm getting annoyed by all the warnings that we can't do anything else about. Our (Release Engineering's) job is not to rework prod logging policies on a case-by-case basis to make them work in Beta. Continued diff creation for reasons like that only complexifies (it's a word) things.

I know of two reasons why the HHVM application servers on the beta cluster fill /var/:

Bug 73262 - hhvm apache fills /var/log/apache2 with access logs

They need to send their logs to syslog (which would then end up in the logstash instance) instead of writing debug / access logs to disk.

Some bug I can't find right now: the HHVM coredumps end up under /var/ as well, when they should be saved to /data/project (since we care about them) and garbage collected automatically (Bryan wrote a cron to handle that; see the sketch after this list).

Finally, this bug, with puppet filling the puppet master's disk, is being worked on by Yuvi.
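
Purely illustrative, since I have not dug up Bryan's actual change: a coredump garbage-collection cron in the same spirit as removeoldreports could look like this, with the path, resource name and retention below invented for the example:

    # Hypothetical sketch; path, name and retention are placeholders.
    cron { 'cleanup-hhvm-coredumps':
        ensure  => present,
        user    => 'root',
        minute  => 15,
        hour    => '*/4',
        command => 'find /data/project/cores -type f -name "core*" -mtime +3 -delete',
    }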

Sorry for hijacking this bug. I can't firefight all the issues nor triage / set priority on bugs flagged hhvm.

Change 174132 merged by Yuvipanda:
puppetmaster: Make time to keep old reports for configurable

https://gerrit.wikimedia.org/r/174132

Can someone with projectadmin on integration project edit https://wikitech.wikimedia.org/wiki/Hiera:Integration to add the line:

"puppetmaster::scripts::keep_report_minutes": 360

This will keep reports only for 6 hours.

(In reply to Yuvi Panda from comment #14)

Can someone with projectadmin on integration project edit
https://wikitech.wikimedia.org/wiki/Hiera:Integration to add the line:

"puppetmaster::scripts::keep_report_minutes": 360

This will keep reports only for 6 hours.

I have copy-pasted it onto:
https://wikitech.wikimedia.org/wiki/Hiera:Integration

Updated the git repo on integration-puppetmaster.eqiad.wmflabs to include the above Gerrit change and ran puppet. The puppet crontab still has the old entry:

# crontab -l -u puppet |egrep -v ^#
27 0,8,16 * * * find /var/lib/puppet/reports -type f -mmin +2160 -delete

:-/

<yuvipanda> hashar: bah, typo on my end. it's 'keep_reports_minutes' (s after report)

I have re-edited the wiki page and run puppet again:

Notice: /Stage[main]/Puppetmaster::Scripts/Cron[removeoldreports]/command: command changed
   'find /var/lib/puppet/reports -type f -mmin +2160 -delete'
   to 'find /var/lib/puppet/reports -type f -mmin +360 -delete'

# crontab -l -u puppet |egrep -v ^#
27 0,8,16 * * * find /var/lib/puppet/reports -type f -mmin +360 -delete

That solves the issue for the 'integration' project.

I did the same for 'deployment-prep' ( https://wikitech.wikimedia.org/w/index.php?title=Hiera:Deployment-prep&diff=135116&oldid=134263 ) and it is all happy as well.

Thank you Yuvi!