
Jenkins: Job runner slaves in labs no longer updated by puppet
Closed, ResolvedPublic

Description

I don't know for how many weeks or months this has been broken, but the logs have been full of failures since at least July 14.

Info: Retrieving plugin
Error: Could not set 'file' on ensure: cannot generate tempfile `/var/lib/puppet/lib/puppet/parser/functions/floor.rb20140719-26226-1g5xssw-9'
Error: Could not set 'file' on ensure: cannot generate tempfile `/var/lib/puppet/lib/puppet/parser/functions/floor.rb20140719-26226-1g5xssw-9'
Wrapped exception:
cannot generate tempfile `/var/lib/puppet/lib/puppet/parser/functions/floor.rb20140719-26226-1g5xssw-9'
Error: /File[/var/lib/puppet/lib/puppet/parser/functions/floor.rb]/ensure: change from absent to file failed: Could not set 'file' on ensure: cannot generate tempfile `/var/lib/puppet/lib/puppet/parser/functions/floor.rb20140719-26226-1g5xssw-9'
Info: Loading facts in /var/lib/puppet/lib/facter/facter_dot_d.rb
Info: Loading facts in /var/lib/puppet/lib/facter/physicalcorecount.rb
Info: Loading facts in /var/lib/puppet/lib/facter/apt.rb
Info: Loading facts in /var/lib/puppet/lib/facter/root_home.rb
Info: Loading facts in /var/lib/puppet/lib/facter/default_gateway.rb
Info: Loading facts in /var/lib/puppet/lib/facter/puppet_vardir.rb
Info: Loading facts in /var/lib/puppet/lib/facter/meminbytes.rb
Info: Loading facts in /var/lib/puppet/lib/facter/ec2id.rb
Info: Loading facts in /var/lib/puppet/lib/facter/pe_version.rb
Info: Loading facts in /var/lib/puppet/lib/facter/puppet_config_dir.rb
Info: Loading facts in /var/lib/puppet/lib/facter/projectgid.rb
Info: Caching catalog for i-000003cb.eqiad.wmflabs
Error: Could not retrieve catalog from remote server: cannot generate tempfile `/var/lib/puppet/client_data/catalog/i-000003cb.eqiad.wmflabs.json20140719-26226-15hd19i-9'
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
Error: Could not save last run local report: cannot generate tempfile `/var/lib/puppet/state/last_run_summary.yaml20140719-26226-n9zv3g-9'


Version: wmf-deployment
Severity: critical
See Also:
https://rt.wikimedia.org/Ticket/Display.html?id=7945

Details

Reference
bz68254

Event Timeline

bzimport raised the priority of this task to Unbreak Now!. Nov 22 2014, 3:30 AM
bzimport set Reference to bz68254.
bzimport added a subscriber: Unknown Object (MLST).

/var is full. Someone thought it would be a good idea to only allocate 2GB to /var for labs instances. Once Ubuntu is installed there are only a few hundred megabytes free :-/

integration-slave1001.eqiad.wmflabs$ du -h /var/log/diamond
1.1G /var/log/diamond

integration-slave1002$ du -h /var/log/diamond
1.2G /var/log/diamond

integration-slave1003:~$ du -h /var/log/diamond
1.1G /var/log/diamond

Basically, the diamond logs have never been rotated; the first entry in the log dates back to May 22nd.
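
For reference, a generic way to confirm the disk pressure and spot the biggest offenders under /var (illustrative commands, not the exact ones run on these instances):

$ df -h /var
$ du -sh /var/log/* | sort -h | tail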

Cleared out /var/log/diamond/diamond.log on the three slaves + on puppetmaster.
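
(For the record, emptying the file in place, e.g. with something like the command below, keeps the daemon's open file handle valid; this is a sketch of the approach, not necessarily the exact command used:)

$ sudo truncate -s 0 /var/log/diamond/diamond.log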

We would need an RT ticket to figure out why the diamond logs are not rotated by logrotate and whether this affects other instances / production.
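
As a starting point for that ticket, a minimal logrotate stanza for these logs might look roughly like the following (illustrative only; frequency and retention would still need to be decided):

/var/log/diamond/*.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    copytruncate
}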

(In reply to Antoine "hashar" Musso from comment #3)

Cleared out /var/log/diamond/diamond.log on the three slaves + on
puppetmaster.

Has that improved the puppet situation?

We would need an RT ticket to figure out why the diamond logs are not rotated
by logrotate and whether this affects other instances / production.

https://rt.wikimedia.org/Ticket/Display.html?id=7945

What is the _latest_ timestamp for these logs? My guess is they are orphaned and can be removed.
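
(For reference, the most recent write time can be checked without reading the whole file, e.g. with generic commands like these, which were not necessarily run as part of this task:

$ stat -c '%y' /var/log/diamond/diamond.log
$ tail -n 1 /var/log/diamond/diamond.log
)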

Looks like the underlying issue has already been fixed, thanks Chase.

https://bugzilla.wikimedia.org/show_bug.cgi?id=66458

Can you confirm that puppet is running successfully now?

(In reply to Chase from comment #5)

What is the _latest_ timestamp for these logs? My guess is they are
orphaned and can be removed.

I haven't looked at the last timestamp. The files were definitely being written to, though.

We use our own puppetmaster, which is rebased manually. YuviPanda commented on bug 66458 that:

It does log, but only logs errors. We killed the archive handler that logged all the *metrics* being sent, which was causing the huge log files.

So I guess that was fixed by a puppet change. Since puppet was broken on most instances, the fix never got applied to them.
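
For context, in a stock Diamond setup the change YuviPanda describes roughly amounts to dropping the archive handler from the handlers list in /etc/diamond/diamond.conf, e.g. (an illustrative sketch, not the actual Wikimedia puppet change):

[server]
# before: handlers = diamond.handler.graphite.GraphiteHandler, diamond.handler.archive.ArchiveHandler
handlers = diamond.handler.graphite.GraphiteHandler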

I have to verify all instances now.

The logs are smaller now :-) Thank you!