
/var is full on tools-webgrid-01 due to me spamming /var/log/auth.log with sudo
Closed, Resolved (Public)

Description


Version: unspecified
Severity: blocker

Details

Reference
bz64683

Event Timeline

bzimport raised the priority of this task to Lowest. (Nov 22 2014, 3:21 AM)
bzimport added a project: Toolforge.
bzimport set Reference to bz64683.

Until bug #61102 is fixed, I have installed a script, /home/scfc/bin/cleanup-php-cgis, via crontab on tools-login that kills orphaned php-cgi processes on tools-webgrid-01 and tools-webgrid-02.

During its development on April 27th, I started a faulty version of it that called "sudo kill -HUP" ad infinitum on the webnodes even when there were no php-cgi processes to kill. This added about 4 KByte/s to /var/log/auth.log and eventually filled up /var.
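One way to avoid this failure mode is to never invoke sudo when the list of PIDs is empty. The following is a minimal sketch in Python, assuming the cleanup script first collects the PIDs into a list; `build_kill_command` is a hypothetical helper for illustration and is not taken from the actual cleanup-php-cgis script.

```python
from typing import List, Optional


def build_kill_command(pids: List[int]) -> Optional[List[str]]:
    """Return the argv for a single "sudo kill -HUP" call, or None if idle.

    Hypothetical helper: the real script's structure is not shown in this report.
    """
    if not pids:
        # No php-cgi processes were found, so skip sudo entirely.
        # Every sudo call is logged to auth.log, even a pointless one.
        return None
    return ["sudo", "kill", "-HUP"] + [str(pid) for pid in pids]
```

A caller would run the returned command only when it is not None, so an idle run produces no sudo log entries at all.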

The correct version installed via crontab logs only about 1 KByte every 5 minutes (the ssh connections from tools-login to tools-webgrid-01/tools-webgrid-02).
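To put the two rates side by side, here is a small worked calculation in Python. It uses only the approximate figures quoted above (4 KByte/s and 1 KByte per 5 minutes); the per-day totals are derived from them and are correspondingly rough.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400

# Faulty version: about 4 KByte written every second.
faulty_kbyte_per_day = 4 * SECONDS_PER_DAY

# Correct version: about 1 KByte written every 5 minutes (300 seconds).
correct_kbyte_per_day = 1 * (SECONDS_PER_DAY // 300)

print(faulty_kbyte_per_day)                            # 345600
print(correct_kbyte_per_day)                           # 288
print(faulty_kbyte_per_day // correct_kbyte_per_day)   # 1200
```

In other words, the faulty version grew auth.log by roughly a third of a GByte per day, about 1200 times faster than the correct one.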

There was a hint that could have alerted me to the error: my installed script sometimes complained about processes disappearing between detection and killing. I assumed this was the occasional regular php-cgi shutdown, but in reality it was apparently just a race condition between the competing scripts.
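The "process disappeared between detection and killing" case can be handled explicitly rather than surfacing as a generic complaint. The sketch below is a local Python analogue, assuming the signal is sent with `os.kill` instead of "sudo kill -HUP" over ssh as in the real setup; `hup_if_present` and its `kill` parameter are hypothetical names introduced here.

```python
import os
import signal
from typing import Callable


def hup_if_present(pid: int, kill: Callable[[int, int], None] = os.kill) -> bool:
    """Send SIGHUP to pid; return False if the process no longer exists.

    The kill parameter exists only so the function can be exercised
    without signalling real processes.
    """
    try:
        kill(pid, signal.SIGHUP)
    except ProcessLookupError:
        # The process exited, or something else killed it, between
        # detection and signalling.
        return False
    return True
```

An occasional False is expected when a php-cgi process shuts down on its own. A steady stream of them, as in this incident, may point to a second process competing for the same targets and is worth counting or logging separately.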

I've inspected tools-login, tools-webgrid-01 and tools-webgrid-02 for any leftover processes, and there are now none. I also moved /var/log/auth.log to /data/project/admin/auth.log.scfc.bz2 and ran "stop rsyslogd && start rsyslogd" to get tools-webgrid-01 going again.

/var/log/auth.log would normally be kept for about four weeks, so I'll leave this bug open to either remove /data/project/admin/auth.log.scfc.bz2 in a month or merge it back into the logrotate rotation in two weeks, when it would normally have been compressed as well.

I've now moved auth.log.4.gz to auth.log.5.gz and /data/project/admin/auth.log.scfc.bz2 (re-compressed) to auth.log.4.gz.
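The report does not say how the re-compression from bzip2 to gzip was done. As one possible approach, the following Python sketch streams the data from one format to the other without loading the whole log into memory; `recompress_bz2_to_gz` is a hypothetical helper and the paths passed to it are up to the caller.

```python
import bz2
import gzip
import shutil


def recompress_bz2_to_gz(src_path: str, dst_path: str) -> None:
    """Decompress a .bz2 file and write the same contents as a .gz file."""
    with bz2.open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        # Copy in chunks so that a large log file is never held in memory.
        shutil.copyfileobj(src, dst)
```

Writing to a temporary name first and renaming it into place afterwards would avoid leaving a half-written file in the rotation if the copy is interrupted.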