hhvm creates core file in /tmp/ filling mediawiki02 labs instance root partition
Closed, Resolved, Public

Description

On deployment-mediawiki02:/tmp

-rw------- 1 apache apache 625M Aug 22 22:37 hhvm.29585.core
-rw------- 1 apache apache 641M Aug 22 22:37 hhvm.3112.core
-rw------- 1 apache apache 2.1G Aug 22 21:45 hhvm.25555.core
-rw------- 1 apache apache 2.3G Aug 22 20:56 hhvm.3314.core

That fills the / partition completely, which causes a variety of interesting side effects.

I have deleted them all.


Version: unspecified
Severity: normal

Details

Reference
bz69979

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:40 AM
bzimport set Reference to bz69979.
bzimport added a subscriber: Unknown Object (MLST).

The instance has some local disk space allocated under /srv/ (via puppet class role::labs::lvm::srv). That would be a good destination for core files: it is local to the instance and would avoid filling the NFS shared disk space.

The latest production puppet code for setting up hhvm moves the cores to /var/log/hhvm. We need to get deployment-mediawiki02 running puppet again and then probably make /var/log/hhvm a symlink to /data/project/logs/hhvm to ensure that we have lots of space for cores. This is of course only useful if we have someone watching for hhvm crashes and doing something to triage the bugs that cause them.

multiple HHVM cores per day seems like a real problem

(In reply to Chris McMahon from comment #3)

multiple HHVM cores per day seems like a real problem

Likely some new-to-us hhvm bug. Unfortunately, we'll need to wait for it to happen again if the cores are gone now.

I have deleted the 2GB+ core files on mediawiki02:/tmp/

Since Bryan and Giuseppe were working on this issue this morning, I'm assigning to Bryan. :)

Bandaid solution:

$ cat cleanup-hhvm-cores
#!/usr/bin/env bash

sudo mv /tmp/hhvm.*.core /data/project/hhvm-cores
sudo mv /var/log/hhvm/stacktrace* /data/project/hhvm-cores

$ crontab -l
*/2 * * * * /home/bd808/cleanup-hhvm-cores

Applied on both deployment-mediawiki01 and deployment-mediawiki02

Unlicking this cookie. The core (ha punny) problem remains but hopefully someone on the hhvm team can start triaging from the cores in /data/project/hhvm-cores

Resetting assignee and priority as this is no longer an OMG! situation.

For the record:
gjg@deployment-bastion:/data/project/hhvm-cores$ ls -al *core | wc -l
12

(between Aug 26 05:13 and Aug 27 22:34 UTC)

Still at 12 since last night. I wonder if the core dumps were caused by the relatively mega high load due to the automated security audit?

(In reply to Greg Grossmeier from comment #10)

Still at 12 since last night. I wonder if the core dumps were caused by the
relatively mega high load due to the automated security audit?

Or the fuzzing hit legit bugs in our php code that trigger hhvm segfaults.

Note I originally created the bug because hhvm sends cores to /tmp/ which should be configured in the hhvm conf file to point somewhere else, or become configurable (I think the path is hardcoded in hhvm).

(In reply to Antoine "hashar" Musso from comment #12)

Note I originally created the bug because hhvm sends cores to /tmp/ which
should be configured in the hhvm conf file to point somewhere else, or
become configurable (I think the path is hardcoded in hhvm).

--> https://gerrit.wikimedia.org/r/#/c/157294/2

Change 157294 had a related patch set uploaded by Hashar:
hhvm - make debug path configurable

https://gerrit.wikimedia.org/r/157294

Change 157294 abandoned by Dzahn:
hhvm - make debug path configurable

https://gerrit.wikimedia.org/r/157294

Maybe we should simply disable core dumps by default (by setting the proper limit / sysctl params)? We certainly could on labs.
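
For reference, a minimal sketch of the "limit" half of that idea, assuming the service is started from a shell wrapper. The helper name run_without_cores is made up for illustration and is not part of any existing tooling. It lowers the core size limit to zero in a subshell, so only the wrapped command and its children are affected:

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command with core dumps disabled for that
# command only. The limit is lowered inside a subshell, so the calling
# shell keeps its own limit.
run_without_cores() {
    (
        ulimit -c 0   # core file size limit of 0 blocks: no core file is written
        "$@"
    )
}

# Show the limit the wrapped command actually sees.
run_without_cores sh -c 'ulimit -c'   # prints 0
```

Lowering a limit never needs root, so this can be tried as any user before touching the real service configuration.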

FYI, by default, the linux kernel creates core files in the process's CWD. If you want to retain the core files just not in /tmp, you can give a file pattern (including path) in /proc/sys/kernel/core_pattern
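
A rough sketch of what that could look like, for anyone landing here later. The helper name core_pattern_for and the /srv/hhvm-cores directory are assumptions chosen for illustration. The %e, %p and %t specifiers are the kernel's own (executable name, PID, and time of the dump in seconds since the epoch):

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a core_pattern value that places cores in
# the given directory, named after the executable, PID and dump time.
core_pattern_for() {
    printf '%s/core.%%e.%%p.%%t\n' "$1"
}

# Show the pattern currently in effect (Linux only).
if [ -r /proc/sys/kernel/core_pattern ]; then
    cat /proc/sys/kernel/core_pattern
fi

# Applying a new pattern needs root, so it is left commented out.
# /srv/hhvm-cores is an assumed path; the directory must already exist
# and be writable by the crashing process.
# sysctl -w kernel.core_pattern="$(core_pattern_for /srv/hhvm-cores)"
```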

(In reply to Marc A. Pelletier from comment #17)

FYI, by default, the linux kernel creates core files in the process's CWD.
If you want to retain the core files just not in /tmp, you can give a file
pattern (including path) in /proc/sys/kernel/core_pattern

Right. That can easily be done in the pre-start stanza of the upstart job. But I don't agree with Antoine and Bryan that we should, in fact, do this. If it's important to retain core files, then let's keep them in /tmp. If the beta cluster app servers don't have enough space in /tmp, then that's the actual bug, and we should fix it by making sure they do.

Gratuitous and unprincipled divergence from production compromises both the beta cluster and production: the beta cluster because its fidelity to production is its very value and purpose, and production because the Puppet change needed to make the divergence possible means adding a useless knob to the manifests. Sometimes it's unavoidable, but I don't think this is one of those times.

While I can think of a number of good reasons why you'd want to keep cores in a development environment, would you even /want/ to have core dumps in prod at all in the first place?

(In reply to Marc A. Pelletier from comment #19)

While I can think of a number of good reasons why you'd want to keep cores
in a development environment, would you even /want/ to have core dumps in
prod at all in the first place?

I think I'd be fine with disabling them. We've had bugs before that we couldn't reproduce in our dev environments but we can re-enable core dumps if such a problem manifests again.

hhvm on beta cluster now dumps files to /var/tmp/hhvm which is a 2GB partition. Noticed on deployment-mediawiki01.eqiad.wmflabs. I have deleted the core file.

(In reply to Antoine "hashar" Musso (WMF) from comment #21)

hhvm on beta cluster now dumps files to /var/tmp/hhvm which is a 2GB
partition. Noticed on deployment-mediawiki01.eqiad.wmflabs. I have deleted
the core file.

I have updated my core file sweeping script for the new location:

#!/usr/bin/env bash

sudo mv /tmp/hhvm.*.core /data/project/hhvm-cores &>/dev/null
sudo mv /var/tmp/hhvm/*.core /data/project/hhvm-cores &>/dev/null
sudo mv /var/log/hhvm/stacktrace* /data/project/hhvm-cores &>/dev/null

This is ~bd808/cleanup-hhvm-cores on any deployment-prep host and runs from cron as my user on deployment-mediawiki0[12].
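
The script above silences mv when a glob matches nothing, which also hides real failures such as a full destination. A possible variant, sketched here under the assumption that skipping unmatched globs is all the redirect was for, checks each path first. The function name sweep_files is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: move every existing path given after the first
# argument into the destination directory named by the first argument.
sweep_files() {
    dest=$1
    shift
    for f in "$@"; do
        # An unmatched glob is passed through literally, so skip
        # anything that does not exist instead of letting mv complain.
        [ -e "$f" ] || continue
        mv -- "$f" "$dest"/
    done
}

# Intended use, as root, with the paths from the script above:
# sweep_files /data/project/hhvm-cores \
#     /tmp/hhvm.*.core /var/tmp/hhvm/*.core /var/log/hhvm/stacktrace*
```

With this shape, genuine mv errors still reach cron's mail instead of going to /dev/null.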

If some unit of code or configuration is

  • Generic (i.e., does not reference resources that are only available in some environments)
  • Correct (works as intended in the environment for which it was designed)

...and it causes Beta to break, then the bug is with Beta, not the unit of code.

The hhvm configuration has:

modules/hhvm/manifests/init.pp:119: core_dump_report_directory => '/var/log/hhvm',

We would need to vary it on beta cluster to point to a larger partition. Either under:

  • /srv/ which is local to the instances and has much more space (68G on deployment-mediawiki01)
  • /data/project/hhvm-cores/$::instance_name which is shared disk space and more permanent.

That will let us drop Bryan's cron script.

An additional reason I reported this bug is to make sure the core files in production are not going to fill the application servers' /var/ partition.

hashar set Security to None.

I have noticed that hhvm fills /var/tmp/core/ on the beta cluster because the kernel core directory is set to /var/tmp/core. From modules/base/manifests/environment.pp:

sysctl::parameters { 'core_dumps':
    values  => { 'kernel.core_pattern' => $core_dump_pattern, },
    require => File['/var/tmp/core'],
}

tidy { '/var/tmp/core':
    age     => '1w',
    recurse => 1,
    matches => 'core.*',
}

You can set the core dump path from hiera (see Hiera:tools on wikitech) to be shared storage or somewhere else, but unfortunately you cannot disable core dumps entirely that way.

yuvipanda claimed this task.

Beta mw instances now have a saner disk partition set up, and this hasn't happened in a while because of the NFS hack anyway...