
Soften qdel behaviour from KILL
Closed, Resolved (Public)

Description

At the moment, qdel KILLs the job; this is a bit rude.

If jsub called "qsub -notify", SGE would signal the job before KILLing it.
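As a minimal sketch of what such a wrapper could do (this is not jsub's actual code; the function name and the example arguments are hypothetical), the submit command would simply gain the "-notify" flag:

```python
def build_qsub_argv(job_script, extra_args=()):
    """Return a qsub command line that always requests warning signals.

    Hypothetical helper for illustration only: "-notify" is placed before
    any caller-supplied options, and the job script comes last.
    """
    return ["qsub", "-notify", *extra_args, job_script]


if __name__ == "__main__":
    # Print the command instead of running it, so no grid is needed.
    print(" ".join(build_qsub_argv("job.sh", ["-N", "demo"])))
```

Building the argument list separately from running it keeps the change easy to check without access to a grid.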

The signal is set by NOTIFY_KILL in "execd_params"; the default is SIGUSR2. I would favour SIGTERM (or a SIGHUP -> SIGINT -> SIGTERM cascade), as I suppose more programs already have a suitable handler for it.
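To illustrate what a "suitable handler" might look like on the job side, here is a minimal sketch in Python, assuming a POSIX system. The class and function names are hypothetical. The handler only sets a flag; the job loop checks it between work units and stops cleanly.

```python
import signal


class ShutdownFlag:
    """Remembers whether a termination-style signal has arrived."""

    def __init__(self):
        self.requested = False

    def _handle(self, signum, frame):
        # Keep the handler trivial: just record the request.
        self.requested = True

    def install(self, signals=(signal.SIGTERM, signal.SIGUSR1, signal.SIGUSR2)):
        """Register the handler for whichever signals the grid might send."""
        for sig in signals:
            signal.signal(sig, self._handle)


def run_job(work_units, flag):
    """Process work units until finished or until shutdown is requested."""
    completed = []
    for unit in work_units:
        if flag.requested:
            break  # a real job would close database/network connections here
        completed.append(unit)
    return completed
```

Note that SIGKILL cannot be caught, so a handler like this only helps if a catchable signal is sent first.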

The queue parameter "notify" defines the interval between signals. Given that many jobs in Tools use database and other network connections, I would be fairly generous here and propose 60 s. In the worst case of a SIGHUP -> SIGINT -> SIGTERM -> SIGKILL cascade, that means 180 s before the job is killed, which I find acceptable; for special cases, roots can always log into the exec node and kill at will.
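The 180 s figure follows from three intervals between four signals. A small sketch of that arithmetic (the function name is made up for illustration):

```python
def worst_case_seconds(cascade, notify_interval_s):
    """Seconds from the first signal in the cascade until the last one.

    One notify interval elapses between each pair of consecutive signals,
    so n signals are separated by n - 1 intervals.
    """
    return max(len(cascade) - 1, 0) * notify_interval_s


if __name__ == "__main__":
    cascade = ["SIGHUP", "SIGINT", "SIGTERM", "SIGKILL"]
    print(worst_case_seconds(cascade, 60))  # prints 180
```

With a single warning signal followed by SIGKILL, the same formula gives 60 s.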


Version: unspecified
Severity: enhancement

Details

Reference
bz61102

Event Timeline

bzimport raised the priority of this task to Needs Triage. (Nov 22 2014, 2:53 AM)
bzimport added a project: Toolforge.
bzimport set Reference to bz61102.
Bug 63878 has been marked as a duplicate of this bug.

We need to use "qsub -notify" in webservice as well.

metatron wrote:

Concerning (non) termination of php-cgi processes:

http://redmine.lighttpd.net/projects/lighttpd/wiki/Docs_ModFastCGI

There is an option "kill-signal" in .lighttpd fcgi settings.

"kill-signal": By default lighttpd send SIGTERM to FastCGI processes, which were spawned by lighttpd. Applications, which link libfcgi, need to be killed with SIGUSR1. This applies to php <5.2.1, lua-magnet and others.

I tried setting this value to 9 and also to 1. In neither case was the signal forwarded to the spawned cgi processes, while killing them by hand with 9 and 1 worked.

This (mis)behaviour seems to matter also in the case of overloaded and dying webservices, as overloaded threads/processes are /not/ terminated as they should be.

The program flow is different at the moment: On qdel, SGE kills the master lighttpd process with SIGKILL. Thus, lighttpd never has a chance to kill the php-cgi processes. So kill-signal is irrelevant at the moment.
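The point can be sketched with a small parent/child example in Python, assuming a POSIX system. This is not lighttpd's code: the sleeping children merely stand in for php-cgi, and all names are hypothetical. A parent that receives SIGTERM can run a handler and pass the signal on; a parent that receives SIGKILL cannot, because SIGKILL is never delivered to a handler.

```python
import signal
import subprocess
import sys


def spawn_workers(count):
    """Start placeholder children that just sleep (stand-ins for php-cgi)."""
    cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
    return [subprocess.Popen(cmd) for _ in range(count)]


def forward_sigterm_to(children):
    """Install a SIGTERM handler that passes the signal to every live child."""

    def handler(signum, frame):
        for child in children:
            if child.poll() is None:  # child is still running
                child.send_signal(signal.SIGTERM)

    signal.signal(signal.SIGTERM, handler)


def reap(children, timeout_s=10):
    """Wait for all children and return their exit codes."""
    return [child.wait(timeout=timeout_s) for child in children]
```

On POSIX, a child terminated by SIGTERM is reported by subprocess with a negative return code equal to the signal number, which makes the forwarding easy to verify.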

The grid has been adjusted to use SIGTERM by default now; this problem should be solved.