
jstart doesn't check existence of resubmitted tasks
Closed, Declined (Public)

Description

With a jstart call in my crontab, qstat showed two copies of the same job:

job-ID  prior    name        user          state  submit/start at      queue                           slots  ja-task-ID
801291  0.32618  php_dispat  local-liange  Rr     08/28/2013 02:00:02  continuous@tools-exec-05.pmtpa  1
869600  0.26803  php_dispat  local-liange  r      08/27/2013 19:00:17  continuous@tools-exec-01.pmtpa  1

I guess this is because jstart didn't see the Rr task and started a new one.

Category  State               SGE Letter Code
Running   running             r
Running   running, re-submit  Rr
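The check this report has in mind can be sketched as a small shell function. This is a hypothetical helper (job_is_running is an invented name), not jstart's actual code: it reads qstat-style output on stdin and treats both state codes from the table above, r and Rr, as "running". It assumes the column order shown in the qstat output above (name in column 3, state in column 5) and that names are cut to 10 characters, as "php_dispat" suggests.

```shell
#!/bin/sh
# Hypothetical helper, not jstart's actual code.
# Reads qstat-style output on stdin; succeeds if a job with the given
# name is listed in state "r" or "Rr".
job_is_running() {
    # Assumption: plain qstat output cuts names to 10 characters
    # (as "php_dispat" suggests), so compare the truncated name.
    name=$(printf '%.10s' "$1")
    awk -v name="$name" '
        $3 == name && ($5 == "r" || $5 == "Rr") { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

For example, `qstat | job_is_running sleep-test && echo "already running"`. Other states are deliberately left out of this sketch; only the two codes listed above are handled.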


Version: unspecified
Severity: normal

Details

Reference
bz53629

Event Timeline

bzimport raised the priority of this task to Needs Triage. (Nov 22 2014, 2:09 AM)
bzimport added a project: Toolforge.
bzimport set Reference to bz53629.

I can't reproduce that:

scfc@tools-login:~$ echo sleep 10m > sleep-test.sh && chmod +x sleep-test.sh
scfc@tools-login:~$ jstart -N sleep-test ./sleep-test.sh
Your job 2415536 ("sleep-test") has been submitted
scfc@tools-login:~$ qstat
job-ID   prior    name        user  state  submit/start at      queue                           slots  ja-task-ID
2415536  0.25000  sleep-test  scfc  r      02/03/2014 03:03:38  continuous@tools-exec-06.pmtpa  1
scfc@tools-login:~$ qmod -rj 2415536
Pushed rescheduling of job 2415536 on host tools-exec-06.pmtpa.wmflabs
scfc@tools-login:~$ qstat
job-ID   prior    name        user  state  submit/start at      queue                           slots  ja-task-ID
2415536  0.25000  sleep-test  scfc  Rr     02/03/2014 03:04:38  continuous@tools-exec-03.pmtpa  1
scfc@tools-login:~$ jstart -N sleep-test ./sleep-test.sh
scfc@tools-login:~$ qstat
job-ID   prior    name        user  state  submit/start at      queue                           slots  ja-task-ID
2415536  0.25000  sleep-test  scfc  Rr     02/03/2014 03:04:38  continuous@tools-exec-03.pmtpa  1
scfc@tools-login:~$

So is there any other possible cause for the original issue?

There is always the possibility of a race condition; there is no locking, so if you jstart twice within a very short period of time (a few seconds), both invocations would see none running and start one; but that seems unlikely if you start with cron, unless the interval is fairly short and tools-login was *really* loaded.
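The race can be illustrated with a toy model. This is not jstart's code: a temporary file stands in for the grid's job list, and is_listed and submit are invented stand-ins for the "is it already running?" check and the submission. The two invocations are interleaved by hand so that both check before either submits.

```shell
#!/bin/sh
# Toy model of an unlocked check-then-submit sequence; NOT jstart's code.
# A temporary file stands in for the grid's job list.
jobs_file=$(mktemp)

is_listed() { grep -qx "$1" "$jobs_file"; }          # stand-in for the qstat check
submit()    { printf '%s\n' "$1" >> "$jobs_file"; }  # stand-in for submission

# Interleave two invocations, A and B: both check before either submits.
if is_listed sleep-test; then a_found=yes; else a_found=no; fi
if is_listed sleep-test; then b_found=yes; else b_found=no; fi
if [ "$a_found" = no ]; then submit sleep-test; fi
if [ "$b_found" = no ]; then submit sleep-test; fi

copies=$(grep -cx sleep-test "$jobs_file")
rm -f "$jobs_file"
echo "copies started: $copies"   # prints "copies started: 2"
```

Because neither invocation sees the other's submission at check time, both go ahead, which is the duplicate-job outcome described in this report.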

(In reply to comment #3)

> There is always the possibility of a race condition; there is no locking, so if you jstart twice within a very short period of time (a few seconds), both invocations would see none running and start one; but that seems unlikely if you start with cron, unless the interval is fairly short and tools-login was *really* loaded.

That cron entry is "0/10 * * * * $HOME/mw/startLabsDispatchRC.sh". Is the interval too short?

Also, do you think a bug report about the lack of locking would be a good one (i.e. it would not be WONTFIXed)?

10 minutes seems long enough that I'm really surprised this could have happened at all; I might have expected it to happen with an interval in the 1-2 minute range at most.

Locking would be a reasonable added safeguard, and it would remain useful even when cron gets replaced, but it has a few implementation gotchas that will be tricky to get right. Nevertheless, having a bug for it would not be a bad thing.
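A minimal sketch of what such a safeguard might look like, assuming a wrapper around the existing call rather than a change to jstart itself. with_lock is a hypothetical function and the lock path in the usage note is invented; the sketch relies on mkdir either creating the directory or failing, so only one invocation proceeds at a time.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of jstart: run a command only if a
# lock directory can be created, and remove the lock afterwards.
with_lock() {
    lockdir=$1
    shift
    if mkdir "$lockdir" 2>/dev/null; then
        status=0
        "$@" || status=$?
        rmdir "$lockdir"
        return "$status"
    fi
    echo "another invocation holds $lockdir; skipping" >&2
    return 75   # arbitrary nonzero code meaning "lock was held"
}
```

It could be used as `with_lock "$HOME/.dispatch.lock" "$HOME/mw/startLabsDispatchRC.sh"`. One of the gotchas shows up immediately: if the wrapped command is killed before rmdir runs, the stale lock directory blocks every later invocation until it is removed by hand.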