Normal jobs sometimes run on tools-webgrid-tomcat.eqiad.wmflabs
Closed, Invalid, Public

Description

nl:User:Valhallasw-toolserver-botje runs from a crontab on tools-login under the nlwikibots service group:

0 * * * * qsub $HOME/bin/tvpupdater > /dev/null

The job is queued hourly, but it only edits pages on nlwiki at midnight local time (Europe/Amsterdam). This currently corresponds to 23:00 UTC, and will shift to 22:00 UTC in the near future.
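
A minimal sketch of the "queued hourly, acts only at local midnight" check described above. The helper name `is_local_midnight` is hypothetical and is not taken from the actual bot; the optional second argument (an epoch timestamp) exists only for testing and assumes a `date` that accepts `-d @EPOCH`, such as GNU date.

```shell
#!/bin/sh
# Hypothetical helper, not part of the real bot.
# $1: tz database zone name, e.g. Europe/Amsterdam
# $2: optional epoch seconds to check instead of "now" (assumes `date -d @N` works)
# Returns success only when the local hour in that zone is 00.
is_local_midnight() {
    zone=$1
    if [ -n "$2" ]; then
        hour=$(TZ="$zone" date -d "@$2" +%H)
    else
        hour=$(TZ="$zone" date +%H)
    fi
    [ "$hour" = "00" ]
}
```

A wrapper could then start with `is_local_midnight Europe/Amsterdam || exit 0`, so that the 23 other hourly runs exit without editing anything.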

In the edit message, the bot reports the host at which it is currently running:
https://nl.wikipedia.org/wiki/Speciaal:Bijdragen/Valhallasw-toolserver-botje

Expected behavior would be running from the tools-exec-* hosts, but the bot often runs from the tools-webgrid-tomcat host.

$HOME/bin/tvpupdater sets several SGE parameters:

#$ -l h_rt=0:30:00 # max runtime
#$ -l virtual_free=25M # max memory use, excluding shared libs, toolserver
#$ -l h_vmem=256M # max memory use, including shared libs, Tools Labs
#$ -l arch=* # may run on both Linux and Solaris
#$ -N tvpupdater-valhallasw # job name, ends in the owner's name
#$ -M valhallasw@arctus.nl
#$ -m a # only send mail on abort (e.g. when the runtime limit is exceeded)
#$ -b y # run from the network drive instead of copying the file
#$ -o /dev/null # output to /dev/null
#$ -e $HOME/tvpupdater-valhallasw.err

It then changes directory and invokes ~/bots/tvpupdater/runbot, which activates a virtualenv and, in turn, starts the actual bot script.

I have been able to get two job numbers for runs on the tomcat hosts. Their qacct data is shown below.

qacct -j 40719

qname webgrid-tomcat
hostname tools-webgrid-tomcat.eqiad.wmflabs
group tools.nlwikibots
owner tools.nlwikibots
project NONE
department defaultdepartment
jobname tvpupdater-valhallasw
jobnumber 40719
taskid undefined
account sge
priority 0
qsub_time Mon Mar 17 23:00:02 2014
start_time Mon Mar 17 23:00:13 2014
end_time Mon Mar 17 23:00:45 2014
granted_pe NONE
slots 1
failed 0
exit_status 0
ru_wallclock 32
ru_utime 0.196
ru_stime 0.056
ru_maxrss 16456
ru_ixrss 0
ru_ismrss 0
ru_idrss 0
ru_isrss 0
ru_minflt 12120
ru_majflt 0
ru_nswap 0
ru_inblock 8
ru_oublock 32
ru_msgsnd 0
ru_msgrcv 0
ru_nsignals 0
ru_nvcsw 639
ru_nivcsw 78
cpu 0.252
mem 0.019
io 0.002
iow 0.000
maxvmem 209.957M
arid undefined

qacct -j 64375

qname webgrid-tomcat
hostname tools-webgrid-tomcat.eqiad.wmflabs
group tools.nlwikibots
owner tools.nlwikibots
project NONE
department defaultdepartment
jobname tvpupdater-valhallasw
jobnumber 64375
taskid undefined
account sge
priority 0
qsub_time Fri Mar 21 23:00:02 2014
start_time Fri Mar 21 23:00:17 2014
end_time Fri Mar 21 23:00:43 2014
granted_pe NONE
slots 1
failed 0
exit_status 0
ru_wallclock 26
ru_utime 0.216
ru_stime 0.080
ru_maxrss 16444
ru_ixrss 0
ru_ismrss 0
ru_idrss 0
ru_isrss 0
ru_minflt 12106
ru_majflt 2
ru_nswap 0
ru_inblock 64
ru_oublock 32
ru_msgsnd 0
ru_msgrcv 0
ru_nsignals 0
ru_nvcsw 506
ru_nivcsw 171
cpu 0.296
mem 0.021
io 0.002
iow 0.000
maxvmem 209.945M
arid undefined
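
To spot further runs on the wrong host without reading full records, qacct output like the above can be reduced to one line per job. This is a sketch only: `summarise_hosts` is a hypothetical helper, and it assumes that `hostname` appears before `jobnumber` within each record, as it does in the two records above.

```shell
#!/bin/sh
# Hypothetical helper: read qacct-style records on stdin and
# print "jobnumber hostname" for each job.
summarise_hosts() {
    awk '
        $1 == "hostname"  { host = $2 }      # remember the host of the current record
        $1 == "jobnumber" { print $2, host } # emit once the job number is seen
    '
}

# Demo using fields copied from the first record above.
summarise_hosts <<'EOF'
qname webgrid-tomcat
hostname tools-webgrid-tomcat.eqiad.wmflabs
jobname tvpupdater-valhallasw
jobnumber 40719
EOF
# prints: 40719 tools-webgrid-tomcat.eqiad.wmflabs
```

On the grid this could be fed from `qacct -j tvpupdater-valhallasw`, since qacct accepts a job name as well as a job number.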


Version: unspecified
Severity: minor

Details

Reference
bz62942

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 2:52 AM
bzimport added a project: Toolforge.
bzimport set Reference to bz62942.

Job #190385 seems to be somehow related (after "sudo qmod -cj 190385" in tools-master's /var/spool/gridengine/qmaster/messages):

05/12/2014 22:59:51|worker|tools-master|W|job 190385.1 failed on host tools-webgrid-tomcat.eqiad.wmflabs general searching requested shell because: 05/12/2014 22:59:50 [3838:26265]: execvp(/var/spool/gridengine/execd/tools-webgrid-tomcat/job_scripts/190385, "/var/spool/gridengine/execd/tools-webgrid-tomcat/job_scripts/190385") failed: No such file or directory
05/12/2014 22:59:51|worker|tools-master|W|rescheduling job 190385.1

I don't know whether the job would start on a normal exec node (if this were a systematic error affecting all jobs, users would be shouting *very* loudly), but the coincidence is certainly interesting.

Sorry it took so long to figure this out. The error was so obvious that it was staring us in the face, and it took a fresh look to notice it:

Your job script does not specify a requested queue. You would normally want to add

#$ -q task

to it. Otherwise, the grid will pick any compatible queue and, since you are making no extraordinary demands, most queues will fit.
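
As a rough illustration of how this kind of omission could be caught earlier, the sketch below checks whether a job script contains an embedded `#$ -q <queue>` directive. The helper name `has_queue_directive` is hypothetical, and the check is only a heuristic: it does not see a queue passed on the command line (for example `qsub -q task $HOME/bin/tvpupdater` in the crontab).

```shell
#!/bin/sh
# Hypothetical lint helper: succeeds if the job script ($1) contains
# an embedded SGE directive of the form "#$ -q <queue>".
has_queue_directive() {
    grep -Eq '^#\$[[:space:]]+-q[[:space:]]+[^[:space:]]+' "$1"
}

# Example use before submitting (path taken from the report above):
#   has_queue_directive "$HOME/bin/tvpupdater" || echo "warning: no queue requested" >&2
```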

Doh! Yes, that makes sense. Added it now, so tonight it should run on a normal host.