
create grid node for checking weblinks
Closed, Resolved (Public)

Description

I need a grid node for checking weblinks in the German Wikipedia article namespace, with sufficient memory to check 200 links in parallel with an array job at an interval of two weeks. See http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-labs/20131123.txt for the calculation. The node should be named tools-exec-giftbot-01.
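
For illustration, such a run maps onto a single gridengine array job rather than 200 individual submissions; a minimal sketch, where the script name, memory request and queue are assumptions and not part of this request:

qsub -t 1-200 -q task -N giftbot-weblinks -l h_vmem=256M check_links.sh

Each of the 200 array tasks would then check its own slice of the links, subject to the per-user concurrency limits discussed below.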


Version: unspecified
Severity: normal

Details

Reference
bz58949

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 2:25 AM
bzimport added a project: Toolforge.
bzimport set Reference to bz58949.

Reading the IRC log, I don't quite understand why you need a *node* of your own. Apparently, you want to run 200 jobs in parallel, and the problem is the 12 concurrent jobs/user limit. So you really want to have the limit for your bot raised to 200?

I ask because the grid isn't really saturated; http://ganglia.wmflabs.org/latest/graph_all_periods.php?c=tools&h=tools-master&v=0&m=sge_pending&r=hour&z=default&jr=&js=&st=1387968592&z=large shows that the number of pending jobs is almost always 0.

(In reply to comment #1)

Reading the IRC log, I don't quite understand why you need a *node* of your own. Apparently, you want to run 200 jobs in parallel, and the problem is the 12 concurrent jobs/user limit. So you really want to have the limit for your bot raised to 200?

[...]

Just checked: currently the limit seems to be defined by:

scfc@tools-login:~$ qconf -srqs
{
name jobs
description NONE
enabled FALSE
limit users {*} queues {continuous,task} to jobs=16
}
scfc@tools-login:~$

*but* that rule set a) is "enabled FALSE" and b) the grid apparently allows *32* jobs per user even in one queue (tested with "for NR in {1..100}; do qsub -q task -b y sleep 1m; done").
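
As a side note, one way to count how many of those test jobs actually reach the running state, assuming the standard gridengine client tools, would be:

qstat -u "$USER" -s r | tail -n +3 | wc -l

This lists only running jobs and skips the two qstat header lines.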

I changed "enabled" to "TRUE" and added a first rule:

scfc@tools-login:~$ sudo qconf -srqs
{
name jobs
description NONE
enabled TRUE
limit users scfc to jobs=200
limit users {*} queues {continuous,task} to jobs=16
}
scfc@tools-login:~$

But I was still only able to launch 32 jobs, so I changed it back.

Further digging brought up:

scfc@tools-login:~$ qconf -ssconf
[...]
maxujobs 32
[...]

Ah! I'll test how to set per-user quotas over the next few days.
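
For reference, the 32-job cap above comes from the scheduler configuration rather than from the resource quota sets; a sketch of the change that would lift it (qconf -msconf opens the scheduler configuration in an editor; this is not what was actually applied):

maxujobs 0

A maxujobs value of 0 removes the global per-user cap, leaving the resource quota sets as the only limit.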

It was Coren's idea to use a dedicated node, so that the number of jobs would be unlimited. If you really want to raise the tool's limit instead, please raise it to 200 plus the usual 32, i.e. to 232: I have other scripts that need to run and should not be delayed by a 10-day period of blocked grid resources.

The reason why I'd dedicate a node to this is that, for 200 jobs sharing executables, the VMEM-based resource allocation would /vastly/ overcommit and clog the normal nodes (VMEM-based functions on a worst-case memory footprint basis, which we /know/ is not the case when you control the executable).

Compare tools-webgrid-01 where a *lot* of jobs are running with large VMEM limits; this is possible because we know that the lighttpd footprint is shared between every job.
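
As a rough illustration with assumed numbers (not taken from this task): 200 tasks each requesting h_vmem=500M would reserve about 100G of virtual memory in the scheduler's bookkeeping, even though the resident footprint is much smaller when all tasks run the same executable and shared libraries; on a dedicated node this overcommit only affects the bot itself.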

(In reply to comment #4)

The reason why I'd dedicate a node to this is that, for 200 jobs sharing executables, the VMEM-based resource allocation would /vastly/ overcommit and clog the normal nodes (VMEM-based functions on a worst-case memory footprint basis, which we /know/ is not the case when you control the executable).

Compare tools-webgrid-01 where a *lot* of jobs are running with large VMEM limits; this is possible because we know that the lighttpd footprint is shared between every job.

In this case, you're right :-).

Queue 'gift' was created with a medium instance and is accessible to local-giftbot.
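
With the dedicated queue in place, the bot can target it directly; a minimal sketch, reusing the assumed script name from above:

qsub -q gift -t 1-200 -N giftbot-weblinks check_links.sh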