
create grid node for checking weblinks
Closed, Resolved (Public)

Description

I need a grid node for checking weblinks in the German Wikipedia article namespace, with sufficient memory to check 200 links in parallel with an array job at an interval of two weeks. See http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-labs/20131123.txt for the calculation. The node should be named tools-exec-giftbot-01.
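
For illustration, such a run maps onto a single gridengine array job rather than 200 individual submissions; a minimal sketch, where the script name, memory request and queue are assumptions and not part of this request:

qsub -t 1-200 -q task -N giftbot-weblinks -l h_vmem=256M check_links.sh

Each of the 200 array tasks would then check its own slice of the links, subject to the per-user concurrency limits discussed below.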


Version: unspecified
Severity: normal

Details

Reference
bz58949

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 2:25 AM
bzimport added a project: Toolforge.
bzimport set Reference to bz58949.

Reading the IRC log, I don't quite understand why you need a *node* of your own. Apparently, you want to run 200 jobs in parallel, and the problem is the 12 concurrent jobs/user limit. So you really want to have the limit for your bot raised to 200?

I ask because the grid isn't really saturated; http://ganglia.wmflabs.org/latest/graph_all_periods.php?c=tools&h=tools-master&v=0&m=sge_pending&r=hour&z=default&jr=&js=&st=1387968592&z=large shows that the number of pending jobs is almost always 0.

(In reply to comment #1)

Reading the IRC log, I don't quite understand why you need a *node* of your own. Apparently, you want to run 200 jobs in parallel, and the problem is the 12 concurrent jobs/user limit. So you really want to have the limit for your bot raised to 200?

[...]

Just checked: currently the limit seems to be defined by:

scfc@tools-login:~$ qconf -srqs
{
name jobs
description NONE
enabled FALSE
limit users {*} queues {continuous,task} to jobs=16
}
scfc@tools-login:~$

*but* that rule set a) is "enabled FALSE" and b) the grid apparently allows *32* jobs per user even in one queue (tested with "for NR in {1..100}; do qsub -q task -b y sleep 1m; done").
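
As a side note, one way to count how many of those test jobs actually reach the running state, assuming the standard gridengine client tools, would be:

qstat -u "$USER" -s r | tail -n +3 | wc -l

This lists only running jobs and skips the two qstat header lines.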

I changed "enabled" to "TRUE" and added a first rule:

scfc@tools-login:~$ sudo qconf -srqs
{
name jobs
description NONE
enabled TRUE
limit users scfc to jobs=200
limit users {*} queues {continuous,task} to jobs=16
}
scfc@tools-login:~$

But I was still only able to launch 32 jobs, so I changed it back.

Further digging brought up:

scfc@tools-login:~$ qconf -ssconf
[...]
maxujobs 32
[...]

Ah! I'll test how to set per-user quotas over the next few days.
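
For reference, the 32-job cap above comes from the scheduler configuration rather than from the resource quota sets; a sketch of the change that would lift it (qconf -msconf opens the scheduler configuration in an editor; this is not what was actually applied):

maxujobs 0

A maxujobs value of 0 removes the global per-user cap, leaving the resource quota sets as the only limit.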

It was Coren's idea to use a dedicated node, so that the number of jobs would be unlimited. If you really want to raise the tool's limit instead, please raise it to 200 plus the usual 32, i.e. to 232: I have other scripts that need to run and should not be delayed by a 10-day period of blocked grid resources.

The reason why I'd dedicate a node to this is that, for 200 jobs sharing executables, the VMEM-based resource allocation would /vastly/ overcommit and clog the normal nodes (VMEM-based functions on a worst-case memory footprint basis, which we /know/ is not the case when you control the executable).

Compare tools-webgrid-01 where a *lot* of jobs are running with large VMEM limits; this is possible because we know that the lighttpd footprint is shared between every job.
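
As a rough illustration with assumed numbers (not taken from this task): 200 tasks each requesting h_vmem=500M would reserve about 100G of virtual memory in the scheduler's bookkeeping, even though the resident footprint is much smaller when all tasks run the same executable and shared libraries; on a dedicated node this overcommit only affects the bot itself.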

(In reply to comment #4)

The reason why I'd dedicate a node to this is that, for 200 jobs sharing executables, the VMEM-based resource allocation would /vastly/ overcommit and clog the normal nodes (VMEM-based functions on a worst-case memory footprint basis, which we /know/ is not the case when you control the executable).

Compare tools-webgrid-01 where a *lot* of jobs are running with large VMEM limits; this is possible because we know that the lighttpd footprint is shared between every job.

In this case, you're right :-).

Queue 'gift' was created with a medium instance and is accessible to local-giftbot.
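
With the dedicated queue in place, the bot can target it directly; a minimal sketch, reusing the assumed script name from above:

qsub -q gift -t 1-200 -N giftbot-weblinks check_links.sh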