
Limit number of jobs users can execute in parallel
Closed, Resolved · Public

Description

In pmtpa, we limited the number of jobs that could be executed in parallel per queue to 16, IIRC. During the migration to eqiad, this seems to have been lost, and thus at the moment tools.currentevents has 86 jobs running (further jobs are only queued because the load of the exec nodes is saturated):

scfc@tools-dev:~$ qstat -u tools.currentevents | fgrep ' r ' | wc -l
86
scfc@tools-dev:~$

So we need to limit the number of jobs executed in parallel again. Last time there was some confusion about which configuration option limited what, which initially caused us to limit only the number of /pending/ jobs and delete the others, so we need to be careful about that.

Details

Reference
bz65777

Event Timeline

bzimport raised the priority of this task from to Needs Triage. Nov 22 2014, 3:24 AM
bzimport added a project: Toolforge.
bzimport set Reference to bz65777.
scfc removed coren as the assignee of this task. Apr 7 2015, 5:00 AM
scfc triaged this task as Medium priority.
scfc updated the task description. (Show Details)
scfc set Security to None.

To make this a bit less confusing, let's make this task about limiting the number of parallel tasks a user can execute and T123270 about setting up execution nodes as submit hosts. T123270 has some information on how we limited the number of parallel tasks in the past.

Some notes from the duplicate T196495: Limit ability of a single user/tool to overwhelm job grid:

The resource quota system looks like it would require us to list each user specifically. There is, however, the maxujobs global scheduler setting:

maxujobs

The maximum number of jobs any user may have running in a Sun Grid Engine cluster at the same time. If set to 0 (default) the users may run an arbitrary number of jobs.

We currently have this set in the grid config with a value of 1000, which coincidentally(?) is also the upper limit on jobs per queue. This means the limit will functionally never take effect.
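For reference, the current value can be read straight from the scheduler configuration with qconf. A minimal sketch, assuming it is run on a host that can reach the qmaster:

# maxujobs lives in the scheduler configuration; 0 means unlimited.
qconf -ssconf | grep maxujobs

# Changing it opens the scheduler configuration in $EDITOR.
qconf -msconf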

That limit apparently was set by @valhallasw for giftbot:

@Giftpflanze actually needs more than this because of their array jobs. I'm setting this to 1000 now, which should be enough even for extreme use cases. It might also still be enough to kill gridengine for jobs that are spread out (as opposed to the single giftbot queue), but at least it's better than infinite.

The custom queue for giftbot was closed in T194615: Delete tools-exec-gift-trusty-01.tools.eqiad.wmflabs and giftbot queue. That makes me think that this very high limit for that single tool is no longer necessary. It looks to me like there are two possible paths forward here:

  1. Set maxujobs to a reasonable number--possibly 16 per the initial task description here--to apply a constant limit for each user across the grid
  2. Implement a service, similar to the existing maintain-kubeusers and maintain-dbusers services, that would create a resource quota for each user/tool within the grid engine config. This would let us set limits on things other than job count (slots), like num_proc, mem_free, mem_total, etc. It would also give us a place to add per-user/per-tool limit variances.

The second option could be made more flexible and comprehensive than the first, but it requires additional development work and some amount of ongoing maintenance. I would suggest that we start with the global per-user limit of maxujobs and then reevaluate if we find a credible need for a more advanced setup.
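For the record, a resource quota set like option 2 describes would look roughly like the sketch below. The rule name, file path, and the 16-slot value are made up here; a maintain-* style service would generate and load one of these per user/tool (or a single wildcard rule) rather than relying on a hand-written file.

# Write a hypothetical resource quota set capping each user/tool at
# 16 concurrent slots, then load it into the grid with qconf.
cat > /tmp/per_user_slots.rqs <<'EOF'
{
   name         per_user_slots
   description  "Cap concurrent job slots per user/tool"
   enabled      TRUE
   limit        users {*} to slots=16
}
EOF

# -Arqs adds a resource quota set from a file; further limit lines
# for other consumables (memory, etc.) could be added the same way.
qconf -Arqs /tmp/per_user_slots.rqs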

Mentioned in SAL (#wikimedia-cloud) [2019-01-07T15:54:22Z] <bstorm_> T67777 Set stretch grid user job limit to 16

I figure the new grid can start with 16 to see how and where that creates problems at least.

I am curious if the scheduler will simply dump user jobs into a long tail of qw state if we restrict it a lot. So what I'm thinking of trying is reducing the limit on the main grid to 50 first to see what happens and then tightening it to 16.

I am curious if the scheduler will simply dump user jobs into a long tail of qw state if we restrict it a lot.

That should be what it does, yes. There is a max_u_jobs setting that puts an upper bound on the number of jobs that can be enqueued in any state, which could be used to limit qw flooding by a single tool.

Ok, so perhaps I can set that to 50 and maxujobs to 16.
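In case it is useful to whoever applies this: the two limits live in different configuration objects, so setting them looks roughly like this (a sketch; both commands open the respective configuration in $EDITOR on a host that can reach the qmaster):

# max_u_jobs (enqueued jobs per user, in any state) is in the global
# cluster configuration.
qconf -mconf global     # set: max_u_jobs  50

# maxujobs (concurrently running jobs per user) is in the scheduler
# configuration.
qconf -msconf           # set: maxujobs    16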

Mentioned in SAL (#wikimedia-cloud) [2019-01-07T17:21:11Z] <bstorm_> T67777 - set the max_u_jobs global grid config setting to 50 in the new grid

I do believe we have decided to leave these limits in place only on the new grid.

I did some testing with my user account and saw that both the concurrent and max-enqueued limits are active and working as hoped. A quick test is to do something like this:

$ for n in $(seq 1 51); do jsub -N limit-test -j yes -o $(pwd)/limit-test.log -stderr sleep 10; done
Your job 413 ("limit-test") has been submitted
Your job 414 ("limit-test") has been submitted
...
Your job 462 ("limit-test") has been submitted
Unable to run job: job rejected: only 50 jobs are allowed per user (current job count: 50)
Exiting.
$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
    430 0.25001 limit-test bd808        r     01/08/2019 03:17:35 task@tools-sgeexec-0903.tools.     1
    431 0.25001 limit-test bd808        r     01/08/2019 03:17:35 task@tools-sgeexec-0906.tools.     1
...
    445 0.25000 limit-test bd808        r     01/08/2019 03:17:35 task@tools-sgeexec-0901.tools.     1
    446 0.00000 limit-test bd808        qw    01/08/2019 03:16:58                                    1
    447 0.00000 limit-test bd808        qw    01/08/2019 03:16:58                                    1
...
    462 0.00000 limit-test bd808        qw    01/08/2019 03:16:59                                    1
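(For anyone repeating this test: the leftover jobs can be flushed afterwards with qdel; the line below assumes everything running under your user is a test job.)

# Delete all of this user's jobs once the test is done.
qdel -u $(whoami)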

I do believe we have decided to leave these limits in place only on the new grid.

Are the new settings managed by Puppet, or are they just runtime config in the cluster? If possible, I'd like to see this set in Puppet somewhere so that we can put a reference to this task alongside the setting and keep some institutional memory of why these flags are active and what they do.

The grid setup script that uses Puppet's files on NFS configures most of the functional grid environment, but not the global and scheduler configuration. Since qconf can take input from files for both of those, it is possible to build files from templates and then load them with Python, like I did for the rest of the grid. It should be pretty easy to extend the script like that.
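Roughly what that might look like, as a sketch (the temp file path and the sed step stand in for whatever templating the real configurator script would use, and it assumes qconf's file-based scheduler option behaves as documented):

# Dump the live scheduler configuration, adjust the limit, and load
# it back non-interactively.
qconf -ssconf > /tmp/sched_conf
sed -i 's/^maxujobs .*/maxujobs 16/' /tmp/sched_conf

# -Msconf replaces the scheduler configuration from a file, which is
# what lets a script (rather than an interactive editor) manage it.
qconf -Msconf /tmp/sched_conf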

I'll spawn a task for that. As it stands, the institutional memory is basically just the SAL.

bd808 assigned this task to Bstorm.

Let's call this done. I have a feeling we may end up tweaking the limits once we get more tools running on the new grid, but that can be a follow-up ticket/discussion. I'm mostly thinking that the 50 queued-jobs limit may be too aggressive (T123270#1925290).