content/2024/how-busy-is-the-cluster: further info
rkdarst committed May 6, 2024
1 parent 2301976 commit 7a2e510
Showing 1 changed file with 119 additions and 13 deletions.
132 changes: 119 additions & 13 deletions content/2024/how-busy-is-the-cluster.rst
@@ -12,16 +12,19 @@ long do I have to wait? Is there some dashboard that can tell me?

The answer is, unfortunately, not so easy. :external:doc:`Our cluster
<triton/index>` uses dynamic scheduling with a fairshare algorithm.
All users have a priority, and jobs are ranked by priority, and
scheduled in that order. If there are unschedulable holes between
those jobs, it can take a job with a lower priority and fill them in.
Users priority decrease when they run more. So that gives us:
All users have a fairshare priority, which decreases the more you have
recently run. Jobs are ranked by priority (including fairshare plus
other factors), and scheduled in that order. If there are
unschedulable holes between those jobs, it can take a job with a lower
priority and fill them in ("backfilling"). So that gives us:

- A small-enough job with a low priority might still be scheduled
soon.
- A higher priority user could submit something while you are waiting,
and increase your wait time.
- A existing job could end early, making all other wait times shorter.
- An existing job could end early, making other wait times shorter.
- An existing job could end early, allowing some other higher priority
jobs to run sooner, making backfilled jobs run later.
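The backfilling rule above can be illustrated with a toy model. This is only a sketch under strong assumptions (a single pool of CPUs, one top-priority job holding a reservation, exact run times); the function names ``earliest_start`` and ``can_backfill`` are made up for illustration and are not part of Slurm:

```python
def earliest_start(free_now, running, need):
    """Earliest time at which `need` CPUs are free.

    `running` is a list of (end_time, cpus) tuples for jobs already running;
    times are hours from now.
    """
    if need <= free_now:
        return 0
    free = free_now
    for end, cpus in sorted(running):
        free += cpus  # these CPUs are released when the job ends
        if free >= need:
            return end
    raise ValueError("job can never fit on this (toy) cluster")


def can_backfill(free_now, running, top_need, small_need, small_time):
    """True if a lower-priority job can start now without delaying the top job.

    Conservative toy rule: the small job must fit in the currently free CPUs
    and finish before the top-priority job's reserved start time.
    """
    if small_need > free_now:
        return False
    top_start = earliest_start(free_now, running, top_need)
    return small_time <= top_start


# 8 CPUs in total: 6 are busy for another 10 hours, 2 are free now.
running = [(10, 6)]
# The top-priority job needs all 8 CPUs, so it is reserved to start at hour 10.
print(earliest_start(2, running, 8))       # 10
# A 2-CPU, 5-hour job fits in the hole; a 12-hour one would delay the top job.
print(can_backfill(2, running, 8, 2, 5))   # True
print(can_backfill(2, running, 8, 2, 12))  # False
```

In this model, if the 6-CPU job ends early, the top job's start time moves earlier and the hole shrinks, which is how a backfilled job can end up running later.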

In short: there is no way to give an estimate of the wait time, in the
way people want. We've tried but haven't found a way to answer the
@@ -30,14 +33,19 @@ question well.
What can we know?



Priority comparison
-------------------

You can compare your priority with other users. If you run ``sshare``
you can see the shares.
You can compare your fairshare factor with other users. If you run
``sshare`` you can see the fairshare (higher means higher priority).
``sprio`` shows the relative priority of all jobs (here, the raw values
are multiplied by some factor and added). On Triton, the "age" value
is "1e4 × (time_in_queue/7day)" (zero when first submitted, increasing
to 10000 at 7 days old, where it maxes out), and the fairshare value
is the fairshare factor × 1e7. The others are mostly constant.

TODO: how to interpret sshare?
Still: this is all very abstract and what others submit has more
effect than your priority.

This is quite cluster dependent so we'd recommend asking for help for
how your own cluster is set up.
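As a rough illustration of how the age and fairshare terms combine, here is a minimal sketch. It assumes the age term rises linearly from zero to 10000 over 7 days, that the fairshare factor lies between 0 and 1 and is multiplied by 1e7, and that all other terms are ignored. The function names and default weights are illustrative, not read from any cluster's configuration:

```python
SEVEN_DAYS = 7 * 24 * 3600  # assumed maximum age, in seconds


def age_factor(seconds_in_queue, max_age=SEVEN_DAYS):
    """Fraction of the maximum age reached so far, capped at 1.0."""
    return min(seconds_in_queue / max_age, 1.0)


def job_priority(seconds_in_queue, fairshare_factor,
                 weight_age=10_000, weight_fairshare=10_000_000):
    """Weighted sum of the age and fairshare terms (other terms omitted)."""
    return (weight_age * age_factor(seconds_in_queue)
            + weight_fairshare * fairshare_factor)


# A freshly submitted job from a user with fairshare factor 0.5:
print(job_priority(0, 0.5))
# The same job after a full week in the queue gains only the small age term:
print(job_priority(SEVEN_DAYS, 0.5))
```

With these assumed weights, a small difference in fairshare factor outweighs a full week of waiting, which matches the point that fairshare and what others submit dominate.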
@@ -49,7 +57,8 @@ When the cluster is mostly empty
In this case, if there is a slot for you, you are scheduled very soon.
``srun --test-only [RESOURCE_REQUESTS]`` might give you some hint about
when a job would be scheduled - it basically tries to schedule an
empty job
empty job and reports the currently estimated start time. (It uses up a
JobID though, so don't run it in a loop.)


The cluster has a long queue
@@ -63,9 +72,106 @@ what you submit - and this is always changing. You can tell something
about how soon you'd be scheduled by looking at your priority relative
to other users. Make your jobs as small and efficient as possible to
fit in between the holes of other jobs and get scheduled as soon as
possible. See the `Tetris metaphor here in TTT4HPC
possible. If you can break one big job into smaller pieces (less
time, less CPU, less memory) that depend on each other, then you can
better fit in between all of the big jobs. See the `Tetris metaphor
here in TTT4HPC
<https://coderefinery.github.io/TTT4HPC_resource_management/scheduling/>`__

If your need is "run stuff quickly for testing", make sure the jobs
are as short as possible. Ask your cluster staff about development or
debugging partitions that may be of use.
are as short as possible. Hopefully, your cluster staff provide
development or debugging partitions that may be of use, because that's
the solution for quick tests.


Long, older description
-----------------------

This description was in an old version of our docs but has since been
removed. So pasting here:


Triton queues are not first-in first-out, but "fairshare". This means
that every person has a priority. The more you run the lower your
user priority. As time passes, your user priority increases again.
The longer a job waits in the queue, the higher its job priority goes.
So, in the long run (if everyone is submitting a never-ending stream
of jobs), everyone will get exactly their share.

Once there are priorities, then: jobs are scheduled in order of
priority, then any gaps are backfilled with any smaller jobs that can
fit in. So small jobs usually get scheduled fast regardless.

*Warning: from this point on, we get more and more technical, if you
really want to know the details. Summary at the end.*

What's a share? Currently shares are based on department and their
respective funding of Triton (``sshare``). Shares are shared among
everyone in the department, but each person has their own priority.
Thus, for medium users, the 2-week usage of the rest of your
department can affect how fast your jobs run. However, again, things
are balanced per-user within departments. (However, one heavy user in
a department can affect all others in that department a bit too much;
we are working on this.)

Your priority goes down via the "job billing": roughly time×power.
CPUs are billed at 1/s (but older, less powerful CPUs cost less!).
Memory costs 0.2/GB/s. But: you only get billed for the max of memory
or CPU. So if you use one CPU and all the memory (so that no one else
can run on it), you get billed for all memory but no CPU. Same for
all CPUs and little memory. This encourages balanced use. (This also
applies to GPUs.)
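The "max of memory or CPU" rule can be written out as a small sketch. The weights are the ones quoted above (1 per CPU-second, 0.2 per GB-second); GPUs and the cheaper older CPUs are left out, and the function names are made up for illustration:

```python
def billing_rate(cpus, mem_gb, cpu_weight=1.0, mem_weight_per_gb=0.2):
    """Billing units per second: the larger of the CPU and memory charges."""
    return max(cpus * cpu_weight, mem_gb * mem_weight_per_gb)


def job_billing(cpus, mem_gb, seconds):
    """Total billing for a job holding these resources for `seconds`."""
    return billing_rate(cpus, mem_gb) * seconds


# 1 CPU with 2 GB: the CPU charge (1.0) exceeds the memory charge (0.4).
print(billing_rate(1, 2))
# 1 CPU with 100 GB: the memory charge now dominates the CPU charge.
print(billing_rate(1, 100))
# 4 CPUs with 20 GB is balanced: both charges are the same.
print(billing_rate(4, 20))
```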

GPUs also have a billing weight, currently tens of times higher than a
CPU billing weight for the newest GPUs. (In general all of these can
change; for the latest info, search for ``BillingWeights`` in
``/etc/slurm/slurm.conf``.)

If you submit a long job but it ends early, you are only billed for
the actual time you use (but the longer job might take longer to start
at the beginning). Memory is always billed for the full reservation
even if you use less, since it isn't shared.

The "user priority" is actually just a record of how much you have
consumed lately (the billing numbers above). This number goes down
with a half-life decay of 2 weeks. Your personal priority is your share
compared to that, so we get the effect described above: the more you
(or your department) have run lately, the lower your priority.
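The half-life decay can be sketched in a few lines. This assumes a plain exponential decay with a 14-day half-life and no new usage in the meantime; ``decayed_usage`` is an illustrative name, not a Slurm function:

```python
def decayed_usage(raw_usage, days_elapsed, half_life_days=14.0):
    """Recorded usage remaining after `days_elapsed` days of decay."""
    return raw_usage * 0.5 ** (days_elapsed / half_life_days)


# 1000 billing units recorded today count as half that after two weeks
# and a quarter after four weeks.
for days in (0, 14, 28):
    print(days, decayed_usage(1000, days))
```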

If you want your stuff to run faster, the best way is to more
accurately specify your time (so that the job may find a place
sooner) and memory (avoids needlessly wasting your priority).

While your job is pending in the queue, SLURM checks those metrics
regularly and recalculates job priority constantly. If you are
interested in details, take a look at the `multifactor priority plugin
<https://slurm.schedmd.com/priority_multifactor.html>`__ page (general
info) and `depth-oblivious fair-share factor
<https://slurm.schedmd.com/priority_multifactor3.html>`__ for what we
use specifically (warning: very in-depth page). On Triton, you can
always see the latest billing weights in ``/etc/slurm/slurm.conf``.

Numerically, job priorities range from 0 to 2^32-1. Higher is
sooner to run, but really the number doesn't mean much itself.

These commands can show you information about your user and job
priorities:

.. csv-table::
   :delim: |

   ``slurm s`` | list of jobs per user with their current priorities
   ``slurm full`` | as above but almost all of the job parameters are listed
   ``slurm shares`` | displays usage (RawUsage) and current FairShare weights (FairShare, higher is better) values for all users
   ``sshare`` | Raw data of the above
   ``sprio`` | Raw priority of queued jobs
   ``slurm j <jobid>`` | shows ``<jobid>`` detailed info including priority, requested nodes etc.

..
   ``slurm p gpu`` | # shows partition parameters incl. Priority=

tl;dr: Just select the resources you think you need, and slurm
tries to balance things out so everyone gets their share. The best
way to maintain high priority is to use resources efficiently so you
don't need to over-request.
