From 7a2e51064aee24ff418f5cf0bdb00ca766c685b3 Mon Sep 17 00:00:00 2001
From: Richard Darst
Date: Mon, 6 May 2024 09:51:12 +0300
Subject: [PATCH] content/2024/how-busy-is-the-cluster: further info

---
 content/2024/how-busy-is-the-cluster.rst | 132 ++++++++++++++++++++---
 1 file changed, 119 insertions(+), 13 deletions(-)

diff --git a/content/2024/how-busy-is-the-cluster.rst b/content/2024/how-busy-is-the-cluster.rst
index 8a31faf..b704472 100644
--- a/content/2024/how-busy-is-the-cluster.rst
+++ b/content/2024/how-busy-is-the-cluster.rst
@@ -12,16 +12,19 @@ long do I have to wait?  Is there some dashboard that can tell me?
 
 The answer is, unfortunately, not so easy.  :external:doc:`Our cluster
 ` uses dynamic scheduling with a fairshare algorithm.
-All users have a priority, and jobs are ranked by priority, and
-scheduled in that order.  If there are unschedulable holes between
-those jobs, it can take a job with a lower priority and fill them in.
-Users priority decrease when they run more.  So that gives us:
+All users have a fairshare priority, which decreases the more you have
+recently run.  Jobs are ranked by priority (fairshare plus other
+factors) and scheduled in that order.  If there are unschedulable
+holes between those jobs, the scheduler can take a lower-priority job
+and fill the holes in ("backfilling").  So that gives us:
 
 - A small-enough job with a low priority might still be scheduled
   soon.
 - A higher priority user could submit something while you are waiting,
   and increase your wait time.
-- A existing job could end early, making all other wait times shorter.
+- An existing job could end early, making other wait times shorter.
+- An existing job could end early, allowing some other higher-priority
+  jobs to run sooner and making backfilled jobs run later.
 
 In short: there is no way to give an estimate of the wait time, in the
 way people want.  We've tried but haven't found a way to answer the
@@ -30,14 +33,19 @@ question well.
 
 What can we know?
 
-
 Priority comparison
 -------------------
 
-You can compare your priority with other users.  If you run ``sshare``
-you can see the shares.
+You can compare your fairshare factor with other users.  If you run
+``sshare`` you can see the fairshare (higher means higher priority).
+``sprio`` shows the relative priority of all queued jobs (each raw
+value is multiplied by a weight and they are added together).  On
+Triton, the "age" value starts at zero when a job is first submitted
+and increases to about 10000 at 7 days old, where it maxes out; the
+fairshare weight is 1e7.  The others are mostly constant.
 
-TODO: how to interpret sshare?
+Still: this is all rather abstract, and what other people submit often
+has more effect on your wait time than your own priority.
 
 This is quite cluster-dependent, so we'd recommend asking for help with
 how your own cluster is set up.
@@ -49,7 +57,8 @@ When the cluster is mostly empty
 In this case, if there is a slot for you, you are scheduled very soon.
 ``srun --test [RESOURCE_REQUESTS]`` might give you some hint about
 when a job would be scheduled - it basically tries to schedule an
-empty job
+empty job and reports the currently estimated start time.  (It uses a
+JobID though, so don't run it in a loop.)
 
 
 The cluster has a long queue
@@ -63,9 +72,106 @@ what you submit - and this is always changing.  You can tell something
 about how soon you'd be scheduled by looking at your priority relative
 to other users.  Make your jobs as small and efficient as possible to
 fit in between the holes of other jobs and get scheduled as soon as
-possible.  See the `Tetris metaphor here in TTT4HPC
-`__
+possible.  If you can break one big job into smaller pieces (less
+time, less CPU, less memory) that depend on each other, then you can
+better fit in between all of the big jobs.  See the `Tetris metaphor
+here in TTT4HPC
+`__
 
 If your need is "run stuff quickly for testing", make sure the jobs
-are as short as possible.  Ask your cluster staff about development or
-debugging partitions that may be of use.
+are as short as possible.  Hopefully your cluster has development or
+debugging partitions that may be of use, because that's the solution
+for quick tests.
+
+
+Long, older description
+-----------------------
+
+This description was in an old version of our docs but has since been
+removed, so we are pasting it here:
+
+
+Triton queues are not first-in first-out, but "fairshare".  This means
+that every person has a priority.  The more you run, the lower your
+user priority.  As time passes, your user priority increases again.
+The longer a job waits in the queue, the higher its job priority goes.
+So, in the long run (if everyone is submitting a never-ending stream
+of jobs), everyone will get exactly their share.
+
+Once there are priorities, then: jobs are scheduled in order of
+priority, and any gaps are backfilled with any smaller jobs that can
+fit in.  So small jobs usually get scheduled fast regardless.
+
+*Warning: from this point on, we get more and more technical, if you
+really want to know the details.  Summary at the end.*
+
+What's a share?  Currently shares are based on departments and their
+respective funding of Triton (``sshare``).  Shares are shared among
+everyone in the department, but each person has their own priority.
+Thus, for medium users, the 2-week usage of the rest of your
+department can affect how fast your jobs run.  However, again, things
+are balanced per-user within departments.  (Still, one heavy user in
+a department can affect all others in that department a bit too much;
+we are working on this.)
+
+Your priority goes down via the "job billing": roughly time × power.
+CPUs are billed at 1/s (but older, less powerful CPUs cost less!).
+Memory costs 0.2/GB/s.  But: you only get billed for the max of memory
+or CPU.  So if you use one CPU and all the memory (so that no one else
+can run on the node), you get billed for all the memory but no CPU.
+The same goes for all CPUs and little memory.  This encourages
+balanced use.  (This also applies to GPUs.)
+
+GPUs also have a billing weight, currently tens of times higher than a
+CPU billing weight for the newest GPUs.  (In general all of these can
+change; for the latest info, search for ``BillingWeights`` in
+``/etc/slurm/slurm.conf``.)
+
+If you submit a long job but it ends early, you are only billed for
+the actual time you use (but the longer job might take longer to start
+in the first place).  Memory is always billed for the full reservation
+even if you use less, since it isn't shared.
+
+The "user priority" is actually just a record of how much you have
+consumed lately (the billing numbers above).  This number goes down
+with a half-life decay of 2 weeks.  Your personal priority is your
+share compared to that, so we get the effect described above: the more
+you (or your department) have run lately, the lower your priority.
+
+If you want your stuff to run faster, the best way is to specify your
+time more accurately (which may let the job find a place sooner) and
+your memory more accurately (which avoids needlessly wasting your
+priority).
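+
+As a concrete illustration of that advice (the script name and job ID
+below are placeholders, and ``seff`` may not be installed on every
+cluster), you can request close to what you need and then check what
+the job actually used:
+
+.. code-block:: console
+
+   $ # request roughly what the job really needs (script name is a placeholder)
+   $ sbatch --time=02:00:00 --mem=2G --cpus-per-task=2 my_script.sh
+   $ # after it finishes, compare the request with actual usage (12345 = your job ID)
+   $ sacct -j 12345 --format=JobID,Elapsed,Timelimit,MaxRSS,ReqMem,State
+   $ seff 12345   # efficiency summary, if the seff tool is installed
+
+If the job finished far under its time limit or memory request, shrink
+the next request: smaller jobs fit into backfill holes more easily, and
+memory is billed for the full reservation whether you use it or not.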
+
+While your job is pending in the queue, Slurm checks those metrics and
+recalculates the job priority regularly.  If you are interested in the
+details, take a look at the `multifactor priority plugin
+`__ page (general info) and the `depth-oblivious fair-share factor
+`__ page for what we use specifically (warning: a very in-depth
+page).  On Triton, you can always see the latest billing weights in
+``/etc/slurm/slurm.conf``.
+
+Numerically, job priorities range from 0 to 2^32-1.  Higher means
+sooner to run, but the number doesn't mean much by itself.
+
+These commands can show you information about your user and job
+priorities:
+
+.. csv-table::
+   :delim: |
+
+   ``slurm s`` | list of jobs per user with their current priorities
+   ``slurm full`` | as above, but almost all of the job parameters are listed
+   ``slurm shares`` | displays usage (RawUsage) and current fairshare weights (FairShare, higher is better) for all users
+   ``sshare`` | raw data of the above
+   ``sprio`` | raw priority of queued jobs
+   ``slurm j <jobid>`` | shows detailed info for ``<jobid>``, including priority, requested nodes, etc.
+
+..
+   ``slurm p gpu`` | # shows partition parameters incl. Priority=
+
+
+tl;dr: Just select the resources you think you need, and Slurm tries
+to balance things out so everyone gets their share.  The best way to
+maintain high priority is to use resources efficiently so you don't
+need to over-request.
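+
+As a rough, generic sketch (the ``slurm`` commands above are Triton
+wrappers; the flags below are standard Slurm options, but check the
+man pages on your own cluster), the same information can be pulled
+from plain Slurm commands:
+
+.. code-block:: console
+
+   $ sshare -U                 # your RawUsage and FairShare (higher FairShare = higher priority)
+   $ sprio -u $USER -l         # each pending job's priority, split into age, fairshare, etc.
+   $ squeue -u $USER --start   # Slurm's current (very rough) start-time estimates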