A tool to display the longest tenant job on Slurm cluster nodes

longest-job

A quick hack tool to show when busy cluster nodes could become free (i.e. when the last running Slurm job on each node exits).

By default, it reports the high-water-mark job (the one that ends last) on every cluster node that has at least one running job (it calls squeue --states=R under the hood). A handful of familiar squeue options can further limit reporting to certain nodes (by partition, account, QoS, etc.). Alternatively, explicit Slurm-style nodelists can be passed as arguments.
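The per-node "last job to exit" idea can be illustrated with a minimal Python sketch. This is not the tool's actual code: the function name `longest_per_node`, the row layout, and the sample job IDs and node names are all hypothetical, and the rows are assumed to have been gathered already (for instance from something like `squeue --states=R --noheader --format="%i|%e|%N"`, with each nodelist expanded by `scontrol show hostnames`).

```python
from datetime import datetime


def longest_per_node(rows):
    """Return {node: (job_id, end_time)}, keeping the latest-ending job per node.

    rows is an iterable of (job_id, end_time_iso, nodes), where nodes is a
    list of already-expanded node names (an assumption of this sketch).
    """
    latest = {}
    for job_id, end_iso, nodes in rows:
        end = datetime.fromisoformat(end_iso)
        for node in nodes:
            # Keep this job only if it ends later than what we have for the node.
            if node not in latest or end > latest[node][1]:
                latest[node] = (job_id, end)
    return latest


# Hypothetical sample data: job 101 spans two nodes, 102 and 103 one each.
rows = [
    ("101", "2021-06-25T21:48:04", ["node001", "node002"]),
    ("102", "2021-06-26T06:34:30", ["node002"]),
    ("103", "2021-06-25T20:00:00", ["node001"]),
]

for node, (job_id, end) in sorted(longest_per_node(rows).items()):
    print(f"{node:<12}  {job_id:<12}  {end.isoformat()}")
```

A node that appears in no row never enters the dictionary, so it produces no output.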

Usage:

    longest-job [-h|--help] [OPTIONS] [nodelist(s)]

General options:

  -t, --time     Sort output by job end time (default is to sort by node name).
  -v, --verbose  Be verbose (print more job information for each job).
      --quiet    Be really quiet (suppress header and non-essential output).
  -V, --version  Print program version and exit.
  -h, --help     Display this help message and exit.

Recognized squeue-style filtering options (passed to the underlying squeue call verbatim, see man squeue for further details):

  -A, --account=<account_list>
  -j, --jobs=<job_id_list>
  -L, --licenses=<license_list>
  -M, --clusters=<clusters_list>
  -n, --name=<name_list>
  -p, --partition=<part_list>
  -q, --qos=<qos_list>
  -R, --reservation=<reservation_name>
  -u, --user=<user_list>
  -w, --nodelist=<hostlist>

In true squeue fashion, multiple filters are AND-ed. If both the -w hostlist flag and explicit node names ($1...$n) are given, the explicit ones win.
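One plausible reading of these precedence rules, sketched in Python. The helper `build_squeue_args` and its parameters are hypothetical and say nothing about how the tool is actually implemented. The sketch assumes that filters are passed straight through as long options and that explicit node names, when present, simply replace the `-w` value.

```python
def build_squeue_args(filters, w_hostlist=None, positional=()):
    """Assemble an squeue argument list (illustrative only).

    filters:    dict of long option name -> value, e.g. {"partition": "bell-b"}
    w_hostlist: value given via -w/--nodelist, if any
    positional: explicit nodelists given as $1...$n
    """
    args = ["squeue", "--states=R"]
    # Every filter is passed along; squeue itself AND-s them together.
    for flag, value in filters.items():
        args.append(f"--{flag}={value}")
    # Assumption: explicit nodelists override -w rather than merge with it.
    nodes = ",".join(positional) if positional else w_hostlist
    if nodes:
        args.append(f"--nodelist={nodes}")
    return args


print(build_squeue_args({"partition": "bell-b"},
                        w_hostlist="bell-b[000-003]",
                        positional=["bell-b001"]))
```

Here the `-w` value `bell-b[000-003]` is dropped in favour of the explicit `bell-b001`, while the partition filter still applies.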

As a special case, nodelists in $1...$n can be absolute paths to files that list hostnames. Cue this excerpt from man scontrol:

scontrol show hostlist can also take the absolute pathname of a file (beginning with the character '/') containing a list of hostnames.
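The file-path special case can be illustrated with a pure-Python stand-in. It does not call scontrol. The helper `read_nodelist_arg` is hypothetical, and it assumes the file holds one hostname per line.

```python
import os
import tempfile


def read_nodelist_arg(arg):
    """Return the node specs behind one positional argument (illustrative).

    An argument starting with '/' is treated as a file of hostnames,
    assumed here to hold one name per line; anything else is returned
    unchanged as a Slurm-style nodelist.
    """
    if arg.startswith("/"):
        with open(arg) as fh:
            return [line.strip() for line in fh if line.strip()]
    return [arg]


# Demonstration with a throwaway file and hypothetical node names.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("node001\nnode002\n")
    path = tmp.name

nodes = read_nodelist_arg(path)
os.unlink(path)
print(nodes)
```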

Note: if a node has no running jobs (e.g. it is idle or offlined), no output is generated for it, even if the node was specified explicitly, because "no job" means "no longest job either".

Example

By node in a given partition:

$ longest-job --partition bell-b | head -3
NODE          JobID         END_TIME
bell-b000     3640405       2021-06-26T11:52:00
bell-b001     3606031       2021-06-25T21:48:04
bell-b002     3598003       2021-06-26T06:34:30

Or sorted by job end time, earliest first:

$ longest-job --partition bell-b -t | head -3
NODE          JobID         END_TIME
bell-b001     3606031       2021-06-25T21:48:04
bell-b004     3606035       2021-06-25T21:52:04
bell-b002     3598003       2021-06-26T06:34:30
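The two orderings can be reproduced on the three rows of the first example with a short Python sketch. It is illustrative only and not necessarily how the tool sorts. It relies on timestamps in this fixed `YYYY-MM-DDTHH:MM:SS` form sorting chronologically when compared as plain strings.

```python
# Rows copied from the first example above: (node, job ID, end time).
rows = [
    ("bell-b000", "3640405", "2021-06-26T11:52:00"),
    ("bell-b001", "3606031", "2021-06-25T21:48:04"),
    ("bell-b002", "3598003", "2021-06-26T06:34:30"),
]

by_node = sorted(rows, key=lambda r: r[0])  # default ordering
by_time = sorted(rows, key=lambda r: r[2])  # what -t / --time asks for

for node, job_id, end in by_time:
    print(f"{node:<12}  {job_id:<12}  {end}")
```

With only these three rows, the time-sorted order is bell-b001, bell-b002, bell-b000. The real `-t` listing above also includes bell-b004, which the first example cut off with `head -3`.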

Author:

Lev Gorenstein, Purdue University Research Computing, 2021.

Contribute: https://github.com/lgorenstein/longest-job
