From 7288da34d2ea54b041916d8e59910b4db6f382da Mon Sep 17 00:00:00 2001
From: Parwana Osmani
Date: Wed, 1 Oct 2025 11:42:29 -0700
Subject: [PATCH 1/5] Added GPU GRES info

---
 docs/scheduler/resources.md | 100 ++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

diff --git a/docs/scheduler/resources.md b/docs/scheduler/resources.md
index 2997a38a..5682081f 100644
--- a/docs/scheduler/resources.md
+++ b/docs/scheduler/resources.md
@@ -277,3 +277,103 @@ srun: launch/slurm: _step_signal: Terminating StepId=706.0
 
 ### GPUs / GRES
+####Requesting GPU Resources (GRES / GPUs)
+
+To use GPU-equipped nodes, you must request the GPU resource via Slurm’s Generic RESources (GRES) system. Below are guidelines and examples:
+
+**Basic syntax**
+
+Add the following option to your `sbatch` or `srun` command:
+
+`--gres=gpu:<count>`
+
+- `gpu` is the generic resource name.
+
+- `<count>` is the number of GPUs you need (e.g. 1, 2, etc.).
+
+Example:
+`#SBATCH --gres=gpu:1`
+
+This requests 1 GPU on whichever node your job is scheduled.
+
+You may also combine it with other resource flags, for example:
+
+```console
+#SBATCH --nodes=1
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=4
+#SBATCH --gres=gpu:1
+#SBATCH --mem=32G
+#SBATCH --time=04:00:00
+```
+
+This requests:
+
+- 1 node
+- 1 task
+- 4 CPU cores for that task
+- 32 GB of memory
+- 1 GPU
+- a time limit of 4 hours
+
+
+####Partition / QOS constraints
+
+Some GPU nodes may only be available in certain partitions (e.g. gpu, gpu-rt, etc.). Be sure to request the GPU-compatible partition, e.g.:
+
+`#SBATCH --partition=gpu`
+
+
+Your account or QOS may also impose limits on how many GPUs you’re allowed to use concurrently. The cluster scheduler enforces those limits.
+
+You can check your associations via:
+
+`sacctmgr show assoc user=$USER`
+
+OR
+
+run: `/opt/hpccf/bin/slurm-show-resources.py --full`
+
+####Memory per GPU (optional)
+
+If your cluster supports it, you can also specify memory per GPU using:
+
+`--mem-per-gpu=<size>`
+
+This ensures your job is allocated sufficient memory in proportion to the number of GPUs requested.
+
+
+Example job script (GPU job)
+
+Here’s a minimal sbatch script requesting one GPU:
+
+```console
+#!/bin/bash
+#SBATCH --job-name=my_gpu_job
+#SBATCH --output=out_%j.txt
+#SBATCH --error=err_%j.txt
+#SBATCH --partition=gpu
+#SBATCH --nodes=1
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=4
+#SBATCH --mem=16G
+#SBATCH --gres=gpu:1
+#SBATCH --time=02:00:00
+
+# load modules or your environment
+module load cuda/11.8
+
+# run your program
+./my_gpu_application --some-option
+```
+
+####Tips & Caveats
+
+If you request more GPUs than are available in a partition (or more than your allocation allows), your job will remain pending until resources free up.
+
+Don’t request GPUs unless your application actually uses them; unnecessary GPU requests may starve other users.
+
+Always check cluster-specific gpu partition names using `sinfo`. The resource name might be different.
+
+Use GPU-monitoring tools (e.g. nvidia-smi) inside your job to verify you got the GPU(s) you requested.
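For example, one quick way to apply that last tip is a short interactive test job. This is only a sketch: the partition name and time limit are placeholders, and whether Slurm exports `CUDA_VISIBLE_DEVICES` depends on the cluster's GRES configuration, so check the values locally before relying on them.

```console
# Request a brief interactive shell with a single GPU
# (partition name and time limit are placeholders; use values valid on your cluster)
srun --partition=gpu --gres=gpu:1 --cpus-per-task=1 --time=00:10:00 --pty bash

# Inside the job, confirm which GPU(s) were actually assigned
nvidia-smi

# Slurm typically exports the allocated GPU indices here when GRES is configured
echo "$CUDA_VISIBLE_DEVICES"
```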
From be565e72255aad76089ebc99c794e52ad93abe09 Mon Sep 17 00:00:00 2001
From: Parwana Osmani
Date: Wed, 1 Oct 2025 14:03:30 -0700
Subject: [PATCH 2/5] added more info on gres/gpue

---
 docs/scheduler/resources.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/docs/scheduler/resources.md b/docs/scheduler/resources.md
index 5682081f..43d93119 100644
--- a/docs/scheduler/resources.md
+++ b/docs/scheduler/resources.md
@@ -344,8 +344,6 @@ If your cluster supports it, you can also specify memory per GPU using:
 
 This ensures your job is allocated sufficient memory in proportion to the number of GPUs requested.
 
-Example job script (GPU job)
-
 Here’s a minimal sbatch script requesting one GPU:
 
 ```console

From 798dea8cc84ab275447b8a81bb88a9d36bdc8d36 Mon Sep 17 00:00:00 2001
From: Parwana Osmani
Date: Thu, 2 Oct 2025 17:01:39 -0700
Subject: [PATCH 3/5] gpu-edited

---
 docs/scheduler/resources.md | 65 ++++---------------------------------
 1 file changed, 7 insertions(+), 58 deletions(-)

diff --git a/docs/scheduler/resources.md b/docs/scheduler/resources.md
index 43d93119..a8b93b58 100644
--- a/docs/scheduler/resources.md
+++ b/docs/scheduler/resources.md
@@ -277,7 +277,7 @@ srun: launch/slurm: _step_signal: Terminating StepId=706.0
 
 ### GPUs / GRES
-####Requesting GPU Resources (GRES / GPUs)
+#### Requesting GPU Resources (GRES / GPUs)
 
 To use GPU-equipped nodes, you must request the GPU resource via Slurm’s Generic RESources (GRES) system. Below are guidelines and examples:
 
@@ -299,79 +299,28 @@ This requests 1 GPU on whichever node your job is scheduled.
 
 You may also combine it with other resource flags, for example:
 
 ```console
-#SBATCH --nodes=1
-#SBATCH --ntasks=1
 #SBATCH --cpus-per-task=4
 #SBATCH --gres=gpu:1
-#SBATCH --mem=32G
-#SBATCH --time=04:00:00
 ```
 
-This requests:
-
-- 1 node
-- 1 task
-- 4 CPU cores for that task
-- 32 GB of memory
-- 1 GPU
-- a time limit of 4 hours
-
-
+#### Partition / QOS constraints
+Some GPU nodes may only be available in certain partitions (e.g. gpu-a100 on Hive, gpul on Farm cluster and cnsdept-gpu on Franklin cluster). Be sure to request the GPU-compatible partition, e.g.:
 
-####Partition / QOS constraints
-
-Some GPU nodes may only be available in certain partitions (e.g. gpu, gpu-rt, etc.). Be sure to request the GPU-compatible partition, e.g.:
 
-`#SBATCH --partition=gpu`
+`#SBATCH --partition=gpul`
 
 
 Your account or QOS may also impose limits on how many GPUs you’re allowed to use concurrently. The cluster scheduler enforces those limits.
 
 You can check your associations via:
 
-`sacctmgr show assoc user=$USER`
-
-OR
-
-run: `/opt/hpccf/bin/slurm-show-resources.py --full`
-
-####Memory per GPU (optional)
-
-If your cluster supports it, you can also specify memory per GPU using:
-
-`--mem-per-gpu=<size>`
-
-This ensures your job is allocated sufficient memory in proportion to the number of GPUs requested.
+ `/opt/hpccf/bin/slurm-show-resources.py --full`
 
 
-Here’s a minimal sbatch script requesting one GPU:
-
-```console
-#!/bin/bash
-#SBATCH --job-name=my_gpu_job
-#SBATCH --output=out_%j.txt
-#SBATCH --error=err_%j.txt
-#SBATCH --partition=gpu
-#SBATCH --nodes=1
-#SBATCH --ntasks=1
-#SBATCH --cpus-per-task=4
-#SBATCH --mem=16G
-#SBATCH --gres=gpu:1
-#SBATCH --time=02:00:00
-
-# load modules or your environment
-module load cuda/11.8
-
-# run your program
-./my_gpu_application --some-option
-```
-
-####Tips & Caveats
-
-If you request more GPUs than are available in a partition (or more than your allocation allows), your job will remain pending until resources free up.
+You can view the information about a GPU partition using the command:
+
+`scontrol show partition <partition_name>`
 
-Always check cluster-specific gpu partition names using `sinfo`. The resource name might be different.
-
-Use GPU-monitoring tools (e.g. nvidia-smi) inside your job to verify you got the GPU(s) you requested.

From ce0250419d86fd23bf7097a3617c4b85170e7214 Mon Sep 17 00:00:00 2001
From: Parwana Osmani
Date: Thu, 2 Oct 2025 17:03:58 -0700
Subject: [PATCH 4/5] gpu-edited

---
 docs/scheduler/resources.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/scheduler/resources.md b/docs/scheduler/resources.md
index a8b93b58..7bcdd7a3 100644
--- a/docs/scheduler/resources.md
+++ b/docs/scheduler/resources.md
@@ -287,7 +287,7 @@ Add the following option to your `sbatch` or `srun` command:
 
 `--gres=gpu:<count>`
 
-- gpu is the generic resource name.
+- `gpu` is the generic resource name.
 
 - `<count>` is the number of GPUs you need (e.g. 1, 2, etc.).
 
 Example:
 `#SBATCH --gres=gpu:1`
@@ -306,7 +306,7 @@ You may also combine it with other resource flags, for example:
 
 #### Partition / QOS constraints
 
-Some GPU nodes may only be available in certain partitions (e.g. gpu-a100 on Hive, gpul on Farm cluster and cnsdept-gpu on Franklin cluster). Be sure to request the GPU-compatible partition, e.g.:
+Some GPU nodes may only be available in certain partitions (e.g. `gpu-a100` on Hive, `gpul` on Farm cluster and `cnsdept-gpu` on Franklin cluster). Be sure to request the GPU-compatible partition, e.g.:
 
 `#SBATCH --partition=gpul`

From 2e7c6e1f9258c52a8388a74553d769eeaf1259fc Mon Sep 17 00:00:00 2001
From: Parwana Osmani
Date: Thu, 9 Oct 2025 14:24:56 -0700
Subject: [PATCH 5/5] last edit of reosurces-gpu

---
 docs/scheduler/resources.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/scheduler/resources.md b/docs/scheduler/resources.md
index 4b6f9444..ec8ee44b 100644
--- a/docs/scheduler/resources.md
+++ b/docs/scheduler/resources.md
@@ -311,7 +311,7 @@ Some GPU nodes may only be available in certain partitions (e.g. `gpu-a100` on Hive, `gpul` on Farm cluster and `cnsdept-gpu` on Franklin cluster). Be sure to request the GPU-compatible partition, e.g.:
 
 `#SBATCH --partition=gpul`
 
-Your account or QOS may also impose limits on how many GPUs you’re allowed to use concurrently. The cluster scheduler enforces those limits.
+Your account or QOS may also impose limits on how many GPUs you are allowed to use concurrently. The cluster scheduler enforces those limits.
 
 You can check your associations via:
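As a sketch of how the partition-inspection commands above might be combined before submitting a GPU job: the `gpul` name below is just the Farm example from this page, so substitute whatever partition `sinfo` actually reports on your cluster.

```console
# List partitions together with the GRES their nodes advertise
sinfo -o "%P %G %D %N"

# Show the limits and node list for one GPU partition
# (replace gpul with a partition name taken from the sinfo output)
scontrol show partition gpul
```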