# Deploying with Accelerators in the Cluster Toolkit

## Supported modules

### Accelerator definition automation

The schedmd-slurm-gcp-v5 modules (node-group, controller, and login), the vm-instance module, and any module relying on vm-instance (HTCondor, Omnia, PBS Pro) support automation for defining the `guest_accelerator` configuration. If the user supplies any value for this setting, the automation is bypassed.

The automation is handled primarily by the gpu_definition.tf file in these modules. This file assumes the existence of two input variables in the module:

- `guest_accelerator`: a list of Terraform objects with the attributes `type` and `count`.
- `machine_type`: defines the machine type of the VM being created.

gpu_definition.tf works by checking the machine_type, associating it with a GPU type, and extracting the GPU count. For example, consider the following machine types:

- `a2-highgpu-4g`
  - `type` will be set to `nvidia-tesla-a100`
  - `count` will be set to 4
- `a2-ultragpu-8g`
  - `type` will be set to `nvidia-a100-80gb`
  - `count` will be set to 8
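The mapping above can be sketched in shell. This is a hypothetical illustration of the lookup that gpu_definition.tf performs, not the actual Terraform code; the two case arms cover only the machine types listed above.

```shell
#!/bin/sh
# Hypothetical sketch of the gpu_definition.tf lookup: derive the GPU
# type from the a2 machine type name and the count from the trailing
# "<N>g" suffix.
machine_type="a2-ultragpu-8g"

case "$machine_type" in
  a2-ultragpu-*) gpu_type="nvidia-a100-80gb" ;;
  a2-highgpu-*)  gpu_type="nvidia-tesla-a100" ;;
  *)             gpu_type="" ;;
esac

# Extract the accelerator count, e.g. "8" from "a2-ultragpu-8g".
gpu_count=$(echo "$machine_type" | sed -n 's/.*-\([0-9][0-9]*\)g$/\1/p')

echo "type=$gpu_type count=$gpu_count"
```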

This automation currently only supports the a2 machine family. n1 machine types can also have guest accelerators attached; however, their type and count cannot be determined automatically as they can for a2.
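Since the type and count cannot be inferred for n1, they must be supplied explicitly. The following blueprint fragment is a hypothetical example: the module id, machine type, and the `nvidia-tesla-t4` accelerator are illustrative values, not prescribed by the Toolkit.

```yaml
- id: gpu-vm
  source: modules/compute/vm-instance
  settings:
    machine_type: n1-standard-32
    guest_accelerator:
    - type: nvidia-tesla-t4  # illustrative; use any accelerator valid for n1
      count: 4
```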

## Troubleshooting and tips

- To list accelerator types and their availability by region, run `gcloud compute accelerator-types list`. The same information is available in the Google Cloud documentation.
- VMs with many guest accelerators can take longer to deploy. See the Timeouts when deploying a compute VM section below if you experience timeouts because of this.
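The accelerator listing can be narrowed with gcloud's standard `--filter` flag; the zone below is illustrative.

```shell
# List accelerator types available in a single zone (zone is illustrative).
gcloud compute accelerator-types list --filter="zone:us-central1-a"
```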

## Slurm on GCP

When deploying a Slurm cluster with GPUs, we highly recommend using the modules based on Slurm on GCP version 5 (schedmd-slurm-gcp-v5-*). The interface is more consistent with Cluster Toolkit standards, and more functionality is available to support, debug, and work around any issues related to GPU resources.

### Interface Considerations

The Slurm on GCP v5 Cluster Toolkit modules (schedmd-slurm-gcp-v5-*) have two variables that can be used to define attached GPUs. The variable `guest_accelerator` is the recommended option, as it is consistent with other modules in the Cluster Toolkit. The setting `gpus` can also be used, which provides consistency with the underlying Terraform modules from the Slurm on GCP repository.
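As a hypothetical example, a node group could attach GPUs via the Toolkit-standard variable as below; the module id and accelerator values are illustrative.

```yaml
- id: gpu_node_group
  source: community/modules/compute/schedmd-slurm-gcp-v5-node-group
  settings:
    machine_type: a2-highgpu-4g
    # Optional for a2: omit to let the automation infer type and count.
    guest_accelerator:
    - type: nvidia-tesla-a100
      count: 4
```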

### Timeouts when deploying a compute VM

As mentioned above, VMs with many guest accelerators can take longer to deploy. Slurm sets timeouts for creating VMs, and configurations with many GPUs can exceed the default timeout. We recommend using the Slurm on GCP v5 modules.

The v5 Toolkit modules (schedmd-slurm-gcp-v5-*) allow Slurm timeouts to be customized via the cloud_parameters variable on the controller. The example below increases resume_timeout from the default of 300s to 600s:

```yaml
- id: slurm_controller
  source: community/modules/scheduler/schedmd-slurm-gcp-v5-controller
  use: [...]
  settings:
    cloud_parameters:
      resume_rate: 0
      resume_timeout: 600  # Update this value, default is 300
      suspend_rate: 0
      suspend_timeout: 300
      no_comma_params: false
    ...
```

### Launching Slurm jobs with GPUs

To use the GPUs in the compute VMs deployed by Slurm, the GPU count must be specified when submitting a job with srun or sbatch. For instance, the following srun command launches a job that runs nvidia-smi in a partition called gpu_partition (`-p gpu_partition`) on a full node (`-N 1`) with 8 GPUs (`--gpus 8`):

```shell
srun -N 1 -p gpu_partition --gpus 8 nvidia-smi
```

An equivalent sbatch script:

```shell
#!/bin/bash

#SBATCH --nodes=1
#SBATCH --partition=gpu_partition
#SBATCH --gpus=8

nvidia-smi
```

Both commands support further customization of GPU resources. For more information, see the SchedMD documentation for srun and sbatch.
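For example, standard Slurm options allow GPUs to be requested per node rather than as a job-wide total; the partition name and values below are illustrative.

```shell
# Request 4 GPUs on each of 2 nodes (--gpus-per-node is a standard
# Slurm option; partition and counts are illustrative).
srun -N 2 -p gpu_partition --gpus-per-node=4 nvidia-smi
```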

## Further Reading