Introduce multi-node training setup #26

Open · wants to merge 5 commits into main
Conversation

@sadamov (Collaborator) commented May 4, 2024

Enable multi-node GPU training with SLURM

This PR adds support for multi-node GPU training using the SLURM job scheduler. The changes allow the code to detect if it is running within a SLURM job and automatically configure the number of devices and nodes based on the SLURM environment variables.

Key changes

  • Set use_distributed_sampler to True when not in evaluation mode to enable distributed training
  • Detect if running within a SLURM job by checking for the SLURM_JOB_ID environment variable
  • If running with SLURM (see the sketch after this list):
    • Set the number of devices per node (devices) based on the SLURM_GPUS_PER_NODE environment variable, falling back to torch.cuda.device_count() if not set
    • Set the total number of nodes (num_nodes) based on the SLURM_JOB_NUM_NODES environment variable, defaulting to 1 if not set
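For illustration, here is a minimal sketch of the detection logic above, assuming PyTorch Lightning ≥ 2.0 (for the `use_distributed_sampler` flag); the function name and the Trainer wiring are illustrative, not the PR's actual diff:

```python
import os

import pytorch_lightning as pl
import torch


def slurm_devices_and_nodes():
    """Infer GPUs per node and node count, preferring SLURM's environment."""
    if "SLURM_JOB_ID" in os.environ:
        # Inside a SLURM job: trust the scheduler's allocation, falling back
        # to the locally visible GPU count / a single node if a variable is
        # unset. (Assumes SLURM_GPUS_PER_NODE is a plain integer; some
        # clusters report e.g. "a100:4".)
        devices = int(
            os.environ.get("SLURM_GPUS_PER_NODE", torch.cuda.device_count())
        )
        num_nodes = int(os.environ.get("SLURM_JOB_NUM_NODES", 1))
    else:
        # Not under SLURM: single node, whatever GPUs are visible locally.
        devices = max(torch.cuda.device_count(), 1)
        num_nodes = 1
    return devices, num_nodes


devices, num_nodes = slurm_devices_and_nodes()
trainer = pl.Trainer(
    accelerator="gpu" if torch.cuda.is_available() else "cpu",
    devices=devices,
    num_nodes=num_nodes,
    # Enable the distributed sampler for training (the PR turns it off in
    # evaluation mode).
    use_distributed_sampler=True,
)
```

Inside a SLURM batch job the scheduler sets SLURM_JOB_ID and SLURM_JOB_NUM_NODES (and SLURM_GPUS_PER_NODE when GPUs are requested per node), so the same training script can scale from a single workstation to a multi-node allocation without extra flags.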

Rationale for using SLURM

SLURM (Simple Linux Utility for Resource Management) is a widely used job scheduler and resource manager for high-performance computing (HPC) clusters. It provides a convenient way to allocate and manage resources, including GPUs, across multiple nodes in a cluster.

By leveraging SLURM, we can easily scale our training to utilize multiple GPUs across multiple nodes without the need for manual configuration.

@sadamov requested a review from joeloskarsson May 4, 2024 13:44
@sadamov added the enhancement label May 4, 2024
@sadamov requested a review from leifdenby May 14, 2024 05:32
@joeloskarsson (Collaborator) left a comment

Tested this on multi-GPU without any problems. Will test multi-node on our cluster as soon as I can get my hands on more than one node.

Review comment on train_model.py (resolved)
@leifdenby changed the title from "Introduces multi-node training setup" to "Introduce multi-node training setup" May 30, 2024
@joeloskarsson (Collaborator) commented

An update on my testing of this: the SLURM variables are read correctly on our cluster as well, but I have yet to get multi-node training working. I think this is unrelated to this code, though, and rather due to me not having the correct setup for running multi-node jobs on our cluster. I will ask around to see if I can get it working.

In the meantime, @leifdenby (or anyone at DMI 😄), do you have a SLURM setup that you could test this on? I just think it's a good idea to test on multiple different clusters to make sure that this is general enough.

@sadamov (Collaborator, Author) commented Jun 7, 2024

I have implemented the latest feedback, updated the CHANGELOG, and added an example SLURM submission script to /docs/examples (is that a good location?), as discussed with @leifdenby. A small new section was also added to the README.md.
@joeloskarsson yes, every cluster is different and I also have to adapt my submission scripts after major changes. Do you have ticket support with your HPC provider? They usually know what to do...
