-
Notifications
You must be signed in to change notification settings - Fork 12
Add Slurm Cluster documentation #306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
20 commits
Select commit
Hold shift + click to select a range
2cfcde9
Init
muhsinking a549883
First draft
muhsinking 7a923fc
update
muhsinking e5521ee
Update sidebar titles
muhsinking 17ee519
Merge branch 'main' into ic-slurm-v2
muhsinking 868d14c
update
muhsinking d6a625a
remove unused sections
muhsinking 33696d1
link to slurm docs
muhsinking 27b3cea
Update
muhsinking dfcbf0e
update
muhsinking aebb1c7
Add troubleshooting
muhsinking e2545b9
update ic title
muhsinking 7913346
Update
muhsinking 67f04b4
Merge branch 'main' into ic-slurm-v2
muhsinking f4a382e
info -> tip
muhsinking a7335dc
Update
muhsinking 58a2c5f
Update
muhsinking 77e9b05
Update
muhsinking ff76d26
Merge branch 'main' into ic-slurm-v2
muhsinking fb368bd
Merge branch 'main' into ic-slurm-v2
muhsinking File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,104 @@ | ||
--- | ||
title: Slurm Clusters | ||
sidebarTitle: Slurm Clusters | ||
description: Deploy Slurm Clusters on Runpod with zero configuration | ||
tag: "BETA" | ||
--- | ||
|
||
<Note> | ||
Slurm Clusters are currently in beta. If you'd like to provide feedback, please [join our Discord](https://discord.gg/runpod). | ||
</Note> | ||
|
||
Runpod Slurm Clusters provide a managed high-performance computing and scheduling solution that enables you to rapidly create and manage Slurm Clusters with minimal setup. | ||
|
||
For more information on working with Slurm, refer to the [Slurm documentation](https://slurm.schedmd.com/documentation.html). | ||
|
||
## Key features | ||
|
||
Slurm Clusters eliminate the traditional complexity of cluster orchestration by providing: | ||
|
||
- **Zero configuration setup:** Slurm and munge are pre-installed and fully configured. | ||
- **Instant provisioning:** Clusters deploy rapidly with minimal setup. | ||
- **Automatic role assignment:** Runpod automatically designates controller and agent nodes. | ||
- **Built-in optimizations:** Pre-configured for optimal NCCL performance. | ||
- **Full Slurm compatibility:** All standard Slurm commands work out-of-the-box. | ||
|
||
<Tip> | ||
|
||
If you prefer to manually configure your Slurm deployment, see [Deploy an Instant Cluster with Slurm (unmanaged)](/instant-clusters/slurm) for a step-by-step guide. | ||
|
||
</Tip> | ||
|
||
## Deploy a Slurm Cluster | ||
|
||
1. Open the [Instant Clusters page](https://console.runpod.io/cluster) on the Runpod console. | ||
2. Click **Create Cluster**. | ||
3. Select **Slurm Cluster** from the cluster type dropdown menu. | ||
4. Configure your cluster specifications: | ||
- **Cluster name**: Enter a descriptive name for your cluster. | ||
- **Pod count**: Choose the number of Pods in your cluster. | ||
- **GPU type**: Select your preferred [GPU type](/references/gpu-types). | ||
- **Region**: Choose your deployment region. | ||
- **Network volume** (optional): Add a [network volume](/pods/storage/create-network-volumes) for persistent/shared storage. If using a network volume, ensure the region matches your cluster region. | ||
- **Pod template**: Select a [Pod template](/pods/templates/overview) or click **Edit Template** to customize start commands, environment variables, ports, or [container/volume disk](/pods/storage/types) capacity. | ||
<Warning> | ||
Slurm Clusters currently only support official Runpod Pytorch images. If you deploy using a different image, the Slurm process will not start. | ||
</Warning> | ||
5. Click **Deploy Cluster**. | ||
|
||
## Connect to a Slurm Cluster | ||
|
||
Once deployment completes, you can access your cluster from the [Instant Clusters page](https://console.runpod.io/cluster). | ||
|
||
From this page you can select a cluster to view it's component nodes, including a label indicating the **Slurm controller** (primary node) and **Slurm agents** (secondary nodes). Expand a node to view details like availability, GPU/storage utilization, and options for connection and management. | ||
|
||
Connect to a node using the **Connect** button, or using any of the [connection methods supported by Pods](/pods/connect-to-a-pod). | ||
|
||
## Submit and manage jobs | ||
|
||
All standard Slurm commands are available without configuration. For example, you can: | ||
|
||
Check cluster status and available resources: | ||
|
||
```bash | ||
sinfo | ||
``` | ||
|
||
Submit a job to the cluster from the Slurm controller node: | ||
|
||
```bash | ||
sbatch your-job-script.sh | ||
``` | ||
|
||
Monitor job queue and status: | ||
|
||
```bash | ||
squeue | ||
``` | ||
|
||
View detailed job information from the Slurm controller node: | ||
|
||
```bash | ||
scontrol show job JOB_ID | ||
``` | ||
|
||
You can find the output of Slurm agents in their individual container logs. | ||
|
||
## Advanced configuration | ||
|
||
While Runpod's Slurm Clusters work out-of-the-box, you can customize your configuration by connecting to the Slurm controller node using the [web terminal or SSH](/pods/connect-to-pods). | ||
|
||
Access Slurm configuration files in their standard locations: | ||
- `/etc/slurm/slurm.conf` - Main configuration file. | ||
- `/etc/slurm/gres.conf` - Generic resource configuration. | ||
|
||
Modify these files as needed for your specific requirements. | ||
|
||
## Troubleshooting | ||
|
||
If you encounter issues with your Slurm Cluster, try the following: | ||
|
||
- **Jobs stuck in pending state:** Check resource availability with `sinfo` and ensure requested resources are available. If you need more resources, you can add more nodes to your cluster. | ||
- **Authentication errors:** Munge is pre-configured, but if issues arise, verify the munge service is running on all nodes. | ||
|
||
For additional support, contact [Runpod support](https://contact.runpod.io/) with your cluster ID and specific error messages. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.