@r4victor r4victor commented Oct 22, 2025

#2802

The PR adds support for Runpod Instant clusters. dstack can now run multi-node tasks (2-8 nodes) on Runpod.

Implementation details:

  • Added a new Compute mixin ComputeWithGroupProvisioningSupport for backends that can provision/delete multiple instances at once.
  • Implemented ComputeWithGroupProvisioningSupport for Runpod.
  • Added ComputeGroupModel, which is used to terminate the cluster once all of its instances are terminating.
  • process_submitted_jobs can now provision all jobs of a multi-node task at once if the backend supports it (this happens when provisioning the master job). Introduced JobModel.waiting_master_job to avoid race conditions with one-by-one job processing.
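A minimal sketch of how the mixin and the group termination rule described above could fit together. Only the names ComputeWithGroupProvisioningSupport and ComputeGroupModel come from this PR; the method names (create_compute_group, terminate_compute_group), the ComputeGroup and Instance dataclasses, FakeClusterCompute, and process_compute_group are hypothetical and written for illustration, not taken from the dstack codebase.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class InstanceStatus(str, Enum):
    PROVISIONING = "provisioning"
    TERMINATING = "terminating"
    TERMINATED = "terminated"


@dataclass
class Instance:
    name: str
    status: InstanceStatus = InstanceStatus.PROVISIONING


@dataclass
class ComputeGroup:
    """Hypothetical stand-in for ComputeGroupModel: one cluster and its instances."""

    group_id: str
    instances: List[Instance] = field(default_factory=list)
    deleted: bool = False


class ComputeWithGroupProvisioningSupport(ABC):
    """Mixin for backends that can provision/delete multiple instances at once.

    The method names below are assumptions, not the actual dstack interface.
    """

    @abstractmethod
    def create_compute_group(self, group_id: str, node_count: int) -> ComputeGroup:
        ...

    @abstractmethod
    def terminate_compute_group(self, group: ComputeGroup) -> None:
        ...


class FakeClusterCompute(ComputeWithGroupProvisioningSupport):
    """In-memory backend used only to exercise the sketch."""

    def create_compute_group(self, group_id: str, node_count: int) -> ComputeGroup:
        # The PR description states that multi-node tasks use 2-8 nodes.
        if not 2 <= node_count <= 8:
            raise ValueError("a cluster needs between 2 and 8 nodes")
        instances = [Instance(f"{group_id}-{i}") for i in range(node_count)]
        return ComputeGroup(group_id=group_id, instances=instances)

    def terminate_compute_group(self, group: ComputeGroup) -> None:
        for instance in group.instances:
            instance.status = InstanceStatus.TERMINATED
        group.deleted = True


def process_compute_group(
    compute: ComputeWithGroupProvisioningSupport, group: ComputeGroup
) -> bool:
    """Delete the cluster only once every instance is terminating.

    Returns True if the group was terminated during this call.
    """
    if group.deleted:
        return False
    if all(i.status == InstanceStatus.TERMINATING for i in group.instances):
        compute.terminate_compute_group(group)
        return True
    return False
```

In this sketch the cluster is kept as long as at least one instance is not yet terminating, which mirrors the rule stated for ComputeGroupModel; the real implementation may track additional states.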

Notes:

  • Tested only on 2x H100:8 and 2x A100:8 clusters because other configurations were not available at the time.
  • NCCL tests work but the bandwidth is suboptimal. This is a Runpod internal issue and they are on it.

TODO:

  • templated_id is currently required by the Runpod API when creating a cluster. Waiting for Runpod to make it optional. For now, the pytorch template is hardcoded.

@r4victor r4victor merged commit 1c56561 into master Oct 28, 2025
28 checks passed
@r4victor r4victor deleted the issue_2802_runpod_instant_clusters branch October 28, 2025 10:32