Add example for distributed pytorch
Michaelvll committed Dec 12, 2024
1 parent bc4a42e commit 179520e
Showing 3 changed files with 81 additions and 0 deletions.
31 changes: 31 additions & 0 deletions examples/distributed-pytorch/README.md
@@ -0,0 +1,31 @@
# Distributed Training with PyTorch

This example demonstrates how to run distributed training with PyTorch using SkyPilot.

**The example is based on [PyTorch's official minGPT example](https://github.com/pytorch/examples/tree/main/distributed/minGPT-ddp)**

## Using distributed training


The following command spawns two nodes, each with one L4 GPU:

`sky launch -c train train.yaml`

In [train.yaml](./train.yaml), we use `torchrun` to launch the training, setting the distributed-training arguments from environment variables provided by SkyPilot.

```yaml
run: |
  cd examples/mingpt
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --master_port=8008 \
    --node_rank=${SKYPILOT_NODE_RANK} \
    main.py
```
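SkyPilot sets `SKYPILOT_NODE_IPS` to a newline-separated list of the nodes' IP addresses, with the rank-0 (head) node first, which is why the first line can serve as the master address. A quick local sketch of that extraction (the IPs and rank below are illustrative; SkyPilot injects the real values at runtime):

```shell
# Simulated values -- on a real cluster, SkyPilot sets these for the run section.
SKYPILOT_NODE_IPS=$'10.0.0.1\n10.0.0.2'
SKYPILOT_NODE_RANK=1

# Same extraction as in the run section above: first line = head node IP.
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "master=$MASTER_ADDR rank=$SKYPILOT_NODE_RANK"
```

Every node executes the same `run` section; `--node_rank` is what distinguishes the head node (rank 0) from the workers.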

## Using `rdzv`

`rdzv` (rendezvous) is an alternative way for the nodes to coordinate: instead of passing an explicit `--master_addr` and `--node_rank` to every node, the processes discover each other through a rendezvous backend. Launch it with [train-rdzv.yaml](./train-rdzv.yaml):

`sky launch -c train-rdzv train-rdzv.yaml`
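With the rendezvous backend, the `run` section (from [train-rdzv.yaml](./train-rdzv.yaml) below) drops the `--master_addr`/`--node_rank` flags and instead points all nodes at a shared `c10d` rendezvous endpoint on the head node:

```yaml
run: |
  cd examples/mingpt
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:8008 \
    --rdzv_id=$RANDOM \
    main.py
```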
25 changes: 25 additions & 0 deletions examples/distributed-pytorch/train-rdzv.yaml
@@ -0,0 +1,25 @@
name: minGPT-ddp-rdzv

resources:
cpus: 4+
accelerators: L4

num_nodes: 2

setup: |
  git clone --depth 1 https://github.com/pytorch/examples || true
  cd examples
  git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
  # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
  uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113

run: |
  cd examples/mingpt
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:8008 \
    --rdzv_id=$RANDOM \
    main.py
25 changes: 25 additions & 0 deletions examples/distributed-pytorch/train.yaml
@@ -0,0 +1,25 @@
name: minGPT-ddp

resources:
cpus: 4+
accelerators: L4

num_nodes: 2

setup: |
  git clone --depth 1 https://github.com/pytorch/examples || true
  cd examples
  git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
  # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
  uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113

run: |
  cd examples/mingpt
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --master_port=8008 \
    --node_rank=${SKYPILOT_NODE_RANK} \
    main.py
