diff --git a/examples/distributed-pytorch/README.md b/examples/distributed-pytorch/README.md
new file mode 100644
index 00000000000..9927568fb4b
--- /dev/null
+++ b/examples/distributed-pytorch/README.md
@@ -0,0 +1,49 @@
+# Distributed Training with PyTorch
+
+This example demonstrates how to run distributed training with PyTorch using SkyPilot.
+
+**The example is based on [PyTorch's official minGPT example](https://github.com/pytorch/examples/tree/main/distributed/minGPT-ddp).**
+
+## Using distributed training
+
+The following command spawns 2 nodes, each with 1 L4 GPU:
+
+`sky launch -c train train.yaml`
+
+In [train.yaml](./train.yaml), we use `torchrun` to launch the training and set the arguments for distributed training via the environment variables provided by SkyPilot (`SKYPILOT_NODE_IPS`, `SKYPILOT_NUM_NODES`, `SKYPILOT_NUM_GPUS_PER_NODE`, `SKYPILOT_NODE_RANK`).
+
+```yaml
+run: |
+  cd examples/mingpt
+  # The first line of $SKYPILOT_NODE_IPS is the head node's IP.
+  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
+  torchrun \
+    --nnodes=$SKYPILOT_NUM_NODES \
+    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
+    --master_addr=$MASTER_ADDR \
+    --master_port=8008 \
+    --node_rank=${SKYPILOT_NODE_RANK} \
+    main.py
+```
+
+## Using `rdzv`
+
+`rdzv` is `torchrun`'s built-in rendezvous mechanism: instead of passing `--master_addr` and `--node_rank` explicitly, workers discover each other through a shared endpoint. Launch it with:
+
+`sky launch -c train-rdzv train-rdzv.yaml`
+
+In [train-rdzv.yaml](./train-rdzv.yaml), node ranks are assigned automatically by the `c10d` rendezvous backend:
+
+```yaml
+run: |
+  cd examples/mingpt
+  # The first line of $SKYPILOT_NODE_IPS is the head node's IP.
+  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
+  torchrun \
+    --nnodes=$SKYPILOT_NUM_NODES \
+    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
+    --rdzv_backend=c10d \
+    --rdzv_endpoint=$MASTER_ADDR:8008 \
+    --rdzv_id $RANDOM \
+    main.py
+```
diff --git a/examples/distributed-pytorch/train-rdzv.yaml b/examples/distributed-pytorch/train-rdzv.yaml
new file mode 100644
index 00000000000..fdd880634a6
--- /dev/null
+++ b/examples/distributed-pytorch/train-rdzv.yaml
@@ -0,0 +1,26 @@
+name: minGPT-ddp-rdzv
+
+resources:
+  cpus: 4+
+  accelerators: L4
+
+num_nodes: 2
+
+setup: |
+  git clone --depth 1 https://github.com/pytorch/examples || true
+  cd examples
+  git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
+  # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5), so install cu113 wheels.
+  uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113
+
+run: |
+  cd examples/mingpt
+  # The first line of $SKYPILOT_NODE_IPS is the head node's IP.
+  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
+  torchrun \
+    --nnodes=$SKYPILOT_NUM_NODES \
+    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
+    --rdzv_backend=c10d \
+    --rdzv_endpoint=$MASTER_ADDR:8008 \
+    --rdzv_id $RANDOM \
+    main.py
diff --git a/examples/distributed-pytorch/train.yaml b/examples/distributed-pytorch/train.yaml
new file mode 100644
index 00000000000..aeb3aa975ec
--- /dev/null
+++ b/examples/distributed-pytorch/train.yaml
@@ -0,0 +1,26 @@
+name: minGPT-ddp
+
+resources:
+  cpus: 4+
+  accelerators: L4
+
+num_nodes: 2
+
+setup: |
+  git clone --depth 1 https://github.com/pytorch/examples || true
+  cd examples
+  git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
+  # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5), so install cu113 wheels.
+  uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113
+
+run: |
+  cd examples/mingpt
+  # The first line of $SKYPILOT_NODE_IPS is the head node's IP.
+  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
+  torchrun \
+    --nnodes=$SKYPILOT_NUM_NODES \
+    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
+    --master_addr=$MASTER_ADDR \
+    --master_port=8008 \
+    --node_rank=${SKYPILOT_NODE_RANK} \
+    main.py
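For illustration, here is how the `run` section of `train.yaml` resolves at runtime. This is a sketch with hypothetical values: the IP addresses below are made up, and SkyPilot injects the real ones through the environment variables shown.

```bash
# Hypothetical values SkyPilot would set on node 0 of this 2-node, 1x L4 cluster:
#   SKYPILOT_NODE_IPS=$'10.0.0.1\n10.0.0.2'   # newline-separated; first IP is the head node
#   SKYPILOT_NUM_NODES=2
#   SKYPILOT_NUM_GPUS_PER_NODE=1
#   SKYPILOT_NODE_RANK=0                      # 1 on the second node
# With those values, the torchrun invocation in train.yaml expands to:
torchrun \
  --nnodes=2 \
  --nproc_per_node=1 \
  --master_addr=10.0.0.1 \
  --master_port=8008 \
  --node_rank=0 \
  main.py
```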
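And a minimal end-to-end workflow for trying either variant, assuming SkyPilot is installed with at least one cloud enabled (verify with `sky check`) and the YAMLs above are saved locally:

```bash
# Launch the node_rank-based variant on a 2-node cluster named "train".
sky launch -c train train.yaml
# Stream the training logs from the head node.
sky logs train
# Tear down the cluster when finished to stop incurring charges.
sky down train
```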