From ce1cb837e945c96eff57d5d3385c41cab33506f2 Mon Sep 17 00:00:00 2001
From: Zhanghao Wu
Date: Wed, 18 Dec 2024 11:26:24 -0800
Subject: [PATCH] [Example] PyTorch distributed training with minGPT (#4464)

* Add example for distributed pytorch

* update

* Update examples/distributed-pytorch/README.md

Co-authored-by: Romil Bhardwaj

* Update examples/distributed-pytorch/README.md

Co-authored-by: Romil Bhardwaj

* Update examples/distributed-pytorch/README.md

Co-authored-by: Romil Bhardwaj

* Update examples/distributed-pytorch/README.md

Co-authored-by: Romil Bhardwaj

* Update examples/distributed-pytorch/README.md

Co-authored-by: Romil Bhardwaj

* Update examples/distributed-pytorch/README.md

Co-authored-by: Romil Bhardwaj

* Update examples/distributed-pytorch/README.md

Co-authored-by: Romil Bhardwaj

* Fix

---------

Co-authored-by: Romil Bhardwaj
---
 examples/distributed-pytorch/README.md       | 81 ++++++++++++++++++++
 examples/distributed-pytorch/train-rdzv.yaml | 29 +++++++
 examples/distributed-pytorch/train.yaml      | 29 +++++++
 3 files changed, 139 insertions(+)
 create mode 100644 examples/distributed-pytorch/README.md
 create mode 100644 examples/distributed-pytorch/train-rdzv.yaml
 create mode 100644 examples/distributed-pytorch/train.yaml

diff --git a/examples/distributed-pytorch/README.md b/examples/distributed-pytorch/README.md
new file mode 100644
index 00000000000..6c2f7092269
--- /dev/null
+++ b/examples/distributed-pytorch/README.md
@@ -0,0 +1,81 @@
+# Distributed Training with PyTorch
+
+This example demonstrates how to run distributed training with PyTorch using SkyPilot.
+
+**The example is based on [PyTorch's official minGPT example](https://github.com/pytorch/examples/tree/main/distributed/minGPT-ddp).**
+
+
+## Overview
+
+There are two ways to run distributed training with PyTorch:
+
+1. Using normal `torchrun`
+2. Using the `rdzv` backend
+
+The main difference between the two for fixed-size distributed training is that the `rdzv` backend automatically handles the rank of each node, while `torchrun` requires the rank to be set manually.
+
+SkyPilot offers convenient built-in environment variables to help you start distributed training easily.
+
+### Using normal `torchrun`
+
+The following command spawns 2 nodes, each with 1 L4 GPU:
+
+```
+sky launch -c train train.yaml
+```
+
+In [train.yaml](./train.yaml), we use `torchrun` to launch the training and set the arguments for distributed training using [environment variables](https://docs.skypilot.co/en/latest/running-jobs/environment-variables.html#skypilot-environment-variables) provided by SkyPilot.
+
+```yaml
+run: |
+  cd examples/mingpt
+  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
+  torchrun \
+    --nnodes=$SKYPILOT_NUM_NODES \
+    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
+    --master_addr=$MASTER_ADDR \
+    --master_port=8008 \
+    --node_rank=${SKYPILOT_NODE_RANK} \
+    main.py
+```
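+
+As a reference, the following sketch shows roughly what these environment variables hold on a hypothetical 2-node cluster (the IP addresses below are made up for illustration; SkyPilot sets the real values at runtime):
+
+```bash
+# SKYPILOT_NODE_IPS is a newline-separated list of the IPs of all nodes.
+SKYPILOT_NODE_IPS="10.0.0.1
+10.0.0.2"
+SKYPILOT_NUM_NODES=2           # Total number of nodes.
+SKYPILOT_NUM_GPUS_PER_NODE=1   # GPUs on each node.
+SKYPILOT_NODE_RANK=0           # 0 on the head node, 1 on the worker.
+
+# The head node's address is simply the first line of SKYPILOT_NODE_IPS:
+MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
+echo "$MASTER_ADDR"  # Prints 10.0.0.1.
+```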
+
+
+### Using the `rdzv` backend
+
+`rdzv` is an alternative backend for distributed training:
+
+```
+sky launch -c train-rdzv train-rdzv.yaml
+```
+
+In [train-rdzv.yaml](./train-rdzv.yaml), we use `torchrun` to launch the training and set the arguments for distributed training using [environment variables](https://docs.skypilot.co/en/latest/running-jobs/environment-variables.html#skypilot-environment-variables) provided by SkyPilot.
+
+```yaml
+run: |
+  cd examples/mingpt
+  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
+  echo "Starting distributed training, head node: $MASTER_ADDR"
+
+  torchrun \
+    --nnodes=$SKYPILOT_NUM_NODES \
+    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
+    --rdzv_backend=c10d \
+    --rdzv_endpoint=$MASTER_ADDR:29500 \
+    --rdzv_id $SKYPILOT_TASK_ID \
+    main.py
+```
+
+## Scale up
+
+If you would like to scale up the training, simply change the resource requirements; SkyPilot's built-in environment variables will be set accordingly and automatically.
+
+For example, the following command spawns 4 nodes, each with 4 L4 GPUs:
+
+```
+sky launch -c train train.yaml --num-nodes 4 --gpus L4:4 --cpus 8+
+```
+
+We also increase `--cpus` to 8+ so that performance is not bottlenecked by the CPU.

diff --git a/examples/distributed-pytorch/train-rdzv.yaml b/examples/distributed-pytorch/train-rdzv.yaml
new file mode 100644
index 00000000000..3bcd63dde4c
--- /dev/null
+++ b/examples/distributed-pytorch/train-rdzv.yaml
@@ -0,0 +1,29 @@
+name: minGPT-ddp-rdzv
+
+resources:
+  cpus: 4+
+  accelerators: L4
+
+num_nodes: 2
+
+setup: |
+  git clone --depth 1 https://github.com/pytorch/examples || true
+  cd examples
+  git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
+  # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
+  uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113
+
+run: |
+  cd examples/mingpt
+  export LOGLEVEL=INFO
+
+  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
+  echo "Starting distributed training, head node: $MASTER_ADDR"
+
+  torchrun \
+    --nnodes=$SKYPILOT_NUM_NODES \
+    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
+    --rdzv_backend=c10d \
+    --rdzv_endpoint=$MASTER_ADDR:29500 \
+    --rdzv_id $SKYPILOT_TASK_ID \
+    main.py

diff --git a/examples/distributed-pytorch/train.yaml b/examples/distributed-pytorch/train.yaml
new file mode 100644
index 00000000000..b45941e1485
--- /dev/null
+++ b/examples/distributed-pytorch/train.yaml
@@ -0,0 +1,29 @@
+name: minGPT-ddp
+
+resources:
+  cpus: 4+
+  accelerators: L4
+
+num_nodes: 2
+
+setup: |
+  git clone --depth 1 https://github.com/pytorch/examples || true
+  cd examples
+  git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
+  # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
+  uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113
+
+run: |
+  cd examples/mingpt
+  export LOGLEVEL=INFO
+
+  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
+  echo "Starting distributed training, head node: $MASTER_ADDR"
+
+  torchrun \
+    --nnodes=$SKYPILOT_NUM_NODES \
+    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
+    --master_addr=$MASTER_ADDR \
+    --master_port=8008 \
+    --node_rank=${SKYPILOT_NODE_RANK} \
+    main.py
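+
+  # NOTE: torchrun needs a unique --node_rank for every node. SKYPILOT_NODE_RANK
+  # provides exactly that: 0 on the head node and 1..N-1 on the workers.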