[Example] PyTorch distributed training with minGPT (#4464)
Co-authored-by: Romil Bhardwaj <[email protected]>
1 parent baf028d · commit ce1cb83 · 3 changed files with 139 additions and 0 deletions.
`examples/distributed-pytorch/README.md`:

# Distributed Training with PyTorch

This example demonstrates how to run distributed training with PyTorch using SkyPilot.

**The example is based on [PyTorch's official minGPT example](https://github.com/pytorch/examples/tree/main/distributed/minGPT-ddp).**

## Overview
There are two ways to run distributed training with PyTorch:

1. Using normal `torchrun`
2. Using the `rdzv` backend

The main difference between the two for fixed-size distributed training is that the `rdzv` backend handles the rank of each node automatically, while `torchrun` requires the node rank to be set manually.

SkyPilot offers convenient built-in environment variables to help you start distributed training easily.
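
For reference, here is a minimal sketch (the values in the comments are illustrative, not part of the example) of how these variables look inside a task's `run` section on a 2-node cluster:

```yaml
run: |
  # On a 2-node cluster with 1 L4 GPU per node, SkyPilot sets (illustrative values):
  #   SKYPILOT_NUM_NODES=2
  #   SKYPILOT_NUM_GPUS_PER_NODE=1
  #   SKYPILOT_NODE_RANK=0 on the head node, 1 on the other node
  #   SKYPILOT_NODE_IPS   (newline-separated node IPs, head node first)
  echo "Node $SKYPILOT_NODE_RANK of $SKYPILOT_NUM_NODES, $SKYPILOT_NUM_GPUS_PER_NODE GPU(s) per node"
  # The head node's IP (the first line) is used as the master/rendezvous address:
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Head node: $MASTER_ADDR"
```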

### Using normal `torchrun`

The following command will spawn 2 nodes, each with 1 L4 GPU:

```
sky launch -c train train.yaml
```

In [train.yaml](./train.yaml), we use `torchrun` to launch the training and set the arguments for distributed training using [environment variables](https://docs.skypilot.co/en/latest/running-jobs/environment-variables.html#skypilot-environment-variables) provided by SkyPilot.

```yaml
run: |
  cd examples/mingpt
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  # Plain torchrun requires the rank of each node to be set explicitly.
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --master_port=8008 \
    --node_rank=${SKYPILOT_NODE_RANK} \
    main.py
```

### Using the `rdzv` backend

`rdzv` is an alternative backend for distributed training:

```
sky launch -c train-rdzv train-rdzv.yaml
```

In [train-rdzv.yaml](./train-rdzv.yaml), we use `torchrun` to launch the training and set the arguments for distributed training using [environment variables](https://docs.skypilot.co/en/latest/running-jobs/environment-variables.html#skypilot-environment-variables) provided by SkyPilot.

```yaml
run: |
  cd examples/mingpt
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Starting distributed training, head node: $MASTER_ADDR"
  # The c10d rendezvous assigns node ranks automatically, so no --node_rank is
  # needed; --rdzv_id just needs to be an ID shared by all nodes of the job.
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:29500 \
    --rdzv_id $SKYPILOT_TASK_ID \
    main.py
```

## Scale up

If you would like to scale up the training, you can simply change the resource requirements, and SkyPilot's built-in environment variables will be set automatically.

For example, the following command will spawn 4 nodes with 4 L4 GPUs each:

```
sky launch -c train train.yaml --num-nodes 4 --gpus L4:4 --cpus 8+
```

We also increase `--cpus` to 8+ to avoid the CPU becoming a performance bottleneck.
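
If you prefer editing the task file instead of passing CLI overrides, here is a sketch of the equivalent resource settings in `train.yaml` (the `--num-nodes`, `--gpus`, and `--cpus` flags override these fields; the rest of the file stays unchanged):

```yaml
resources:
  cpus: 8+            # more CPUs for data loading so the GPUs are not starved
  accelerators: L4:4  # 4 L4 GPUs per node

num_nodes: 4          # 4 nodes in total
```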

`examples/distributed-pytorch/train-rdzv.yaml`:

```yaml
name: minGPT-ddp-rdzv

resources:
  cpus: 4+
  accelerators: L4

num_nodes: 2

setup: |
  git clone --depth 1 https://github.com/pytorch/examples || true
  cd examples
  git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
  # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
  uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113

run: |
  cd examples/mingpt
  export LOGLEVEL=INFO
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Starting distributed training, head node: $MASTER_ADDR"
  # The c10d rendezvous assigns node ranks automatically, so no --node_rank is
  # needed; --rdzv_id just needs to be an ID shared by all nodes of the job.
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:29500 \
    --rdzv_id $SKYPILOT_TASK_ID \
    main.py
```

`examples/distributed-pytorch/train.yaml`:

```yaml
name: minGPT-ddp

resources:
  cpus: 4+
  accelerators: L4

num_nodes: 2

setup: |
  git clone --depth 1 https://github.com/pytorch/examples || true
  cd examples
  git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
  # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
  uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113

run: |
  cd examples/mingpt
  export LOGLEVEL=INFO
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Starting distributed training, head node: $MASTER_ADDR"
  # Plain torchrun requires the rank of each node to be set explicitly.
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --master_port=8008 \
    --node_rank=${SKYPILOT_NODE_RANK} \
    main.py
```