[Example] PyTorch distributed training with minGPT (#4464)
* Add example for distributed pytorch

* update

* Update examples/distributed-pytorch/README.md

Co-authored-by: Romil Bhardwaj <[email protected]>

* Fix

---------

Co-authored-by: Romil Bhardwaj <[email protected]>
Michaelvll and romilbhardwaj authored Dec 18, 2024
1 parent baf028d commit ce1cb83
Showing 3 changed files with 139 additions and 0 deletions.
81 changes: 81 additions & 0 deletions examples/distributed-pytorch/README.md
@@ -0,0 +1,81 @@
# Distributed Training with PyTorch

This example demonstrates how to run distributed training with PyTorch using SkyPilot.

**This example is based on [PyTorch's official minGPT example](https://github.com/pytorch/examples/tree/main/distributed/minGPT-ddp).**


## Overview

There are two ways to run distributed training with PyTorch:

1. Using normal `torchrun`
2. Using the `rdzv` backend

The main difference between the two for fixed-size distributed training is that the `rdzv` (rendezvous) backend automatically assigns a rank to each node, while plain `torchrun` requires the rank to be set manually.

SkyPilot offers convenient built-in environment variables to help you start distributed training easily, as illustrated below.
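
For instance, on the second node of a 2-node cluster with 1 GPU per node (as launched below), these variables might look as follows. The values are illustrative; the actual IPs depend on your cloud and region:

```bash
# Illustrative values on the second node (rank 1) of a 2-node cluster;
# actual IPs depend on your cloud and region.
echo "$SKYPILOT_NODE_IPS"           # 10.0.0.4   <- head node, first line
                                    # 10.0.0.5   <- this node (newline-separated)
echo "$SKYPILOT_NODE_RANK"          # 1
echo "$SKYPILOT_NUM_NODES"          # 2
echo "$SKYPILOT_NUM_GPUS_PER_NODE"  # 1
```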

### Using normal `torchrun`


The following command spawns 2 nodes, each with 1 L4 GPU:
```bash
sky launch -c train train.yaml
```

In [train.yaml](./train.yaml), we use `torchrun` to launch the training and set the arguments for distributed training using [environment variables](https://docs.skypilot.co/en/latest/running-jobs/environment-variables.html#skypilot-environment-variables) provided by SkyPilot.

```yaml
run: |
  cd examples/mingpt
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --master_port=8008 \
    --node_rank=${SKYPILOT_NODE_RANK} \
    main.py
```
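
Once the job is launched, you can stream the training logs from your local machine; the cluster name `train` matches the `-c` flag above:

```bash
# Tail the logs of the most recent job on the cluster named "train".
sky logs train
```
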
### Using `rdzv` backend

`rdzv` is an alternative backend for distributed training:

```bash
sky launch -c train-rdzv train-rdzv.yaml
```

In [train-rdzv.yaml](./train-rdzv.yaml), we again launch the training with `torchrun`, but let the `c10d` rendezvous backend coordinate the nodes instead of passing an explicit rank, still using the [environment variables](https://docs.skypilot.co/en/latest/running-jobs/environment-variables.html#skypilot-environment-variables) provided by SkyPilot.

```yaml
run: |
  cd examples/mingpt
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Starting distributed training, head node: $MASTER_ADDR"
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:29500 \
    --rdzv_id $SKYPILOT_TASK_ID \
    main.py
```
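
If the nodes fail to rendezvous, a quick reachability check from a worker node can help narrow things down. This is a debugging sketch using bash's built-in `/dev/tcp`; port 29500 matches the `--rdzv_endpoint` above, and the check only succeeds once rank 0's `c10d` store is listening:

```bash
# Debugging sketch: verify from a worker node that the head node's
# rendezvous port (29500, as in --rdzv_endpoint) is reachable.
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
if timeout 2 bash -c "exec 3<>/dev/tcp/${MASTER_ADDR}/29500" 2>/dev/null; then
  echo "Rendezvous endpoint ${MASTER_ADDR}:29500 is reachable"
else
  echo "Cannot reach ${MASTER_ADDR}:29500 -- has rank 0 started torchrun yet?"
fi
```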


## Scale up

If you would like to scale up the training, simply change the resource requirements; SkyPilot's built-in environment variables will be updated automatically.

For example, the following command will spawn 4 nodes with 4 L4 GPUs each.

```bash
sky launch -c train train.yaml --num-nodes 4 --gpus L4:4 --cpus 8+
```
We also increase `--cpus` to 8+ so that the CPU does not become a performance bottleneck.
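
With this change, the same `run` section launches 4 × 4 = 16 worker processes in total; the environment variables reflect the new shape automatically (illustrative values):

```bash
# After `--num-nodes 4 --gpus L4:4`, inside the run section:
echo "$SKYPILOT_NUM_NODES"          # 4
echo "$SKYPILOT_NUM_GPUS_PER_NODE"  # 4
# torchrun therefore starts 4 x 4 = 16 training processes across the cluster.
```
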
29 changes: 29 additions & 0 deletions examples/distributed-pytorch/train-rdzv.yaml
@@ -0,0 +1,29 @@
name: minGPT-ddp-rdzv

resources:
  cpus: 4+
  accelerators: L4

num_nodes: 2

setup: |
  git clone --depth 1 https://github.com/pytorch/examples || true
  cd examples
  git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
  # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
  uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113

run: |
  cd examples/mingpt
  export LOGLEVEL=INFO
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Starting distributed training, head node: $MASTER_ADDR"
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:29500 \
    --rdzv_id $SKYPILOT_TASK_ID \
    main.py
29 changes: 29 additions & 0 deletions examples/distributed-pytorch/train.yaml
@@ -0,0 +1,29 @@
name: minGPT-ddp

resources:
  cpus: 4+
  accelerators: L4

num_nodes: 2

setup: |
  git clone --depth 1 https://github.com/pytorch/examples || true
  cd examples
  git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
  # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
  uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113

run: |
  cd examples/mingpt
  export LOGLEVEL=INFO
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Starting distributed training, head node: $MASTER_ADDR"
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --master_port=8008 \
    --node_rank=${SKYPILOT_NODE_RANK} \
    main.py
