Commit 2dd3af5

update

Michaelvll committed Dec 12, 2024
1 parent 179520e

Showing 3 changed files with 60 additions and 5 deletions.
53 changes: 50 additions & 3 deletions examples/distributed-pytorch/README.md
This example demonstrates how to run distributed training with PyTorch using SkyPilot.

**The example is based on [PyTorch's official minGPT example](https://github.com/pytorch/examples/tree/main/distributed/minGPT-ddp)**

## Overview
There are two ways to run distributed training with PyTorch:

1. Using normal `torchrun`
2. Using the `rdzv` backend

The main difference between the two for fixed-size distributed training is that the `rdzv` backend automatically handles the rank assignment for each node, while plain `torchrun` requires the rank to be set manually.

SkyPilot provides built-in environment variables that make it easy to start distributed training.
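
For reference, the SkyPilot environment variables used throughout this example are `SKYPILOT_NODE_IPS` (newline-separated IPs of all nodes in the cluster), `SKYPILOT_NODE_RANK`, `SKYPILOT_NUM_NODES`, and `SKYPILOT_NUM_GPUS_PER_NODE`. A minimal `run` section sketch that only inspects them:

```yaml
run: |
  # The first line of SKYPILOT_NODE_IPS is the head node's IP.
  echo "Head node: $(echo "$SKYPILOT_NODE_IPS" | head -n1)"
  # Integer rank of this node within the cluster (0 on the head node).
  echo "Node rank: $SKYPILOT_NODE_RANK"
  # Cluster shape, used to size torchrun in the examples below.
  echo "$SKYPILOT_NUM_NODES nodes, $SKYPILOT_NUM_GPUS_PER_NODE GPUs per node"
```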

### Using normal `torchrun`


The following command spawns 2 nodes with 2 L4 GPUs each.

`sky launch -c train train.yaml`

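In [train.yaml](./train.yaml), we use `torchrun` to launch the training and set the arguments for distributed training using environment variables provided by SkyPilot. A sketch of its `run` section, with the master address/port and node rank following the standard static-rank `torchrun` setup (see the file diff below for the exact contents):

```yaml
run: |
  cd examples/mingpt
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Starting distributed training, head node: $MASTER_ADDR"
  torchrun \
  --nnodes=$SKYPILOT_NUM_NODES \
  --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
  --master_addr=$MASTER_ADDR \
  --master_port=8008 \
  --node_rank=$SKYPILOT_NODE_RANK \
  main.py
```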
### Using the `rdzv` backend

`rdzv` (rendezvous) is an alternative backend for distributed training:

```bash
sky launch -c train-rdzv train-rdzv.yaml
```

In [train-rdzv.yaml](./train-rdzv.yaml), we use `torchrun` to launch the training and set the arguments for distributed training using environment variables provided by SkyPilot.

```yaml
run: |
  cd examples/mingpt
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Starting distributed training, head node: $MASTER_ADDR"
  torchrun \
  --nnodes=$SKYPILOT_NUM_NODES \
  --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:29500 \
  --rdzv_id $SKYPILOT_TASK_ID \
  main.py
```
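
Every node must pass the same `--rdzv_id` to join the same rendezvous. `$SKYPILOT_TASK_ID` is shared by all nodes of a task and unique per run, so the nodes find each other and a relaunched job does not collide with a stale rendezvous.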


## Scale up

If you would like to scale up the training, simply change the resource requirements, and SkyPilot's built-in environment variables will be set accordingly.

For example, the following command spawns 4 nodes with 4 L4 GPUs each.

`sky launch -c train train.yaml --num-nodes 4 --gpus L4:4 --cpus 8+`

We also increase `--cpus` to 8+ to avoid the training being bottlenecked by the CPU.
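
If you prefer, the same resources can be declared in the task YAML instead of overridden on the command line. A minimal sketch using SkyPilot's standard `num_nodes` and `resources` fields:

```yaml
num_nodes: 4

resources:
  accelerators: L4:4
  cpus: 8+
```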

8 changes: 6 additions & 2 deletions examples/distributed-pytorch/train-rdzv.yaml
```yaml
run: |
  cd examples/mingpt
  export LOGLEVEL=INFO
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Starting distributed training, head node: $MASTER_ADDR"
  torchrun \
  --nnodes=$SKYPILOT_NUM_NODES \
  --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:29500 \
  --rdzv_id $SKYPILOT_TASK_ID \
  main.py
```
4 changes: 4 additions & 0 deletions examples/distributed-pytorch/train.yaml
```yaml
run: |
  cd examples/mingpt
  export LOGLEVEL=INFO
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Starting distributed training, head node: $MASTER_ADDR"
  # Plain torchrun needs the master address/port and this node's rank set
  # explicitly; the flags after --nproc_per_node are collapsed in this diff
  # view and reconstructed here per the static-rank setup described above.
  torchrun \
  --nnodes=$SKYPILOT_NUM_NODES \
  --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
  --master_addr=$MASTER_ADDR \
  --master_port=8008 \
  --node_rank=$SKYPILOT_NODE_RANK \
  main.py
```
