[docs] Add newer examples for AI tutorial and distributed training #4509

Merged (3 commits) on Dec 30, 2024
docs/source/getting-started/tutorial.rst: 24 additions & 24 deletions
@@ -2,19 +2,20 @@

Tutorial: AI Training
======================
This example uses SkyPilot to train a Transformer-based language model from HuggingFace.
This example uses SkyPilot to train a GPT-like model (inspired by Karpathy's `minGPT <https://github.com/karpathy/minGPT>`_) with Distributed Data Parallel (DDP) in PyTorch.

First, define a :ref:`task YAML <yaml-spec>` with the resource requirements, the setup commands,
We define a :ref:`task YAML <yaml-spec>` with the resource requirements, the setup commands,
and the commands to run:

.. code-block:: yaml

# dnn.yaml
# train.yaml

name: huggingface
name: minGPT-ddp

resources:
accelerators: V100:4
cpus: 4+
accelerators: L4:4 # Or A100:8, H100:8

# Optional: upload a working directory to remote ~/sky_workdir.
# Commands in "setup" and "run" will be executed under it.
@@ -30,38 +31,37 @@ and the commands to run:
# ~/.netrc: ~/.netrc

setup: |
set -e # Exit if any command failed.
git clone https://github.com/huggingface/transformers/ || true
cd transformers
pip install .
cd examples/pytorch/text-classification
pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
git clone --depth 1 https://github.com/pytorch/examples || true
cd examples
git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
# SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113

run: |
set -e # Exit if any command failed.
cd transformers/examples/pytorch/text-classification
python run_glue.py \
--model_name_or_path bert-base-cased \
--dataset_name imdb \
--do_train \
--max_seq_length 128 \
--per_device_train_batch_size 32 \
--learning_rate 2e-5 \
--max_steps 50 \
--output_dir /tmp/imdb/ --overwrite_output_dir \
--fp16
cd examples/mingpt
export LOGLEVEL=INFO

echo "Starting minGPT-ddp training"

torchrun \
--nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
main.py

.. tip::

In the YAML, the ``workdir`` and ``file_mounts`` fields are commented out. To
learn about how to use them to mount local dirs/files or object store buckets
(S3, GCS, R2) into your cluster, see :ref:`sync-code-artifacts`.

.. tip::

The ``SKYPILOT_NUM_GPUS_PER_NODE`` environment variable is automatically set by SkyPilot to the number of GPUs per node. See :ref:`env-vars` for more.

Then, launch training:

.. code-block:: console

$ sky launch -c lm-cluster dnn.yaml
$ sky launch -c mingpt train.yaml

This will provision the cheapest cluster with the required resources, execute the setup
commands, then execute the run commands.
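For orientation, the ``torchrun`` invocation above spawns one worker process per GPU; each worker joins a NCCL process group, and gradients are averaged across workers during ``backward()``. The sketch below illustrates that pattern with a placeholder model and dataset. It is not the actual minGPT ``main.py``; the file name, model, and hyperparameters are made up for illustration.

.. code-block:: python

    # ddp_sketch.py (hypothetical): launched as
    #   torchrun --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE ddp_sketch.py
    import os

    import torch
    import torch.distributed as dist
    import torch.nn.functional as F
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


    def main():
        # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Placeholder model and data standing in for minGPT and its dataset.
        model = DDP(torch.nn.Linear(128, 128).cuda(local_rank),
                    device_ids=[local_rank])
        dataset = TensorDataset(torch.randn(1024, 128), torch.randn(1024, 128))
        sampler = DistributedSampler(dataset)  # shards the data across workers
        loader = DataLoader(dataset, batch_size=32, sampler=sampler)
        optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

        for epoch in range(2):
            sampler.set_epoch(epoch)  # reshuffle the shards every epoch
            for x, y in loader:
                x, y = x.cuda(local_rank), y.cuda(local_rank)
                loss = F.mse_loss(model(x), y)
                optimizer.zero_grad()
                loss.backward()  # gradients are all-reduced across workers here
                optimizer.step()

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()

On a single node, ``torchrun`` performs its rendezvous on localhost by default, so ``--nproc_per_node`` is the only flag the run commands above need to pass.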
docs/source/running-jobs/distributed-jobs.rst: 25 additions & 23 deletions
@@ -6,39 +6,40 @@ Distributed Multi-Node Jobs
SkyPilot supports multi-node cluster
provisioning and distributed execution on many nodes.

For example, here is a simple PyTorch Distributed training example:
For example, here is a simple task that trains a GPT-like model (inspired by Karpathy's `minGPT <https://github.com/karpathy/minGPT>`_) across 2 nodes with Distributed Data Parallel (DDP) in PyTorch:

.. code-block:: yaml
:emphasize-lines: 6-6,21-21,23-26
:emphasize-lines: 6,19,23-24,26

name: resnet-distributed-app
name: minGPT-ddp

resources:
accelerators: A100:8
resources:
accelerators: A100:8

num_nodes: 2
num_nodes: 2

setup: |
pip3 install --upgrade pip
git clone https://github.com/michaelzhiluo/pytorch-distributed-resnet
cd pytorch-distributed-resnet
# SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
pip3 install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
mkdir -p data && mkdir -p saved_models && cd data && \
wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar -xvzf cifar-10-python.tar.gz
setup: |
git clone --depth 1 https://github.com/pytorch/examples || true
cd examples
git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
# SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113

run: |
cd pytorch-distributed-resnet
run: |
cd examples/mingpt
export LOGLEVEL=INFO

MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
echo "Starting distributed training, head node: $MASTER_ADDR"

MASTER_ADDR=`echo "$SKYPILOT_NODE_IPS" | head -n1`
torchrun \
torchrun \
--nnodes=$SKYPILOT_NUM_NODES \
--master_addr=$MASTER_ADDR \
--nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
--node_rank=$SKYPILOT_NODE_RANK \
--master_port=12375 \
resnet_ddp.py --num_epochs 20
--master_addr=$MASTER_ADDR \
--node_rank=${SKYPILOT_NODE_RANK} \
--master_port=8008 \
main.py


In the above,

@@ -55,6 +56,7 @@ In the above,

ulimit -n 65535
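To make the multi-node rank bookkeeping concrete: with ``num_nodes: 2`` and 8 GPUs per node, ``torchrun`` starts 8 worker processes on each node, and the 16 workers form one process group coordinated through ``$MASTER_ADDR:8008``. The sketch below is a hypothetical helper (assuming it is launched by the ``torchrun`` command above) showing how a worker can derive its global rank from the SkyPilot variables together with the ``LOCAL_RANK`` variable that ``torchrun`` sets per worker.

.. code-block:: python

    # rank_check.py (hypothetical): run on every worker via the torchrun command above.
    import os

    # Set by SkyPilot on every node (same names as in the run commands).
    node_rank = int(os.environ["SKYPILOT_NODE_RANK"])              # 0 or 1 here
    gpus_per_node = int(os.environ["SKYPILOT_NUM_GPUS_PER_NODE"])  # 8 here
    num_nodes = int(os.environ["SKYPILOT_NUM_NODES"])              # 2 here
    head_ip = os.environ["SKYPILOT_NODE_IPS"].splitlines()[0]      # same value as MASTER_ADDR

    # Set by torchrun for each worker process it spawns.
    local_rank = int(os.environ["LOCAL_RANK"])                     # 0..gpus_per_node-1

    global_rank = node_rank * gpus_per_node + local_rank           # 0..15
    world_size = num_nodes * gpus_per_node                         # 16

    print(f"worker {global_rank}/{world_size}: node {node_rank}, "
          f"GPU {local_rank}, rendezvous at {head_ip}:8008")

In practice ``torchrun`` also exports ``RANK`` and ``WORLD_SIZE`` directly, so training scripts rarely compute these themselves; the point is only to show how the SkyPilot variables map onto the usual PyTorch quantities.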

You can find more `distributed training examples <https://github.com/skypilot-org/skypilot/tree/master/examples/distributed-pytorch>`_ in our `GitHub repository <https://github.com/skypilot-org/skypilot/tree/master/examples>`_.
Collaborator: We can mention the keyword `rdzv` in this sentence as well, so that people can find it more easily?

Collaborator (author): Oh yeah, good point, added now.


Environment variables
-----------------------------------------