From 3715be2865cd145f387996b7fc8b32fada61c431 Mon Sep 17 00:00:00 2001
From: Romil Bhardwaj
Date: Mon, 30 Dec 2024 14:45:40 -0800
Subject: [PATCH] [docs] Add newer examples for AI tutorial and distributed training (#4509)

* Update tutorial and distributed training examples.

* Add examples link

* add rdvz
---
 docs/source/getting-started/tutorial.rst      | 48 +++++++++----------
 docs/source/running-jobs/distributed-jobs.rst | 48 ++++++++++---------
 2 files changed, 49 insertions(+), 47 deletions(-)

diff --git a/docs/source/getting-started/tutorial.rst b/docs/source/getting-started/tutorial.rst
index 175f1391a6d..9b067be2876 100644
--- a/docs/source/getting-started/tutorial.rst
+++ b/docs/source/getting-started/tutorial.rst
@@ -2,19 +2,20 @@
 
 Tutorial: AI Training
 ======================
 
-This example uses SkyPilot to train a Transformer-based language model from HuggingFace.
+This example uses SkyPilot to train a GPT-like model (inspired by Karpathy's `minGPT <https://github.com/karpathy/minGPT>`_) with Distributed Data Parallel (DDP) in PyTorch.
 
-First, define a :ref:`task YAML <yaml-spec>` with the resource requirements, the setup commands,
+We define a :ref:`task YAML <yaml-spec>` with the resource requirements, the setup commands,
 and the commands to run:
 
 .. code-block:: yaml
 
-  # dnn.yaml
+  # train.yaml
 
-  name: huggingface
+  name: minGPT-ddp
 
   resources:
-    accelerators: V100:4
+    cpus: 4+
+    accelerators: L4:4  # Or A100:8, H100:8
 
   # Optional: upload a working directory to remote ~/sky_workdir.
   # Commands in "setup" and "run" will be executed under it.
   #
   # workdir: .
 
   # Optional: upload local files.
   # Format:
   #   /remote/path: /local/path
   #
   # file_mounts:
   #   ~/.vimrc: ~/.vimrc
   #   ~/.netrc: ~/.netrc
 
   setup: |
-    set -e  # Exit if any command failed.
-    git clone https://github.com/huggingface/transformers/ || true
-    cd transformers
-    pip install .
-    cd examples/pytorch/text-classification
-    pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
+    git clone --depth 1 https://github.com/pytorch/examples || true
+    cd examples
+    git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
+    # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
+    uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113
 
   run: |
-    set -e  # Exit if any command failed.
-    cd transformers/examples/pytorch/text-classification
-    python run_glue.py \
-      --model_name_or_path bert-base-cased \
-      --dataset_name imdb \
-      --do_train \
-      --max_seq_length 128 \
-      --per_device_train_batch_size 32 \
-      --learning_rate 2e-5 \
-      --max_steps 50 \
-      --output_dir /tmp/imdb/ --overwrite_output_dir \
-      --fp16
+    cd examples/mingpt
+    export LOGLEVEL=INFO
+
+    echo "Starting minGPT-ddp training"
+
+    torchrun \
+    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
+    main.py
 
 .. tip::
 
     In the YAML, the ``workdir`` and ``file_mounts`` fields are commented out. To
     learn about how to use them to mount local dirs/files or object store buckets
     (S3, GCS, R2) into your cluster, see :ref:`sync-code-artifacts`.
 
+.. tip::
+
+    The ``SKYPILOT_NUM_GPUS_PER_NODE`` environment variable is automatically set by SkyPilot to the number of GPUs per node. See :ref:`env-vars` for more.
+
 Then, launch training:
 
 .. code-block:: console
 
-  $ sky launch -c lm-cluster dnn.yaml
+  $ sky launch -c mingpt train.yaml
 
 This will provision the cheapest cluster with the required resources, execute the setup commands, then execute the run commands.
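For readers new to DDP: on a single node, ``torchrun`` spawns one worker process per GPU and exports ``RANK``, ``LOCAL_RANK``, and ``WORLD_SIZE`` to each of them. Below is a minimal, illustrative sketch of that single-node pattern; the actual minGPT ``main.py`` differs in its model and training loop, and the file name here is hypothetical.

.. code-block:: python

   # Minimal single-node DDP sketch (illustrative only; not the actual minGPT main.py).
   # Launch with: torchrun --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE ddp_sketch.py
   import os

   import torch
   import torch.distributed as dist
   from torch.nn.parallel import DistributedDataParallel as DDP


   def main():
       # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
       dist.init_process_group(backend="nccl")
       local_rank = int(os.environ["LOCAL_RANK"])
       torch.cuda.set_device(local_rank)

       model = torch.nn.Linear(8, 8).cuda(local_rank)   # stand-in for the GPT model
       ddp_model = DDP(model, device_ids=[local_rank])  # syncs gradients across GPUs

       optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=3e-4)
       batch = torch.randn(4, 8).cuda(local_rank)       # stand-in for a data batch
       loss = ddp_model(batch).sum()
       loss.backward()                                  # gradient all-reduce happens here
       optimizer.step()

       dist.destroy_process_group()


   if __name__ == "__main__":
       main()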
diff --git a/docs/source/running-jobs/distributed-jobs.rst b/docs/source/running-jobs/distributed-jobs.rst
index f6c8cba9c9d..7c3421aa276 100644
--- a/docs/source/running-jobs/distributed-jobs.rst
+++ b/docs/source/running-jobs/distributed-jobs.rst
@@ -6,39 +6,40 @@ Distributed Multi-Node Jobs
 SkyPilot supports multi-node cluster provisioning and distributed execution on many nodes.
 
-For example, here is a simple PyTorch Distributed training example:
+For example, here is a simple example that trains a GPT-like model (inspired by Karpathy's `minGPT <https://github.com/karpathy/minGPT>`_) across 2 nodes with Distributed Data Parallel (DDP) in PyTorch:
 
 .. code-block:: yaml
-  :emphasize-lines: 6-6,21-21,23-26
+  :emphasize-lines: 6,19,23-24,26
 
-  name: resnet-distributed-app
+  name: minGPT-ddp
 
-  resources:
-    accelerators: A100:8
+  resources:
+    accelerators: A100:8
 
-  num_nodes: 2
+  num_nodes: 2
 
-  setup: |
-    pip3 install --upgrade pip
-    git clone https://github.com/michaelzhiluo/pytorch-distributed-resnet
-    cd pytorch-distributed-resnet
-    # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
-    pip3 install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
-    mkdir -p data && mkdir -p saved_models && cd data && \
-    wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
-    tar -xvzf cifar-10-python.tar.gz
+  setup: |
+    git clone --depth 1 https://github.com/pytorch/examples || true
+    cd examples
+    git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
+    # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
+    uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113
 
-  run: |
-    cd pytorch-distributed-resnet
+  run: |
+    cd examples/mingpt
+    export LOGLEVEL=INFO
+
+    MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
+    echo "Starting distributed training, head node: $MASTER_ADDR"
 
-    MASTER_ADDR=`echo "$SKYPILOT_NODE_IPS" | head -n1`
-    torchrun \
+    torchrun \
     --nnodes=$SKYPILOT_NUM_NODES \
-    --master_addr=$MASTER_ADDR \
     --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
-    --node_rank=$SKYPILOT_NODE_RANK \
-    --master_port=12375 \
-    resnet_ddp.py --num_epochs 20
+    --master_addr=$MASTER_ADDR \
+    --node_rank=${SKYPILOT_NODE_RANK} \
+    --master_port=8008 \
+    main.py
+
 In the above,
 
@@ -55,6 +56,7 @@ In the above,
 
     ulimit -n 65535
 
+You can find more `distributed training examples `_ (including `using the rdzv backend for PyTorch `_) in our `GitHub repository `_.
 
 Environment variables
 -----------------------------------------
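For reference, once ``torchrun`` has been started on every node with the flags in the ``run`` section above, each worker process can recover its place in the job from the environment. The sketch below is illustrative only (the real minGPT ``main.py`` handles this internally); it shows how the global rank, world size, and local GPU index relate to ``--nnodes``, ``--nproc_per_node``, and ``--node_rank``.

.. code-block:: python

   # Illustrative sketch of what each worker sees under the torchrun command above
   # (2 nodes x SKYPILOT_NUM_GPUS_PER_NODE processes). Not the actual minGPT main.py.
   import os

   import torch
   import torch.distributed as dist


   def report_topology():
       # torchrun derives these values from --nnodes/--nproc_per_node/--node_rank and
       # --master_addr/--master_port, then exports them to every worker process.
       dist.init_process_group(backend="nccl")
       global_rank = dist.get_rank()               # 0 .. world_size - 1, across all nodes
       world_size = dist.get_world_size()          # nnodes * nproc_per_node
       local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node
       torch.cuda.set_device(local_rank)
       print(f"worker {global_rank}/{world_size} on local GPU {local_rank}, "
             f"rendezvous at {os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}")
       dist.destroy_process_group()


   if __name__ == "__main__":
       report_topology()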