[docs] Add newer examples for AI tutorial and distributed training (#4509)

* Update tutorial and distributed training examples.

* Add examples link

* add rdzv
romilbhardwaj authored Dec 30, 2024
1 parent 13501e2 commit 3715be2
Showing 2 changed files with 49 additions and 47 deletions.
48 changes: 24 additions & 24 deletions docs/source/getting-started/tutorial.rst
@@ -2,19 +2,20 @@

Tutorial: AI Training
======================
-This example uses SkyPilot to train a Transformer-based language model from HuggingFace.
+This example uses SkyPilot to train a GPT-like model (inspired by Karpathy's `minGPT <https://github.com/karpathy/minGPT>`_) with Distributed Data Parallel (DDP) in PyTorch.

-First, define a :ref:`task YAML <yaml-spec>` with the resource requirements, the setup commands,
+We define a :ref:`task YAML <yaml-spec>` with the resource requirements, the setup commands,
and the commands to run:

.. code-block:: yaml

-  # dnn.yaml
+  # train.yaml

-  name: huggingface
+  name: minGPT-ddp

  resources:
-    accelerators: V100:4
+    cpus: 4+
+    accelerators: L4:4  # Or A100:8, H100:8

  # Optional: upload a working directory to remote ~/sky_workdir.
  # Commands in "setup" and "run" will be executed under it.
@@ -30,38 +31,37 @@ and the commands to run:
  #   ~/.netrc: ~/.netrc

  setup: |
-    set -e  # Exit if any command failed.
-    git clone https://github.com/huggingface/transformers/ || true
-    cd transformers
-    pip install .
-    cd examples/pytorch/text-classification
-    pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
+    git clone --depth 1 https://github.com/pytorch/examples || true
+    cd examples
+    git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
+    # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
+    uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113

  run: |
-    set -e  # Exit if any command failed.
-    cd transformers/examples/pytorch/text-classification
-    python run_glue.py \
-      --model_name_or_path bert-base-cased \
-      --dataset_name imdb \
-      --do_train \
-      --max_seq_length 128 \
-      --per_device_train_batch_size 32 \
-      --learning_rate 2e-5 \
-      --max_steps 50 \
-      --output_dir /tmp/imdb/ --overwrite_output_dir \
-      --fp16
+    cd examples/mingpt
+    export LOGLEVEL=INFO
+    echo "Starting minGPT-ddp training"
+    torchrun \
+      --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
+      main.py

.. tip::

   In the YAML, the ``workdir`` and ``file_mounts`` fields are commented out. To
   learn about how to use them to mount local dirs/files or object store buckets
   (S3, GCS, R2) into your cluster, see :ref:`sync-code-artifacts`.

+.. tip::
+
+   The ``SKYPILOT_NUM_GPUS_PER_NODE`` environment variable is automatically set by SkyPilot to the number of GPUs per node. See :ref:`env-vars` for more.

Then, launch training:

.. code-block:: console

-  $ sky launch -c lm-cluster dnn.yaml
+  $ sky launch -c mingpt train.yaml

This will provision the cheapest cluster with the required resources, execute the setup
commands, then execute the run commands.
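
After ``sky launch`` returns, a typical follow-up (standard SkyPilot CLI usage, not shown in this diff) is to stream the job's output and tear the cluster down when you are done; the cluster name ``mingpt`` comes from the ``-c`` flag above:

  $ sky logs mingpt    # Stream the output of the latest job on the cluster.
  $ sky status         # List clusters and their states.
  $ sky down mingpt    # Terminate the cluster when finished.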
48 changes: 25 additions & 23 deletions docs/source/running-jobs/distributed-jobs.rst
@@ -6,39 +6,40 @@ Distributed Multi-Node Jobs

SkyPilot supports multi-node cluster
provisioning and distributed execution on many nodes.

-For example, here is a simple PyTorch Distributed training example:
+For example, here is a simple example to train a GPT-like model (inspired by Karpathy's `minGPT <https://github.com/karpathy/minGPT>`_) across 2 nodes with Distributed Data Parallel (DDP) in PyTorch.

.. code-block:: yaml
-  :emphasize-lines: 6-6,21-21,23-26
+  :emphasize-lines: 6,19,23-24,26

-  name: resnet-distributed-app
+  name: minGPT-ddp

  resources:
    accelerators: A100:8

  num_nodes: 2

  setup: |
-    pip3 install --upgrade pip
-    git clone https://github.com/michaelzhiluo/pytorch-distributed-resnet
-    cd pytorch-distributed-resnet
-    # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
-    pip3 install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
-    mkdir -p data && mkdir -p saved_models && cd data && \
-      wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
-    tar -xvzf cifar-10-python.tar.gz
+    git clone --depth 1 https://github.com/pytorch/examples || true
+    cd examples
+    git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
+    # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
+    uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113

  run: |
-    cd pytorch-distributed-resnet
-    MASTER_ADDR=`echo "$SKYPILOT_NODE_IPS" | head -n1`
+    cd examples/mingpt
+    export LOGLEVEL=INFO
+    MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
+    echo "Starting distributed training, head node: $MASTER_ADDR"

    torchrun \
      --nnodes=$SKYPILOT_NUM_NODES \
-      --master_addr=$MASTER_ADDR \
      --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
-      --node_rank=$SKYPILOT_NODE_RANK \
-      --master_port=12375 \
-      resnet_ddp.py --num_epochs 20
+      --master_addr=$MASTER_ADDR \
+      --node_rank=${SKYPILOT_NODE_RANK} \
+      --master_port=8008 \
+      main.py
In the above,

@@ -55,6 +56,7 @@ In the above,

ulimit -n 65535

+You can find more `distributed training examples <https://github.com/skypilot-org/skypilot/tree/master/examples/distributed-pytorch>`_ (including `using the rdzv backend for PyTorch <https://github.com/skypilot-org/skypilot/blob/master/examples/distributed-pytorch/train-rdzv.yaml>`_) in our `GitHub repository <https://github.com/skypilot-org/skypilot/tree/master/examples>`_.
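
For orientation, here is a minimal sketch of what a rendezvous-based ``run`` section can look like; the port (29500) and the ``--rdzv_id`` value are illustrative assumptions, and the linked ``train-rdzv.yaml`` is the maintained version:

  run: |
    cd examples/mingpt
    # Use the head node's IP as the rendezvous endpoint (29500 is an arbitrary free port).
    MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
    torchrun \
      --nnodes=$SKYPILOT_NUM_NODES \
      --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
      --rdzv_backend=c10d \
      --rdzv_endpoint=$MASTER_ADDR:29500 \
      --rdzv_id=mingpt-run \
      main.py  # No --node_rank needed; the c10d rendezvous assigns ranks.

With the rendezvous backend, ranks are negotiated at startup rather than passed in via ``--node_rank``, which is also what torchrun's elastic training features build on.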

Environment variables
-----------------------------------------
