[LLM] llm.c training for GPT 2 #3611

Merged
merged 32 commits on May 31, 2024
Changes from 23 commits

Commits (32)
- 374f0f2 add gpt-2 example (Michaelvll, May 28, 2024)
- 79323f7 Use ubuntu for GCP (Michaelvll, May 29, 2024)
- 03623ee fix ncl (Michaelvll, May 29, 2024)
- 3636ea6 Fix GPT-2 (Michaelvll, May 29, 2024)
- 1694ecd add train and data (Michaelvll, May 29, 2024)
- 2c80dcb use 8 gpus (Michaelvll, May 29, 2024)
- 1bef798 revert gcp change (Michaelvll, May 29, 2024)
- 9282873 update readme (Michaelvll, May 29, 2024)
- 0ee942c Add GCP image (Michaelvll, May 29, 2024)
- 5af0d93 make file_mounts more general (Michaelvll, May 29, 2024)
- 71bcdd0 avoid any_of (Michaelvll, May 29, 2024)
- 488347f change back to use ubuntu image with wait for GPU (Michaelvll, May 30, 2024)
- 8ec06a8 Merge branch 'gpt-2' of https://github.com/skypilot-org/skypilot into… (Michaelvll, May 30, 2024)
- 2e5bacf wait cuda installation (Michaelvll, May 30, 2024)
- c070da0 Add retry for file mount and use env for bucket name (Michaelvll, May 30, 2024)
- 87d2a3c revert retries (Michaelvll, May 30, 2024)
- d6e9554 update the image (Michaelvll, May 30, 2024)
- 0c2d799 Merge branch 'master' of https://github.com/skypilot-org/skypilot int… (Michaelvll, May 30, 2024)
- ef26ecd change to docker for better dependency (Michaelvll, May 30, 2024)
- 2b0a085 revert changes in gcp template (Michaelvll, May 30, 2024)
- aa8ecfe avoid using docker on lambda (Michaelvll, May 30, 2024)
- 265e43c Add single GPU (Michaelvll, May 30, 2024)
- 598dca5 Elaborate readme (Michaelvll, May 30, 2024)
- 3056c2c Update llm/gpt-2/README.md (Michaelvll, May 30, 2024)
- 815d23c fix (Michaelvll, May 30, 2024)
- faf63d8 Merge branch 'gpt-2' of https://github.com/skypilot-org/skypilot into… (Michaelvll, May 30, 2024)
- 4c44935 address comments (Michaelvll, May 31, 2024)
- 3b7312e Fix data fetching (Michaelvll, May 31, 2024)
- b6566d7 Add visualization (Michaelvll, May 31, 2024)
- bea72d5 update (Michaelvll, May 31, 2024)
- 8887435 reduce cpu cost (Michaelvll, May 31, 2024)
- 7609990 update loss curve (Michaelvll, May 31, 2024)
58 changes: 58 additions & 0 deletions llm/gpt-2/README.md
@@ -0,0 +1,58 @@
# Run GPT-2 in llm.c on any cloud with SkyPilot

This is a reproducible package of llm.c's GPT-2 (124M) training by @karpathy (https://github.com/karpathy/llm.c/discussions/481).
With SkyPilot, you can run GPT-2 (124M) training on any cloud.

## Prerequisites

1. Install [SkyPilot](https://github.com/skypilot-org/skypilot):
```bash
pip install skypilot-nightly
```
2. Enable clouds for SkyPilot:
```bash
sky check
```
Please check the instructions for enabling clouds at [SkyPilot doc](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html).
3. Download the YAMLs in this directory for data processing and training:
```bash
wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/gpt-2/gpt2-data.yaml
wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/gpt-2/gpt2-train.yaml
```

## Data processing
Collaborator:
nit: I like the fact that we have different YAMLs for pre-processing and training, but one concern is that it may alienate lambda-only, azure-only, or fluidstack-only users, since they won't have any cloud object store access to write preprocessed data to.

If it's not too complicated, can we add a one-shot YAML that does all pre-processing and training in a single YAML?

Collaborator (Author):
Good point! Just separated it into two different sections: one for a combined YAML and another for the pipeline. Wdyt?


Run the following command to process the training data on a CPU VM and store it in a cloud bucket for future use (replace `your-bucket-name` with your bucket name):
```bash
sky launch -c gpt2-data gpt2-data.yaml --env BUCKET_NAME=your-bucket-name
```
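While the data job runs, you can stream its logs; once it finishes, the tokenized data lives in the bucket, so the CPU VM can be torn down. A minimal sketch using standard SkyPilot commands:
```bash
# Stream the logs of the data-processing job.
sky logs gpt2-data

# After it completes, tear down the CPU VM; the processed tokens remain in the bucket.
sky down gpt2-data
```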


## Training

After the data is processed, you can then train the model on a GPU VM with 8 A100 GPUs (replace `your-bucket-name` with your bucket name):

```bash
sky launch -c gpt2-train --detach-setup gpt2-train.yaml --env BUCKET_NAME=your-bucket-name
```

Collaborator:
Since cost + time seems like a big motivation behind this work ("Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20"), should we mention that here? Perhaps we can show the optimizer output?

Collaborator (Author):
Good point. Added the comparison in the sentence. How does it look to you?

Or, you can train the model with a single A100 by adding `--gpus A100`:
```bash
sky launch -c gpt2-train --detach-setup gpt2-train.yaml --gpus A100 --env BUCKET_NAME=your-bucket-name
```
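Once the training cluster is up, you can monitor progress and, after training completes, copy the logs and checkpoints back. This is a rough sketch; `log124M` is the output directory set in `gpt2-train.yaml`, and the local destination path is arbitrary:
```bash
# Stream the training logs.
sky logs gpt2-train

# Copy the output directory (logs and checkpoints) from the cluster to the
# local machine; SkyPilot clusters are reachable over SSH by their name.
rsync -Pavz gpt2-train:~/llm.c/log124M ./

# Tear down the GPU VM when done to stop incurring cost.
sky down gpt2-train
```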

Collaborator:
nit: if we have graphs similar to karpathy's, it will be nice to put them here :) No worries if not.

Collaborator (Author):
Added the loss curve for training. The eval figure requires an additional dependency, which I unfortunately did not install before the current training run.


## Run in a Pipeline

We can also combine the two steps into a single SkyPilot job and let SkyPilot handle the dependencies between them. Here is an example of how to do this (replace `your-bucket-name` with your bucket name):
```bash
cat gpt2-data.yaml > gpt2.yaml
echo "---" >> gpt2.yaml
cat gpt2-train.yaml >> gpt2.yaml
sky jobs launch -n gpt2 gpt2.yaml --env BUCKET_NAME=your-bucket-name
```

SkyPilot will first download and process the dataset on a CPU VM and store the
processed data in a GCS bucket. Then, it will launch a GPT-2 training job on a
GPU VM. The training job will train GPT-2 (124M) on the processed data.
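For reference, the concatenated `gpt2.yaml` produced above is simply a multi-document YAML: the data-processing task, a `---` separator, and the training task, which `sky jobs launch` runs as a two-stage pipeline. A rough sketch of its shape (fields abbreviated; the full contents are the two YAMLs below):
```yaml
# Stage 1: data processing (contents of gpt2-data.yaml).
name: gpt2-data
resources:
  cpus: 64+
# ... envs, file_mounts, setup, run ...
---
# Stage 2: training (contents of gpt2-train.yaml), started after stage 1 succeeds.
name: train
resources:
  accelerators: A100:8
# ... envs, file_mounts, setup, run ...
```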

26 changes: 26 additions & 0 deletions llm/gpt-2/gpt2-data.yaml
@@ -0,0 +1,26 @@
name: gpt2-data

envs:
  BUCKET_NAME: # Fill in your bucket name

resources:
  cpus: 64+

file_mounts:
  /cache:
    name: $BUCKET_NAME
    mode: MOUNT

setup: |
  pip install tqdm tiktoken requests datasets
  # tokenize the FineWeb dataset 10B tokens sample (takes ~1 hour, get lunch?)
  # writes ~19GB of raw GPT-2 tokens to dev/data/fineweb10B
  # and ~46GB in ~/.cache/huggingface/datasets/HuggingFaceFW___fineweb
  git clone https://github.com/karpathy/llm.c.git || true

run: |
  cd llm.c
  python dev/data/fineweb.py --version 10B

  rsync -Pavz --exclude "datasets/downloads/" ~/.cache/huggingface /cache/
  rsync -Pavz dev/data/fineweb10B /cache/
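After this task completes, the tokenized shards should be visible at the top level of the bucket. A quick sanity check, assuming the default GCS-backed bucket mentioned in the README (adjust the tool for other object stores):
```bash
# List the uploaded FineWeb token shards.
gsutil ls gs://your-bucket-name/fineweb10B/
```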
89 changes: 89 additions & 0 deletions llm/gpt-2/gpt2-train.yaml
@@ -0,0 +1,89 @@
name: train

envs:
  BUCKET_NAME: # Fill in your bucket name

resources:
  accelerators: A100:8
  # Use a docker image for a recent g++ to enable the compilation of llm.c.
  image_id: docker:nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
  any_of:
    # Avoid using a docker image on Lambda, since Docker is not supported
    # there yet; the base image works.
    - cloud: lambda
      image_id: null
    - cloud: aws
    - cloud: gcp
    - cloud: azure
    - cloud: fluidstack

file_mounts:
  ~/.cache/huggingface: gs://$BUCKET_NAME
    # name: $BUCKET_NAME
    # mode: COPY

setup: |
  cd ~
  pip install tqdm tiktoken requests datasets

  # Install cudnn so we can use FlashAttention and run fast (optional).
  # https://developer.nvidia.com/cudnn-downloads
  # Assumes CUDA 12 (check with `nvcc --version`) on Linux x86_64 Ubuntu 22.04.
  if [ -f ./CUDNN_INSTALLED ]; then
    echo "cudnn already installed"
  else
    system=$(lsb_release -si | tr '[:upper:]' '[:lower:]')
    # Get the OS version and remove the dot.
    version=$(lsb_release -sr | tr -d .)
    export system_version="${system}${version}"
    wget https://developer.download.nvidia.com/compute/cudnn/9.1.1/local_installers/cudnn-local-repo-${system_version}-9.1.1_1.0-1_amd64.deb -O cudnn-installer.deb
    sudo dpkg -i cudnn-installer.deb
    sudo cp /var/cudnn-local-repo-${system_version}-9.1.1/cudnn-*-keyring.gpg /usr/share/keyrings/
    # Tolerate problematic apt sources (e.g., a stale kubernetes.list).
    sudo apt-get update --allow-releaseinfo-change || true

    sudo apt-get -y install cudnn-cuda-12

    touch ./CUDNN_INSTALLED
  fi

  # "install" cudnn-frontend to ~/
  sudo apt -y install git
  git clone https://github.com/NVIDIA/cudnn-frontend.git || true

  # Install MPI (optional, if you intend to use multiple GPUs).
  # SkyPilot does not install MPI by default, as it requires NCCL, which needs
  # to be installed manually.
  sudo apt install -y openmpi-bin openmpi-doc libopenmpi-dev
  # Install NCCL.
  pip install nvidia-nccl-cu12
  export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/nccl2/lib
  export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/nccl2/include

  git clone https://github.com/karpathy/llm.c.git || true
  cd llm.c
  ln -s ~/.cache/huggingface/fineweb10B dev/data/
  # Compile llm.c (mixed precision, with cuDNN flash-attention).
  # The first compilation takes ~1 minute, mostly due to cuDNN.
  make train_gpt2cu USE_CUDNN=1


run: |
  cd ~/llm.c
  # train on multiple GPUs
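  # Flag reference (per the llm.c GPT-2 (124M) write-up, karpathy/llm.c discussion #481):
  #   -i/-j    train/val token shard patterns      -o        output/log directory
  #   -e d12   initialize a 12-layer (124M) GPT-2 from scratch
  #   -b 64    micro-batch size                    -t 1024   max sequence length
  #   -d 524288  total batch size per step, in tokens (~0.5M)
  #   -r 1     recompute GeLU activations to save memory
  #   -z 1     ZeRO-1 optimizer state sharding across GPUs
  #   -c 0.1   weight decay                        -l 0.0006 max learning rate
  #   -q 0.0   decay LR to 0 over training         -u 700    LR warmup iterations
  #   -n 5000  checkpoint every 5000 steps         -v 250    eval val loss every 250 steps
  #   -s 20000 sample tokens every 20000 steps     -h 1      evaluate HellaSwag accuracy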
  mpirun -np $SKYPILOT_NUM_GPUS_PER_NODE --allow-run-as-root ./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 -s 20000 \
    -h 1
6 changes: 4 additions & 2 deletions sky/utils/command_runner.pyi
@@ -101,7 +101,8 @@ class CommandRunner:
     *,
     up: bool,
     log_path: str = ...,
-    stream_logs: bool = ...) -> None:
+    stream_logs: bool = ...,
+    max_retry: int = 1) -> None:
     ...

@classmethod
@@ -191,5 +192,6 @@ class SSHCommandRunner(CommandRunner):
     *,
     up: bool,
     log_path: str = ...,
-    stream_logs: bool = ...) -> None:
+    stream_logs: bool = ...,
+    max_retry: int = 1) -> None:
     ...
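For context, the new `max_retry` keyword lets callers retry transient rsync failures (used for file mounts in this PR). A minimal sketch of a caller, assuming the positional `source`/`target` arguments of `rsync` and an already-constructed runner; only the keyword parameters are taken from the stub above:
```python
from sky.utils.command_runner import SSHCommandRunner


def upload_hf_cache(runner: SSHCommandRunner) -> None:
    """Upload the local HuggingFace cache to the VM, retrying flaky transfers."""
    runner.rsync(
        '~/.cache/huggingface/',  # source path (assumed positional argument)
        '~/.cache/huggingface/',  # target path on the VM (assumed positional argument)
        up=True,                  # upload from local to remote
        log_path='/tmp/file_mount.log',
        stream_logs=False,
        max_retry=3,              # new parameter introduced in this PR
    )
```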