[LLM] llm.c training for GPT 2 #3611
Merged

Commits (32):
374f0f2  add gpt-2 example (Michaelvll)
79323f7  Use ubuntu for GCP (Michaelvll)
03623ee  fix ncl (Michaelvll)
3636ea6  Fix GPT-2 (Michaelvll)
1694ecd  add train and data (Michaelvll)
2c80dcb  use 8 gpus (Michaelvll)
1bef798  revert gcp change (Michaelvll)
9282873  update readme (Michaelvll)
0ee942c  Add GCP image (Michaelvll)
5af0d93  make file_mounts more general (Michaelvll)
71bcdd0  avoid any_of (Michaelvll)
488347f  change back to use ubuntu image with wait for GPU (Michaelvll)
8ec06a8  Merge branch 'gpt-2' of https://github.com/skypilot-org/skypilot into… (Michaelvll)
2e5bacf  wait cuda installation (Michaelvll)
c070da0  Add retry for file mount and use env for bucket name (Michaelvll)
87d2a3c  revert retries (Michaelvll)
d6e9554  update the image (Michaelvll)
0c2d799  Merge branch 'master' of https://github.com/skypilot-org/skypilot int… (Michaelvll)
ef26ecd  change to docker for better dependency (Michaelvll)
2b0a085  revert changes in gcp template (Michaelvll)
aa8ecfe  avoid using docker on lambda (Michaelvll)
265e43c  Add single GPU (Michaelvll)
598dca5  Elaborate readme (Michaelvll)
3056c2c  Update llm/gpt-2/README.md (Michaelvll)
815d23c  fix (Michaelvll)
faf63d8  Merge branch 'gpt-2' of https://github.com/skypilot-org/skypilot into… (Michaelvll)
4c44935  address comments (Michaelvll)
3b7312e  Fix data fetching (Michaelvll)
b6566d7  Add visualization (Michaelvll)
bea72d5  update (Michaelvll)
8887435  reduce cpu cost (Michaelvll)
7609990  update loss curve (Michaelvll)
llm/gpt-2/README.md (new file, +109 lines)

# Run GPT-2 in llm.c on any cloud with SkyPilot

This is a reproducible package of llm.c's GPT-2 (124M) training by @karpathy (https://github.com/karpathy/llm.c/discussions/481).
With SkyPilot, you can run GPT-2 (124M) training on any cloud.

## Prerequisites

1. Install [SkyPilot](https://github.com/skypilot-org/skypilot):
   ```bash
   pip install "skypilot-nightly[aws,gcp,azure,kubernetes,lambda,fluidstack]" # Choose the clouds you want to enable
   ```
2. Enable clouds for SkyPilot:
   ```bash
   sky check
   ```
   Please check the instructions for enabling clouds in the [SkyPilot docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html).

3. Download the YAML for starting the training:
   ```bash
   wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/gpt-2/gpt2.yaml
   ```
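
Optionally, you can check which of your enabled clouds offer the A100 GPUs used in this example, and at what price. This is only a convenience check and is not required for the rest of the guide:
```bash
# List clouds/regions that offer A100 GPUs, with hourly prices
sky show-gpus A100
```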

## Run GPT-2 training

Run the following command to start GPT-2 (124M) training on a GPU VM with 8 A100 GPUs:

```bash
sky launch -c gpt2 gpt2.yaml
```

![GPT-2 training with 8 A100 GPUs](https://imgur.com/v8SGpsF.png)

Or, you can train the model with a single A100 by adding `--gpus A100`:
```bash
sky launch -c gpt2 gpt2.yaml --gpus A100
```

![GPT-2 training with a single A100](https://imgur.com/hN65g4r.png)
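
In either case, you can check the provisioned cluster and stream the training logs from your laptop (the cluster name `gpt2` comes from the `-c gpt2` flag above):
```bash
sky status      # show clusters and their states
sky logs gpt2   # stream logs of the latest job on the cluster
```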

### Download logs and visualizations

After the training is finished, you can download the logs and visualizations with the following command:
```bash
scp -r gpt2:~/llm.c/log124M .
```
We can visualize the training progress with the notebook provided in [llm.c](https://github.com/karpathy/llm.c/blob/master/dev/vislog.ipynb). (Note: we cut off the training after 10K steps, which already achieves a validation loss similar to the OpenAI GPT-2 checkpoint.)

<div align="center">
<img src="https://imgur.com/lskPEAQ.png" width="60%">
</div>
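
To reproduce this plot yourself, one option (our suggestion, not part of this PR) is to fetch the notebook from the llm.c repository and point it at the downloaded `log124M` directory; this assumes you have Jupyter installed locally and that the raw URL below matches the repository layout:
```bash
# Fetch the visualization notebook from llm.c
wget https://raw.githubusercontent.com/karpathy/llm.c/master/dev/vislog.ipynb
# Open it next to the downloaded log124M/ directory and run all cells
jupyter notebook vislog.ipynb
```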

> Yes! We are able to reproduce the training of GPT-2 (124M) on any cloud with SkyPilot.

## Advanced: Run GPT-2 training in two stages

The data processing for GPT-2 training is CPU-bound, while the training is GPU-bound, so running the data processing on a GPU VM is not cost-effective. With SkyPilot, you can easily separate the data processing and the training into two stages and run them sequentially yourself, or let SkyPilot manage the dependency between the two stages.

With this split, the data processing runs on cheaper CPU VMs (e.g., ~\$0.4/hour), while the training runs on more expensive GPU VMs (e.g., ~\$1.3-\$3.6/hour for a single A100 GPU, or \$10.3-\$32.8/hour for 8 A100 GPUs).

We can run the data processing on a CPU VM and store the processed data in a cloud bucket. Then, we can run the training on a GPU VM with the processed data.

```bash
wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/gpt-2/gpt2-data.yaml
wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/gpt-2/gpt2-train.yaml
```

### Run two stages manually

#### Data processing

Run the following command to process the training data on a CPU VM and store it in a cloud bucket for future use (replace `your-bucket-name` with your bucket name):

```bash
sky launch -c gpt2-data gpt2-data.yaml --env BUCKET_NAME=your-bucket-name
```
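
Once the data processing job finishes, you can optionally tear down the CPU VM so it stops accruing cost; the processed data remains in the bucket:
```bash
sky logs gpt2-data   # stream the data processing logs
sky down gpt2-data   # terminate the CPU VM; the bucket keeps the data
```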

#### Training

After the data is processed, you can then train the model on a GPU VM with 8 A100 GPUs (replace `your-bucket-name` with your bucket name):

```bash
sky launch -c gpt2-train --detach-setup gpt2-train.yaml --env BUCKET_NAME=your-bucket-name
```

Or, you can train the model with a single A100 by adding `--gpus A100`:
```bash
sky launch -c gpt2-train --detach-setup gpt2-train.yaml --gpus A100 --env BUCKET_NAME=your-bucket-name
```
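
With `--detach-setup`, `sky launch` does not block on the setup phase, so it is handy to tail the job logs and tear down the GPU VM once training completes:
```bash
sky logs gpt2-train   # stream the training job's logs
sky down gpt2-train   # terminate the GPU VM when you are done
```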

### Run in a pipeline

We can also combine the two steps into a single SkyPilot job and let SkyPilot handle the dependency between the two steps. Here is an example of how to do this (replace `your-bucket-name` with your bucket name):
```bash
cat gpt2-data.yaml > gpt2-pipeline.yaml
echo "---" >> gpt2-pipeline.yaml
cat gpt2-train.yaml >> gpt2-pipeline.yaml
sky jobs launch -n gpt2 gpt2-pipeline.yaml --env BUCKET_NAME=your-bucket-name
```

SkyPilot will first download and process the dataset on a CPU VM and store the processed data in a GCS bucket. Then, it will launch a GPT-2 training job on a GPU VM. The training job will train GPT-2 (124M) on the processed data.
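
To monitor the pipeline, you can use SkyPilot's managed jobs commands (the job name `gpt2` matches the `-n gpt2` flag above):
```bash
sky jobs queue            # list managed jobs with their IDs and statuses
sky jobs logs <job-id>    # stream a job's logs, using an ID from the queue output
```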

llm/gpt-2/gpt2-data.yaml (new file, +32 lines)

name: gpt2-data

envs:
  BUCKET_NAME: # Fill in your bucket name

resources:
  cpus: 8+

file_mounts:
  /cache:
    name: $BUCKET_NAME
    mode: MOUNT

setup: |
  pip install tqdm tiktoken requests datasets
  git clone https://github.com/karpathy/llm.c.git || true
  # Pin llm.c to a known-good commit (git clone cannot take an @<commit> suffix)
  (cd llm.c && git checkout ed37d9261ba13ef212c01e2de8b309cbb46a2aa7)

  # Add a revision to pin the dataset version, as the latest fineweb
  # dataset removed the samples, causing the error:
  # "Please pass `features` or at least one example when writing data"
  sed -i 's/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train")/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train", revision="9767af12bf8f0f7d3c91e0345b89bc6b9cbe1a94")/' llm.c/dev/data/fineweb.py

run: |
  cd llm.c
  # Tokenize the FineWeb dataset 10B-token sample (takes ~1 hour, get lunch?).
  # Writes ~19GB of raw GPT-2 tokens to dev/data/fineweb10B
  # and ~46GB in ~/.cache/huggingface/datasets/HuggingFaceFW___fineweb.
  python dev/data/fineweb.py --version 10B

  rsync -Pavz --exclude "datasets/downloads/" ~/.cache/huggingface /cache/
  rsync -Pavz dev/data/fineweb10B /cache/
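
As a quick sanity check (our suggestion, not part of the task YAML), you can verify that the processed shards landed in the bucket by listing the mount point on the data cluster; SkyPilot sets up an SSH alias named after the cluster:
```bash
# The bucket is mounted at /cache on the gpt2-data cluster
ssh gpt2-data "ls -lh /cache/fineweb10B | head"
```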

llm/gpt-2/gpt2-train.yaml (new file, +93 lines)

name: train

envs:
  BUCKET_NAME: # Fill in your bucket name

resources:
  accelerators: A100:8
  # Use a Docker image to get a recent g++ for compiling llm.c.
  image_id: docker:nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
  any_of:
    # Don't use a Docker image on Lambda, since Docker is not supported
    # there yet; the base image works.
    - cloud: lambda
      image_id: null
    - cloud: aws
    - cloud: gcp
    - cloud: azure
    - cloud: fluidstack
    - cloud: kubernetes

file_mounts:
  ~/.cache/huggingface:
    name: $BUCKET_NAME
    mode: COPY

setup: |
  cd ~
  pip install tqdm tiktoken requests datasets

  # Install cuDNN so we can use FlashAttention and run fast (optional).
  # https://developer.nvidia.com/cudnn-downloads
  # For CUDA 12 (run `nvcc --version`) on Linux x86_64 Ubuntu 22.04.
  if [ -f ./CUDNN_INSTALLED ]; then
    echo "cudnn already installed"
  else
    system=$(lsb_release -si | tr '[:upper:]' '[:lower:]')
    # Get the OS version and remove the dot
    version=$(lsb_release -sr | tr -d .)
    export system_version="${system}${version}"
    wget https://developer.download.nvidia.com/compute/cudnn/9.1.1/local_installers/cudnn-local-repo-${system_version}-9.1.1_1.0-1_amd64.deb -O cudnn-installer.deb
    sudo dpkg -i cudnn-installer.deb
    sudo cp /var/cudnn-local-repo-${system_version}-9.1.1/cudnn-*-keyring.gpg /usr/share/keyrings/
    # Tolerate a problematic kubernetes.list apt source
    sudo apt-get update --allow-releaseinfo-change || true

    sudo apt-get -y install cudnn-cuda-12

    touch ./CUDNN_INSTALLED
  fi

  # "Install" cudnn-frontend to ~/
  sudo apt -y install git
  git clone https://github.com/NVIDIA/cudnn-frontend.git || true

  # Install MPI (optional, if you intend to use multiple GPUs).
  # SkyPilot does not install MPI by default, as it requires NCCL, which
  # needs to be installed manually.
  sudo apt install -y openmpi-bin openmpi-doc libopenmpi-dev
  # Install NCCL
  pip install nvidia-nccl-cu12
  export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/nccl2/lib
  export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/nccl2/include

  git clone https://github.com/karpathy/llm.c.git || true
  cd llm.c
  ln -s ~/.cache/huggingface/fineweb10B dev/data/
  # Compile llm.c (mixed precision, with cuDNN flash-attention).
  # The first compilation takes ~1 minute, mostly due to cuDNN.
  make train_gpt2cu USE_CUDNN=1

run: |
  cd ~/llm.c
  # Train on multiple GPUs
  mpirun -np $SKYPILOT_NUM_GPUS_PER_NODE --allow-run-as-root ./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 -s 20000 \
    -h 1

  # Upload the logs and model to the bucket
  rsync -Pavz log124M ~/.cache/huggingface
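
After training completes, you can pull the logs and checkpoints from the training cluster in the same way as in the single-job example above (the cluster name `gpt2-train` matches the `-c` flag used earlier):
```bash
scp -r gpt2-train:~/llm.c/log124M .
```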

llm/gpt-2/gpt2.yaml (new file, +95 lines)

name: train

resources:
  accelerators: A100:8
  # Use a Docker image to get a recent g++ for compiling llm.c.
  image_id: docker:nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
  any_of:
    # Don't use a Docker image on Lambda, since Docker is not supported
    # there yet; the base image works.
    - cloud: lambda
      image_id: null
    - cloud: aws
    - cloud: gcp
    - cloud: azure
    - cloud: fluidstack
    - cloud: kubernetes

setup: |
  cd ~
  pip install tqdm tiktoken requests datasets

  # Training dependencies:
  # Install cuDNN so we can use FlashAttention and run fast (optional).
  # https://developer.nvidia.com/cudnn-downloads
  # For CUDA 12 (run `nvcc --version`) on Linux x86_64 Ubuntu 22.04.
  if [ -f ./CUDNN_INSTALLED ]; then
    echo "cudnn already installed"
  else
    system=$(lsb_release -si | tr '[:upper:]' '[:lower:]')
    # Get the OS version and remove the dot
    version=$(lsb_release -sr | tr -d .)
    export system_version="${system}${version}"
    wget https://developer.download.nvidia.com/compute/cudnn/9.1.1/local_installers/cudnn-local-repo-${system_version}-9.1.1_1.0-1_amd64.deb -O cudnn-installer.deb
    sudo dpkg -i cudnn-installer.deb
    sudo cp /var/cudnn-local-repo-${system_version}-9.1.1/cudnn-*-keyring.gpg /usr/share/keyrings/
    # Tolerate a problematic kubernetes.list apt source
    sudo apt-get update --allow-releaseinfo-change || true

    sudo apt-get -y install cudnn-cuda-12

    touch ./CUDNN_INSTALLED
  fi

  # "Install" cudnn-frontend to ~/
  sudo apt -y install git
  git clone https://github.com/NVIDIA/cudnn-frontend.git || true

  # Install MPI (optional, if you intend to use multiple GPUs).
  # SkyPilot does not install MPI by default, as it requires NCCL, which
  # needs to be installed manually.
  sudo apt install -y openmpi-bin openmpi-doc libopenmpi-dev
  # Install NCCL
  pip install nvidia-nccl-cu12
  export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/nccl2/lib
  export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/nccl2/include

  git clone https://github.com/karpathy/llm.c.git || true
  cd llm.c

  # Add a revision to pin the dataset version, as the latest fineweb
  # dataset removed the samples, causing the error:
  # "Please pass `features` or at least one example when writing data"
  sed -i 's/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train")/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train", revision="9767af12bf8f0f7d3c91e0345b89bc6b9cbe1a94")/' dev/data/fineweb.py

  # Compile llm.c (mixed precision, with cuDNN flash-attention).
  # The first compilation takes ~1 minute, mostly due to cuDNN.
  make train_gpt2cu USE_CUDNN=1

run: |
  cd ~/llm.c
  # Process the data:
  # tokenize the FineWeb dataset 10B-token sample (takes ~1 hour, get lunch?),
  # writing ~19GB of raw GPT-2 tokens to dev/data/fineweb10B
  # and ~46GB in ~/.cache/huggingface/datasets/HuggingFaceFW___fineweb.
  python dev/data/fineweb.py --version 10B

  # Start training on multiple GPUs
  mpirun -np $SKYPILOT_NUM_GPUS_PER_NODE --allow-run-as-root ./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 -s 20000 \
    -h 1
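
A note on the training flags (our reading, based on the llm.c GPT-2 (124M) discussion linked in the README): `-b 64 -t 1024` sets the per-GPU micro-batch to 64 sequences of 1,024 tokens, and `-d 524288` requests a total batch of 524,288 tokens per optimizer step. With 8 GPUs, 64 × 1024 × 8 = 524,288, so no gradient accumulation is needed; a single-GPU run reaches the same total batch via gradient accumulation, which is why the single-A100 setup trains more slowly with otherwise identical hyperparameters.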

Review comment: Since cost + time seems like a big motivation behind this work ("Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20"), should we mention that here? Perhaps we can show the optimizer output?
Reply: Good point. Added the comparison in the sentence. How does it look to you?