Commit ea5aa50 (1 parent: 3d03d3a)

[LLM] llm.c training for GPT 2 (#3611)

* add gpt-2 example
* Use ubuntu for GCP
* fix ncl
* Fix GPT-2
* add train and data
* use 8 gpus
* revert gcp change
* update readme
* Add GCP image
* make file_mounts more general
* avoid any_of
* change back to use ubuntu image with wait for GPU
* wait cuda installation
* Add retry for file mount and use env for bucket name
* revert retries
* update the image
* change to docker for better dependency
* revert changes in gcp template
* avoid using docker on lambda
* Add single GPU
* Elaborate readme
* Update llm/gpt-2/README.md
* fix
* address comments
* Fix data fetching
* Add visualization
* update
* reduce cpu cost
* update loss curve

Co-authored-by: Romil Bhardwaj <[email protected]>

Showing 6 changed files with 335 additions and 2 deletions.

llm/gpt-2/README.md:

# Run GPT-2 in llm.c on any cloud with SkyPilot

This is a reproducible package of llm.c's GPT-2 (124M) training by @karpathy (https://github.com/karpathy/llm.c/discussions/481).
With SkyPilot, you can run GPT-2 (124M) training on any cloud.

## Prerequisites

1. Install [SkyPilot](https://github.com/skypilot-org/skypilot):
```bash
pip install "skypilot-nightly[aws,gcp,azure,kubernetes,lambda,fluidstack]"  # Choose the clouds you want to enable
```
2. Enable clouds for SkyPilot:
```bash
sky check
```
See the [SkyPilot docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html) for instructions on enabling clouds.

3. Download the YAML for starting the training:
```bash
wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/gpt-2/gpt2.yaml
```

## Run GPT-2 training

Run the following command to start GPT-2 (124M) training on a GPU VM with 8 A100 GPUs:

```bash
sky launch -c gpt2 gpt2.yaml
```
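
While the job runs, you can stream its output or check the cluster state with SkyPilot's standard commands:

```bash
# Stream the training logs of the cluster (Ctrl-C stops streaming, not the job):
sky logs gpt2

# Show the status of your clusters:
sky status
```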

![GPT-2 training with 8 A100 GPUs](https://imgur.com/v8SGpsF.png)

Or, you can train the model with a single A100 by adding `--gpus A100`:
```bash
sky launch -c gpt2 gpt2.yaml --gpus A100
```

![GPT-2 training with a single A100](https://imgur.com/hN65g4r.png)

### Download logs and visualizations

After the training is finished, you can download the logs and visualizations with the following command:
```bash
scp -r gpt2:~/llm.c/log124M .
```
We can visualize the training progress with the notebook provided in [llm.c](https://github.com/karpathy/llm.c/blob/master/dev/vislog.ipynb). (Note: we cut off the training after 10K steps, which already achieves a validation loss similar to the OpenAI GPT-2 checkpoint.)

<div align="center">
<img src="https://imgur.com/lskPEAQ.png" width="60%">
</div>
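
If you just want a quick plot without opening the notebook, here is a minimal Python sketch. It assumes the `log124M/main.log` format written by llm.c at the time of writing, with lines like `s:<step> trl:<train loss>` and `s:<step> tel:<val loss>`; check your log file first, as the keys may differ across llm.c versions.

```python
import re
import matplotlib.pyplot as plt

def parse(path: str, key: str):
    """Collect (step, value) pairs for a given log key, e.g. 'trl' or 'tel'."""
    pattern = re.compile(rf"s:(\d+)\s+{key}:([\d.]+)")
    steps, values = [], []
    with open(path) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                steps.append(int(m.group(1)))
                values.append(float(m.group(2)))
    return steps, values

for key, label in [("trl", "train loss"), ("tel", "val loss")]:
    steps, values = parse("log124M/main.log", key)
    plt.plot(steps, values, label=label)
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.show()
```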

> Yes! We are able to reproduce the training of GPT-2 (124M) on any cloud with SkyPilot.

## Advanced: Run GPT-2 training in two stages

The data processing for GPT-2 training is CPU-bound, while the training is GPU-bound, so running the data processing on a GPU VM is not cost-effective. With SkyPilot, you can easily split the data processing and the training into two stages, executing them sequentially yourself or letting SkyPilot manage the dependency between them.

With this approach, the data processing can run on cheaper CPU VMs (e.g., ~\$0.4/hour), while the training runs on more expensive GPU VMs (e.g., ~\$1.3-\$3.6/hour for a single A100 GPU, or \$10.3-\$32.8/hour for 8 A100 GPUs).
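
As a rough, back-of-the-envelope example using the numbers above: tokenizing the dataset takes about an hour, so running it on a CPU VM costs roughly \$0.4, whereas keeping an 8x A100 VM up for that same hour would cost roughly \$10.3-\$32.8. Splitting the stages thus saves on the order of \$10-\$32 per run, and the processed data in the bucket can be reused by later training runs.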

We can run the data processing on a CPU VM and store the processed data in a cloud bucket. Then, we can run the training on a GPU VM with the processed data.

```bash
wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/gpt-2/gpt2-data.yaml
wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/gpt-2/gpt2-train.yaml
```

### Run two stages manually

#### Data processing

Run the following command to process the training data on a CPU VM and store it in a cloud bucket for future use (replace `your-bucket-name` with your bucket name):

```bash
sky launch -c gpt2-data gpt2-data.yaml --env BUCKET_NAME=your-bucket-name
```
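
Once the data job finishes, you can tear down the CPU VM so it stops accruing cost; the processed data persists in the bucket:

```bash
sky down gpt2-data
```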

#### Training

After the data is processed, you can then train the model on a GPU VM with 8 A100 GPUs (replace `your-bucket-name` with your bucket name):

```bash
sky launch -c gpt2-train --detach-setup gpt2-train.yaml --env BUCKET_NAME=your-bucket-name
```

Or, you can train the model with a single A100 by adding `--gpus A100`:

```bash
sky launch -c gpt2-train --detach-setup gpt2-train.yaml --gpus A100 --env BUCKET_NAME=your-bucket-name
```

### Run in a Pipeline

We can also combine the two steps into a single SkyPilot job and let SkyPilot handle the dependency between them. Here is an example of how to do this (replace `your-bucket-name` with your bucket name):

```bash
cat gpt2-data.yaml > gpt2-pipeline.yaml
echo "---" >> gpt2-pipeline.yaml
cat gpt2-train.yaml >> gpt2-pipeline.yaml
sky jobs launch -n gpt2 gpt2-pipeline.yaml --env BUCKET_NAME=your-bucket-name
```

SkyPilot will first download and process the dataset on a CPU VM and store the processed data in a GCS bucket. Then, it will launch a GPT-2 training job on a GPU VM to train GPT-2 (124M) on the processed data.
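
You can monitor the pipeline with SkyPilot's managed-jobs commands:

```bash
# List managed jobs and their current state:
sky jobs queue

# Stream the logs of a job (use the JOB_ID shown in the queue output):
sky jobs logs JOB_ID
```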

llm/gpt-2/gpt2-data.yaml:

```yaml
name: gpt2-data

envs:
  BUCKET_NAME: # Fill in your bucket name

resources:
  cpus: 8+

file_mounts:
  /cache:
    name: $BUCKET_NAME
    mode: MOUNT

setup: |
  pip install tqdm tiktoken requests datasets
  # Pin llm.c to a fixed revision for reproducibility.
  git clone https://github.com/karpathy/llm.c.git || true
  cd llm.c
  git checkout ed37d9261ba13ef212c01e2de8b309cbb46a2aa7
  # Pin the dataset revision, as the latest fineweb dataset removed the
  # samples, causing the error:
  # Please pass `features` or at least one example when writing data
  sed -i 's/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train")/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train", revision="9767af12bf8f0f7d3c91e0345b89bc6b9cbe1a94")/' dev/data/fineweb.py
run: |
  cd llm.c
  # Tokenize the FineWeb dataset's 10B-token sample (takes ~1 hour, get lunch?).
  # Writes ~19GB of raw GPT-2 tokens to dev/data/fineweb10B
  # and ~46GB in ~/.cache/huggingface/datasets/HuggingFaceFW___fineweb.
  python dev/data/fineweb.py --version 10B
  rsync -Pavz --exclude "datasets/downloads/" ~/.cache/huggingface /cache/
  rsync -Pavz dev/data/fineweb10B /cache/
```
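
After the data job finishes, you can sanity-check that the tokenized shards landed in the bucket. A quick check assuming a GCS bucket (use the matching CLI, e.g. `aws s3 ls`, for other clouds):

```bash
gsutil ls gs://your-bucket-name/fineweb10B | head
```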

llm/gpt-2/gpt2-train.yaml:

```yaml
name: train

envs:
  BUCKET_NAME: # Fill in your bucket name

resources:
  accelerators: A100:8
  # Use a Docker image with a recent g++ so llm.c compiles.
  image_id: docker:nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
  any_of:
    # Docker images are not supported on Lambda yet, but the base image works.
    - cloud: lambda
      image_id: null
    - cloud: aws
    - cloud: gcp
    - cloud: azure
    - cloud: fluidstack
    - cloud: kubernetes

file_mounts:
  ~/.cache/huggingface:
    name: $BUCKET_NAME
    mode: COPY

setup: |
  cd ~
  pip install tqdm tiktoken requests datasets
  # Install cuDNN so we can use FlashAttention and run fast (optional).
  # https://developer.nvidia.com/cudnn-downloads
  # Assumes CUDA 12 (check `nvcc --version`) on Linux x86_64 Ubuntu 22.04.
  if [ -f ./CUDNN_INSTALLED ]; then
    echo "cudnn already installed"
  else
    system=$(lsb_release -si | tr '[:upper:]' '[:lower:]')
    # Get the OS version and remove the dot.
    version=$(lsb_release -sr | tr -d .)
    export system_version="${system}${version}"
    wget https://developer.download.nvidia.com/compute/cudnn/9.1.1/local_installers/cudnn-local-repo-${system_version}-9.1.1_1.0-1_amd64.deb -O cudnn-installer.deb
    sudo dpkg -i cudnn-installer.deb
    sudo cp /var/cudnn-local-repo-${system_version}-9.1.1/cudnn-*-keyring.gpg /usr/share/keyrings/
    # Tolerate problematic apt sources (e.g., a stale kubernetes.list).
    sudo apt-get update --allow-releaseinfo-change || true
    sudo apt-get -y install cudnn-cuda-12
    touch ./CUDNN_INSTALLED
  fi
  # "Install" cudnn-frontend to ~/.
  sudo apt -y install git
  git clone https://github.com/NVIDIA/cudnn-frontend.git || true
  # Install MPI (optional, if you intend to use multiple GPUs). SkyPilot does
  # not install MPI itself, as that requires NCCL, which must be installed
  # manually.
  sudo apt install -y openmpi-bin openmpi-doc libopenmpi-dev
  # Install NCCL.
  pip install nvidia-nccl-cu12
  export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/nccl2/lib
  export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/nccl2/include
  git clone https://github.com/karpathy/llm.c.git || true
  cd llm.c
  ln -s ~/.cache/huggingface/fineweb10B dev/data/
  # Compile llm.c (mixed precision, with cuDNN flash-attention).
  # The first compilation takes ~1 minute, mostly due to cuDNN.
  make train_gpt2cu USE_CUDNN=1
run: |
  cd ~/llm.c
  # Train on multiple GPUs.
  mpirun -np $SKYPILOT_NUM_GPUS_PER_NODE --allow-run-as-root ./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 -s 20000 \
    -h 1
  # Upload the log and model to the bucket.
  rsync -Pavz log124M ~/.cache/huggingface
```
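
A note on the batch-size flags in the `run` section above: `-b 64` is the per-GPU micro-batch size in sequences, `-t 1024` is the sequence length, and `-d 524288` is the total number of tokens per optimizer step. These line up exactly on 8 GPUs, since 64 × 1024 × 8 = 524,288 tokens, so no gradient accumulation is needed; llm.c derives the gradient-accumulation steps from these flags, so on a single A100 (launched with `--gpus A100`) it would instead take 524,288 / (64 × 1024) = 8 micro-steps per update.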

llm/gpt-2/gpt2.yaml:

```yaml
name: train

resources:
  accelerators: A100:8
  # Use a Docker image with a recent g++ so llm.c compiles.
  image_id: docker:nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
  any_of:
    # Docker images are not supported on Lambda yet, but the base image works.
    - cloud: lambda
      image_id: null
    - cloud: aws
    - cloud: gcp
    - cloud: azure
    - cloud: fluidstack
    - cloud: kubernetes

setup: |
  cd ~
  pip install tqdm tiktoken requests datasets
  # Training dependencies.
  # Install cuDNN so we can use FlashAttention and run fast (optional).
  # https://developer.nvidia.com/cudnn-downloads
  # Assumes CUDA 12 (check `nvcc --version`) on Linux x86_64 Ubuntu 22.04.
  if [ -f ./CUDNN_INSTALLED ]; then
    echo "cudnn already installed"
  else
    system=$(lsb_release -si | tr '[:upper:]' '[:lower:]')
    # Get the OS version and remove the dot.
    version=$(lsb_release -sr | tr -d .)
    export system_version="${system}${version}"
    wget https://developer.download.nvidia.com/compute/cudnn/9.1.1/local_installers/cudnn-local-repo-${system_version}-9.1.1_1.0-1_amd64.deb -O cudnn-installer.deb
    sudo dpkg -i cudnn-installer.deb
    sudo cp /var/cudnn-local-repo-${system_version}-9.1.1/cudnn-*-keyring.gpg /usr/share/keyrings/
    # Tolerate problematic apt sources (e.g., a stale kubernetes.list).
    sudo apt-get update --allow-releaseinfo-change || true
    sudo apt-get -y install cudnn-cuda-12
    touch ./CUDNN_INSTALLED
  fi
  # "Install" cudnn-frontend to ~/.
  sudo apt -y install git
  git clone https://github.com/NVIDIA/cudnn-frontend.git || true
  # Install MPI (optional, if you intend to use multiple GPUs). SkyPilot does
  # not install MPI itself, as that requires NCCL, which must be installed
  # manually.
  sudo apt install -y openmpi-bin openmpi-doc libopenmpi-dev
  # Install NCCL.
  pip install nvidia-nccl-cu12
  export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/nccl2/lib
  export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/nccl2/include
  git clone https://github.com/karpathy/llm.c.git || true
  cd llm.c
  # Pin the dataset revision, as the latest fineweb dataset removed the
  # samples, causing the error:
  # Please pass `features` or at least one example when writing data
  sed -i 's/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train")/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train", revision="9767af12bf8f0f7d3c91e0345b89bc6b9cbe1a94")/' dev/data/fineweb.py
  # Compile llm.c (mixed precision, with cuDNN flash-attention).
  # The first compilation takes ~1 minute, mostly due to cuDNN.
  make train_gpt2cu USE_CUDNN=1
run: |
  cd ~/llm.c
  # Process the data:
  # tokenize the FineWeb dataset's 10B-token sample (takes ~1 hour, get lunch?).
  # Writes ~19GB of raw GPT-2 tokens to dev/data/fineweb10B
  # and ~46GB in ~/.cache/huggingface/datasets/HuggingFaceFW___fineweb.
  python dev/data/fineweb.py --version 10B
  # Start training on multiple GPUs.
  mpirun -np $SKYPILOT_NUM_GPUS_PER_NODE --allow-run-as-root ./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 -s 20000 \
    -h 1
```