Commit

Add visualization
Michaelvll committed May 31, 2024
1 parent 3b7312e commit b6566d7
Showing 2 changed files with 19 additions and 0 deletions.
16 changes: 16 additions & 0 deletions llm/gpt-2/README.md
@@ -37,6 +37,20 @@ sky launch -c gpt2 gpt2.yaml --gpus A100

![GPT-2 training with a single A100](https://imgur.com/hN65g4r.png)

## Download logs and visualizations

After the training finishes, you can download the logs and visualizations with the following command (SkyPilot adds the cluster to your SSH config, so `scp gpt2:...` works directly):
```bash
scp -r gpt2:~/llm.c/log124M .
```
We can visualize the training progress with the notebook provided in [llm.c](https://github.com/karpathy/llm.c/blob/master/dev/vislog.ipynb). (Note: we cut off the training after 8000 steps, which already achieves a validation loss similar to the OpenAI GPT-2 checkpoint.)

![Training progress](https://imgur.com/qeNNlIB.png)
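
If you prefer a quick script over the notebook, the sketch below plots the loss curves from the downloaded log directory. It assumes the directory contains a `main.log` in llm.c's line format (`s:<step> trl:<loss>` for train loss, `s:<step> tel:<loss>` for validation loss); adjust the path and regexes if your log differs.

```python
# Minimal sketch (not the official vislog.ipynb): plot train/val loss
# from an llm.c-style main.log. The log path and line format below are
# assumptions; check your log124M directory.
import re
import matplotlib.pyplot as plt

train, val = [], []
with open("log124M/main.log") as f:  # assumed log location
    for line in f:
        m = re.match(r"s:(\d+)\s+trl:([\d.]+)", line)
        if m:
            train.append((int(m.group(1)), float(m.group(2))))
        m = re.match(r"s:(\d+)\s+tel:([\d.]+)", line)
        if m:
            val.append((int(m.group(1)), float(m.group(2))))

for pts, label in [(train, "train loss"), (val, "val loss")]:
    if pts:
        steps, losses = zip(*pts)
        plt.plot(steps, losses, label=label)
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.title("GPT-2 (124M) training progress")
plt.show()
```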

> Yes! We are able to reproduce the training of GPT-2 (124M) on any cloud with SkyPilot.


## Advanced: Run GPT-2 training in two stages

The data processing for GPT-2 training is CPU-bound, while the training is GPU-bound. Having the data processing on a GPU VM is not cost-effective. With SkyPilot, you can easily
@@ -89,3 +103,5 @@ SkyPilot will first download and process the dataset on a CPU VM and store the
processed data in a GCS bucket. Then, it will launch a GPT-2 training job on a
GPU VM. The training job will train GPT-2 (124M) on the processed data.
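
As a sketch of how the two stages could be driven programmatically, the snippet below uses SkyPilot's Python API instead of the CLI. The file name `gpt2-data.yaml` is a hypothetical stand-in for the data-processing YAML (only `gpt2-train.yaml` appears in this diff), and both YAMLs are assumed to share a bucket via their `file_mounts`.

```python
import sky

# Stage 1: process the dataset on a cheap CPU VM; the YAML is assumed
# to write the tokenized data to a bucket declared in its file_mounts.
data_task = sky.Task.from_yaml("gpt2-data.yaml")  # hypothetical file name
sky.launch(data_task, cluster_name="gpt2-data")

# Stage 2: train GPT-2 (124M) on a GPU VM, reading the processed data
# from the same bucket mount.
train_task = sky.Task.from_yaml("gpt2-train.yaml")
sky.launch(train_task, cluster_name="gpt2-train")
```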



3 changes: 3 additions & 0 deletions llm/gpt-2/gpt2-train.yaml
@@ -88,3 +88,6 @@ run: |
-n 5000 \
-v 250 -s 20000 \
-h 1
# Upload the log and model to the bucket (mounted at ~/.cache/huggingface)
rsync -Pavz log124M ~/.cache/huggingface
