This PR adds experimental wandb support. Not sure this is "landable" considering y'all use TensorBoard by default; personally I vastly prefer wandb because I can share my training runs with a link and don't need to muck around with SSH tunneling, so I'm opening this since I'm using it myself. If there's interest I can work to land this.

To use this, kick off a training run as usual with `CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh`, but also run `wandb login` and paste in your token.

![Screenshot 2024-11-25 at 12 16 20 PM](https://github.com/user-attachments/assets/4d8c3893-2bb2-435e-b9bb-69558b8ea7ea)

Changes in the logs will look like:

![Screenshot 2024-11-25 at 12 27 42 PM](https://github.com/user-attachments/assets/24760ef3-21d5-4292-bdef-cddb5e916e6b)

Also, only slightly related, but the Llama 3 tokenizer is not available on HF anymore, so this also adds instructions for 3.1 and 3.2.

<details>
<summary>Click here for detailed logs.</summary>

```
[rank0]:2024-11-25 11:33:24,320 - root - INFO - Dumping traces at step 1000
[rank0]:2024-11-25 11:33:24,576 - root - INFO - Finished dumping traces in 0.26 seconds
[rank0]:2024-11-25 11:33:24,577 - root - INFO - Sleeping 2 seconds for other ranks to complete
[rank0]:wandb:
[rank0]:wandb: Run history:
[rank0]:wandb: loss_metrics/global_avg_loss █▆▅▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
[rank0]:wandb: loss_metrics/global_max_loss █▇▄▄▃▃▄▃▃▆▃▃▃▃▃▃▂▂▂▂▃▂▂▃▁▂▂▂▁▃▂▁▂▁▂▂▁▄▁▁
[rank0]:wandb: memory/max_active(%) ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
[rank0]:wandb: memory/max_active(GiB) ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
[rank0]:wandb: memory/max_reserved(%) ▁███████████████████████████████████████
[rank0]:wandb: memory/max_reserved(GiB) ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
[rank0]:wandb: memory/num_alloc_retries ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
[rank0]:wandb: memory/num_ooms ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
[rank0]:wandb: mfu(%) ▁███████▇██████▇█████████▇█▇████████████
[rank0]:wandb: step ▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▄▅▅▅▆▆▆▆▇▇▇▇▇▇▇████
[rank0]:wandb: time_metrics/data_loading(%) ▁▁▁▁▂▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▂▁▁▂▁▁▁▂
[rank0]:wandb: time_metrics/data_loading(s) ▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂
[rank0]:wandb: time_metrics/end_to_end(s) ▁▇▇▇▇█▇▇▇█▇▇▇▇▇▇▇▇▇▇▇▇██▇▇▇█▇▇▇▇▇▇█▇█▇▇▇
[rank0]:wandb: wps ███▁████▄█▇▅████████▅▄████▇███▇▄████▇██▇
[rank0]:wandb:
[rank0]:wandb: Run summary:
[rank0]:wandb: loss_metrics/global_avg_loss 4.53519
[rank0]:wandb: loss_metrics/global_max_loss 4.99517
[rank0]:wandb: memory/max_active(%) 43.33611
[rank0]:wandb: memory/max_active(GiB) 41.17145
[rank0]:wandb: memory/max_reserved(%) 52.19301
[rank0]:wandb: memory/max_reserved(GiB) 49.58594
[rank0]:wandb: memory/num_alloc_retries 0
[rank0]:wandb: memory/num_ooms 0
[rank0]:wandb: mfu(%) 30.75216
[rank0]:wandb: step 1000
[rank0]:wandb: time_metrics/data_loading(%) 1.01461
[rank0]:wandb: time_metrics/data_loading(s) 0.01583
[rank0]:wandb: time_metrics/end_to_end(s) 1.55993
[rank0]:wandb: wps 5251.52034
[rank0]:wandb:
[rank0]:wandb: 🚀 View run skilled-glitter-1 at: https://wandb.ai/sahancpal-meta/torchtitan/runs/r1zqr75b
```

</details>

Co-authored-by: tianyu-l <[email protected]>
Showing 6 changed files with 161 additions and 66 deletions.
```diff
@@ -6,3 +6,4 @@ sentencepiece
 tiktoken
 blobfile
 tabulate
+wandb
```
# Metrics
We support automatically collecting metrics such as:
1. High-level system metrics such as MFU, average loss, max loss, and words per second
2. Memory metrics to measure max VRAM consumption and the number of OOMs
3. Timing metrics to measure data loading bottlenecks
These metrics can then be visualized in either a TensorBoard or W&B dashboard.
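As an illustration only, each logged step amounts to a flat mapping of metric name to scalar, which either dashboard plots as one time series per key; the names and values below are copied from the sample run summary in this PR's logs:

```python
# Illustrative payload for one step; metric names and values mirror the
# step-1000 wandb run summary shown in this PR's detailed logs.
step_metrics = {
    "loss_metrics/global_avg_loss": 4.53519,
    "memory/max_active(GiB)": 41.17145,
    "memory/num_ooms": 0,
    "mfu(%)": 30.75216,
    "time_metrics/data_loading(s)": 0.01583,
    "wps": 5251.52034,
}

# Every value is a plain scalar, so both backends can chart each key
# against the step counter without any special handling.
flat_scalars = all(isinstance(v, (int, float)) for v in step_metrics.values())
```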
## TensorBoard

To visualize TensorBoard metrics of models trained on a remote server via a local web browser:
1. Make sure the `metrics.enable_tensorboard` option is set to true in model training (either from a `.toml` file or from the CLI).
2. Set up SSH tunneling by running the following from the local CLI:
```
ssh -L 6006:127.0.0.1:6006 [username]@[hostname]
```
3. Inside the SSH session on the remote server, go to the torchtitan repo and start the TensorBoard backend:
```
tensorboard --logdir=./outputs/tb
```
||
4. In the local web browser, go to the URL it provides OR to http://localhost:6006/. | ||
|
||
## Weights and Biases

Weights and Biases (W&B) will automatically send metrics to a remote server if you log in with `wandb login`.

So all you need to do is make sure that `metrics.enable_wandb` is enabled.
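For instance, enabling W&B in a training `.toml` might look like the following sketch; only the two `enable_*` option names are taken from this doc, and the `[metrics]` table name is an assumption about how the options are grouped:

```toml
# Sketch of a metrics config section; the [metrics] table name is assumed
# from the dotted option names metrics.enable_tensorboard / metrics.enable_wandb.
[metrics]
enable_tensorboard = false
enable_wandb = true   # requires a prior `wandb login`
```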
For an example, you can inspect [debug_model.toml](../train_configs/debug_model.toml).

Note that if both W&B and TensorBoard are enabled, we will prioritize W&B.
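The priority rule above can be sketched as a small helper; this is a hypothetical illustration, not torchtitan's actual code:

```python
# Hypothetical helper illustrating backend selection: W&B takes priority
# whenever both backends are enabled, per the note above.
def choose_metrics_backend(enable_tensorboard: bool, enable_wandb: bool) -> str:
    if enable_wandb:
        return "wandb"        # W&B wins even if TensorBoard is also on
    if enable_tensorboard:
        return "tensorboard"
    return "none"             # metrics logging disabled
```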