Create a new example, starting from the code of the existing Wandb example.
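A minimal sketch of the kind of starting point this could be, loosely following the shape of a wandb training loop; the project name, model, and hyperparameters here are illustrative, not the actual example code:

```python
# Illustrative starting point for the tutorial: a plain training loop that
# logs its loss to wandb. Names (project, lr, etc.) are placeholders.
import torch
import torch.nn as nn
import wandb
from torch.utils.data import DataLoader

def train(model: nn.Module, loader: DataLoader, epochs: int = 1):
    run = wandb.init(project="profiling-tutorial", config={"epochs": epochs})
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
            run.log({"train/loss": loss.item()})
    run.finish()
```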
Modify the code to use the ImageNet dataset instead of CIFAR10 (ask @lebrice for the optimized ImageNet recipe for the Mila cluster).
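Until we have the optimized recipe, a plain torchvision pipeline could stand in as a placeholder; the dataset path is a guess and would come from the cluster recipe:

```python
# Placeholder ImageNet pipeline using plain torchvision, to be replaced by
# the optimized recipe for the Mila cluster. The path below is illustrative.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

train_dataset = datasets.ImageFolder("/path/to/imagenet/train",
                                     transform=train_transform)
train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True,
                          num_workers=8, pin_memory=True)
```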
Demonstrate how to diagnose whether dataloading is the bottleneck in the code (compare throughput with and without the training step).
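A sketch of that check: measure dataloader-only throughput, then throughput with the training step included. If the two numbers are close, the GPU spends most of its time waiting for data. `train_loader` and `training_step` are the illustrative names from the snippets above.

```python
# Compare images/second for the dataloader alone vs. with training included.
import itertools
import time

import torch

def measure_throughput(loader, step_fn=None, num_batches=100):
    """Images per second over `num_batches`; `step_fn` optionally runs one
    forward/backward/optimizer step on each batch."""
    iterator = itertools.islice(iter(loader), num_batches + 1)
    next(iterator)  # warmup batch, so worker startup doesn't skew the result
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    n_images = 0
    for x, y in iterator:
        n_images += x.shape[0]
        if step_fn is not None:
            step_fn((x, y))
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return n_images / (time.perf_counter() - start)

data_only = measure_throughput(train_loader)
with_training = measure_throughput(train_loader, step_fn=training_step)
print(f"dataloading only: {data_only:.0f} img/s, "
      f"with training: {with_training:.0f} img/s")
```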
Once we know that dataloading is no longer the bottleneck, show the user how to see the GPU utilization metrics on the wandb run page (for the purposes of the tutorial, GPU utilization should ideally still be fairly low at this point, so that there is a model-side bottleneck left to find).
...
Show how to use the PyTorch profiler to find a (perhaps artificial) bottleneck in the model code, for example by making part of the code use much more VRAM than required, or perform needless copies, just to demonstrate the idea.
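A sketch of what that could look like, reusing the illustrative names from the snippets above. The needless device-to-host-to-device round trip is one example of the kind of artificial bottleneck we could plant; it should show up in the trace as large memcpy blocks and idle GPU time.

```python
# Profile a few training steps with torch.profiler, writing a trace that
# TensorBoard's profiler plugin can display.
import torch
from torch.profiler import (ProfilerActivity, profile, schedule,
                            tensorboard_trace_handler)

def training_step(batch):
    x, y = batch
    x, y = x.to(device), y.to(device)
    logits = model(x)
    # Artificial bottleneck: a pointless device -> host -> device round trip.
    logits = logits.cpu().to(device)
    loss = criterion(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
    record_shapes=True,
    profile_memory=True,
) as prof:
    for step, batch in enumerate(train_loader):
        training_step(batch)
        prof.step()
        if step >= 5:
            break
```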
The tutorial should instruct people on how to visually inspect the PyTorch profiler output window to identify the bottleneck. Ask @obilaniu for tips on how to do this as needed.
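A quick text-mode complement to the visual trace viewer: once the `with profile(...)` block above has exited, the ops can be printed sorted by CUDA time, so the heaviest operations (including the artificial copies) appear at the top of the table.

```python
# Text summary of the profile collected above, sorted by total CUDA time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```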
Show how the output of the profiler changes once this last bottleneck is fixed. Give hints on how to keep identifying the next bottleneck, and suggest potential avenues for further optimization (e.g. torch.compile, more dataloader workers, multiple GPUs).
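Sketches of those further-optimization avenues, again assuming the same `model` and `train_dataset` names as the earlier snippets:

```python
import torch
from torch.utils.data import DataLoader

# torch.compile (PyTorch >= 2.0): the first few iterations are slow while
# the model compiles, so exclude them from any throughput comparison.
model = torch.compile(model)

# More dataloader workers, if dataloading becomes the bottleneck again:
train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True,
                          num_workers=16, pin_memory=True,
                          persistent_workers=True)

# Multiple GPUs would be a larger change: launch with torchrun and wrap the
# model in torch.nn.parallel.DistributedDataParallel.
```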
Create a WandB report showing the metrics before/during/after this optimization/profiling process.
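The report would normally be assembled in the web UI, but wandb also has a (beta) programmatic Reports API that could seed it; this is an assumption about an API surface that has been moving, so check the current wandb docs before relying on it:

```python
# Very much a sketch: draft the report with wandb's (beta) Reports API.
import wandb.apis.reports as wr

report = wr.Report(
    project="profiling-tutorial",  # hypothetical project name from above
    title="ImageNet training: before/during/after profiling",
    description="Throughput, GPU utilization, and profiler findings "
                "at each step of the optimization process.",
)
report.save()
```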