Updated scalability report (more comprehensive and easier to use) #221
I usually think of GPU utilization as the percentage of the GPU in use, as returned by nvidia-smi (or similar). This has no unit of measurement, but it could be converted into FLOPS knowing the GPU's peak FLOPS. On the other hand, what you measure with the profiler gives you a breakdown of the compute time (communication, ops, I/O), which is useful for studying scalability and finding bottlenecks (for when we'll want to take a more "active" attitude towards scalability). Wall-clock time gives a nice overview (e.g., average epoch time). Regarding the report, I would suggest having a look at the TensorBoard integration: https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html Also, other interesting references:
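To make the profiler suggestion concrete, here is a minimal sketch of the PyTorch profiler with the TensorBoard trace handler, following the linked tutorial. The model, step count, and log directory are stand-ins, not anything from itwinai:

```python
# Sketch: profile a few training steps and export a TensorBoard trace.
# Model/data are placeholders; add ProfilerActivity.CUDA on GPU nodes.
import torch
import torch.nn as nn
from torch.profiler import (
    ProfilerActivity,
    profile,
    schedule,
    tensorboard_trace_handler,
)

model = nn.Linear(512, 512)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

with profile(
    activities=[ProfilerActivity.CPU],
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=tensorboard_trace_handler("./log/profiler"),
) as prof:
    for step in range(5):
        x = torch.randn(64, 512)
        loss = model(x).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()  # tell the profiler a training step finished

# Print an aggregate breakdown of where time was spent.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

The exported trace can then be inspected with `tensorboard --logdir=./log/profiler`, which gives the per-operator/communication/I-O breakdown discussed above.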
Summary
The scalability report that you get from `itwinai scalability_report` shows the relative speedup between running a job on a single node vs. on multiple nodes. However, this is not very comprehensive, and it has some issues with the user experience.

Metrics that should be included
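For reference, the relative-speedup metric can be sketched with the standard definitions below (the helper names and the example epoch times are mine for illustration, not itwinai's actual API or measurements):

```python
def relative_speedup(single_node_time: float, multi_node_time: float) -> float:
    """Speedup of a multi-node run relative to the single-node baseline."""
    return single_node_time / multi_node_time


def scaling_efficiency(speedup: float, num_nodes: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 = perfect)."""
    return speedup / num_nodes


# Hypothetical average epoch times (seconds) per node count.
epoch_times = {1: 400.0, 2: 210.0, 4: 115.0}
baseline = epoch_times[1]
for nodes, t in epoch_times.items():
    s = relative_speedup(baseline, t)
    print(f"{nodes} node(s): speedup {s:.2f}x, "
          f"efficiency {scaling_efficiency(s, nodes):.0%}")
```

Reporting efficiency alongside raw speedup makes it easier to see how far each run is from ideal linear scaling.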
Other improvements