
Updated scalability report (more comprehensive and easier to use) #221

Open
jarlsondre opened this issue Sep 27, 2024 · 2 comments
Assignees
Labels
enhancement New feature or request

Comments

@jarlsondre
Collaborator

jarlsondre commented Sep 27, 2024

Summary

The scalability report that you get from itwinai scalability_report shows the relative speedup of running a job on a single node vs. on multiple nodes. However, it is not very comprehensive and has some user-experience issues.

Metrics that should be included

  • Throughput
    • FLOPS, samples/sec, or just total runtime (which is inversely proportional to throughput)
  • Communication Overhead
    • How much time is spent performing communication vs. actual computation
  • GPU Utilization
    • How much of the GPU is utilized, e.g. on average (as a score from 0% to 100%)
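A minimal sketch of how the throughput metric could be computed from wall-clock time (the function name and timing approach are my own, not existing itwinai API):

```python
import time


def measure_throughput(num_samples: int, run_fn) -> float:
    """Return throughput in samples/sec for a single run.

    `run_fn` is a zero-argument callable that executes the training loop
    (or the portion of it being measured).
    """
    start = time.perf_counter()
    run_fn()
    elapsed = time.perf_counter() - start
    return num_samples / elapsed
```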

Other improvements

  • Allowing the user to pass a folder without needing a RegEx pattern
    • We could still allow the use of RegEx, but if you know that a folder contains only the log files, then you should be able to specify just the folder without having to think about patterns. If you choose to pass a RegEx, it would then be applied to the files inside the given folder.
  • Perhaps a user interface such as Streamlit or TensorBoard?
    • The PyTorch profiler is supposed to have a TensorBoard integration, so it could be worth looking into.
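The folder-or-RegEx behaviour described above could look roughly like this sketch (the function name is hypothetical, not existing itwinai code):

```python
import re
from pathlib import Path
from typing import Optional


def collect_log_files(folder: str, pattern: Optional[str] = None) -> list:
    """Collect log files from `folder`.

    If `pattern` is given, it is applied as a RegEx to the file names
    inside that folder; otherwise every file in the folder is taken.
    """
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    if pattern is not None:
        regex = re.compile(pattern)
        files = [p for p in files if regex.search(p.name)]
    return sorted(files)
```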
@jarlsondre jarlsondre self-assigned this Sep 27, 2024
@jarlsondre jarlsondre added the enhancement New feature or request label Sep 27, 2024
@jarlsondre
Collaborator Author

jarlsondre commented Oct 8, 2024

Suggested solution

Based on the literature around Horovod, DDP and DeepSpeed (as well as a couple of others, like OneFlow), most papers seem to focus on throughput, measured as samples/sec or FLOPS, while a select few (e.g. Horovod) also put some emphasis on GPU utilization and on time spent on communication vs. computation. Since wall-clock time is inversely proportional to throughput and much easier to measure, I suggest we use wall-clock time.

Note: Any data displayed in the following plots will be completely fabricated by me, so don't read into the numbers.

Throughput

We measure the scalability of throughput in two ways:

  • Relative to time spent on a single node (or GPU), e.g.
    [image: relative speedup plot]
  • Absolute time, e.g.
    [image: absolute time plot]
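The relative speedup could be derived from measured wall-clock times roughly like this (a sketch with a hypothetical helper; the baseline is the smallest node count measured):

```python
def relative_speedup(times_by_nodes: dict) -> dict:
    """Map node count -> speedup relative to the smallest node count.

    `times_by_nodes` maps node count -> wall-clock time in seconds.
    """
    baseline_nodes = min(times_by_nodes)
    baseline = times_by_nodes[baseline_nodes]
    return {n: baseline / t for n, t in times_by_nodes.items()}
```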

Communication vs. Computation

We measure communication vs. computation as a score from 0 to 1, where 0 means all the time was spent on communication and 1 means that all the time was spent on computation. An example can be seen here:
[image: communication vs. computation score plot]
Note: The numbers 4, 8 and 16 refer to the number of GPUs in this plot. This is just a draft :)
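The 0-to-1 score itself is a simple ratio; a sketch (names are mine, and the zero-total case is an assumption on my part):

```python
def computation_fraction(comm_time: float, comp_time: float) -> float:
    """Score in [0, 1]: 0 means all time was spent on communication,
    1 means all time was spent on computation."""
    total = comm_time + comp_time
    if total == 0:
        # No work measured; treating this as fully compute is an assumption.
        return 1.0
    return comp_time / total
```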

GPU Utilization

Two key metrics:

  • GPU Utilization as a percentage of total utilization
    • Done to understand how efficient the strategy is
    • Will be measured using the same type of plot as with communication vs. computation
  • Absolute GPU usage in Watts and/or watt-hours
    • Done to measure environmental impact
    • Will be measured as an absolute number and compared between different configurations to give a more holistic picture of how well strategies scale.
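Assuming utilization and power samples are polled at a fixed interval (e.g. from nvidia-smi or pynvml), aggregating them into the two metrics above could look like this sketch (helper name and return format are hypothetical):

```python
def summarize_gpu_samples(samples: list, interval_s: float) -> dict:
    """Aggregate (utilization %, power W) samples taken every `interval_s`
    seconds into average utilization and total energy in watt-hours."""
    if not samples:
        return {"avg_util_pct": 0.0, "energy_wh": 0.0}
    utils = [u for u, _ in samples]
    watts = [w for _, w in samples]
    avg_util = sum(utils) / len(utils)
    # Energy = sum over samples of (power * sample duration), in Wh.
    energy_wh = sum(w * interval_s for w in watts) / 3600.0
    return {"avg_util_pct": avg_util, "energy_wh": energy_wh}
```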

@matbun
Collaborator

matbun commented Oct 8, 2024

I usually think of GPU utilization as the % of the GPU in use, as returned by nvidia-smi (or similar). It has no unit of measurement, but could be converted into FLOPS knowing the GPU's peak FLOPS.
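That conversion is a one-liner, though only a coarse approximation, since the utilization % reported by nvidia-smi measures how often kernels are resident rather than arithmetic throughput:

```python
def utilization_to_flops(util_pct: float, peak_flops: float) -> float:
    """Rough achieved FLOPS from a utilization percentage and the GPU's
    theoretical peak FLOPS (coarse approximation, see caveat above)."""
    return (util_pct / 100.0) * peak_flops
```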

On the other hand, what you measure with the profiler gives you a breakdown of the compute time (communication, ops, I/O), which is useful for studying scalability and finding bottlenecks (for when we'll want to take a more "active" attitude towards scalability).

Wall clock time gives a nice overview (e.g., avg epoch time).

Regarding the report, I would suggest having a look at the tensorboard integration: https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html
Maybe it is not useful for us, but worth trying.
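A minimal sketch of that integration, assuming PyTorch is installed (the wrapper function and the log directory name are my own choices, not from the tutorial):

```python
import torch
from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler


def run_profiled(model, batch, logdir="./tb_logs"):
    """Run one forward pass under the profiler; the trace handler writes
    files that TensorBoard's profiler plugin can display."""
    with profile(
        activities=[ProfilerActivity.CPU],
        on_trace_ready=tensorboard_trace_handler(logdir),
    ):
        return model(batch)
```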

Also, other interesting references:
