
Updated scalability report (more comprehensive and easier to use) #221

Open
jarlsondre opened this issue Sep 27, 2024 · 2 comments
Assignees
Labels
enhancement New feature or request

Comments

@jarlsondre
Collaborator

jarlsondre commented Sep 27, 2024

Summary

The scalability report that you get from itwinai scalability_report shows the relative speedup of running a job on a single node vs. on multiple nodes. However, it is not very comprehensive and has some user-experience issues.

Metrics that should be included

  • Throughput
    • FLOPS, samples/sec, or just total runtime (which is inversely proportional to throughput)
  • Communication Overhead
    • How much time is spent performing communication vs. actual computation
  • GPU Utilization
    • How much of the GPU is utilized, e.g. on average (as a score from 0% to 100%)
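A minimal sketch of how the throughput metric could be computed from wall-clock time (the function name and timing approach are my own, not existing itwinai API):

```python
import time


def measure_throughput(num_samples: int, run_fn) -> float:
    """Return throughput in samples/sec for a single run.

    `run_fn` is a zero-argument callable that executes the training loop
    (or the portion of it being measured).
    """
    start = time.perf_counter()
    run_fn()
    elapsed = time.perf_counter() - start
    return num_samples / elapsed
```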

Other improvements

  • Allowing the user to pass a folder without needing a RegEx pattern
    • We could still allow the use of RegEx, but if you know that a folder contains only the log files, then you should be able to specify just the folder without having to think about patterns. If you choose to pass a RegEx, it would then be applied to the files inside the given folder.
  • Perhaps a user interface such as Streamlit or TensorBoard?
    • The PyTorch profiler is supposed to have a TensorBoard integration, so it could be worth looking into.
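The folder-or-RegEx behaviour described above could look roughly like this sketch (the function name is hypothetical, not existing itwinai code):

```python
import re
from pathlib import Path
from typing import Optional


def collect_log_files(folder: str, pattern: Optional[str] = None) -> list:
    """Collect log files from `folder`.

    If `pattern` is given, it is applied as a RegEx to the file names
    inside that folder; otherwise every file in the folder is taken.
    """
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    if pattern is not None:
        regex = re.compile(pattern)
        files = [p for p in files if regex.search(p.name)]
    return sorted(files)
```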
@jarlsondre jarlsondre self-assigned this Sep 27, 2024
@jarlsondre jarlsondre added the enhancement New feature or request label Sep 27, 2024
@jarlsondre
Collaborator Author

jarlsondre commented Oct 8, 2024

Suggested solution

Based on the literature around Horovod, DDP and DeepSpeed (as well as a couple of others, like OneFlow), most papers seem to focus on throughput, measured as samples/sec or FLOPS, while a select few (e.g. Horovod) also put some emphasis on GPU utilization and on time spent on communication vs. computation. Since wall-clock time is inversely proportional to throughput and much easier to measure, I suggest we use wall-clock time.

Note: Any data displayed in the following plots will be completely fabricated by me, so don't read into the numbers.

Throughput

We measure the scalability of throughput in two ways:

  • Relative to time spent on a single node (or GPU), e.g.
    [image: relative speedup plot]
  • Absolute time, e.g.
    [image: absolute time plot]
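The relative speedup could be derived from measured wall-clock times roughly like this (a sketch with a hypothetical helper; the baseline is the smallest node count measured):

```python
def relative_speedup(times_by_nodes: dict) -> dict:
    """Map node count -> speedup relative to the smallest node count.

    `times_by_nodes` maps node count -> wall-clock time in seconds.
    """
    baseline_nodes = min(times_by_nodes)
    baseline = times_by_nodes[baseline_nodes]
    return {n: baseline / t for n, t in times_by_nodes.items()}
```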

Communication vs. Computation

We measure communication vs. computation as a score from 0 to 1, where 0 means all the time was spent on communication and 1 means that all the time was spent on computation. An example can be seen here:
[image: communication vs. computation score plot]
Note: The numbers 4, 8 and 16 refer to the number of GPUs in this plot. This is just a draft :)
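The 0-to-1 score itself is a simple ratio; a sketch (names are mine, and the zero-total case is an assumption on my part):

```python
def computation_fraction(comm_time: float, comp_time: float) -> float:
    """Score in [0, 1]: 0 means all time was spent on communication,
    1 means all time was spent on computation."""
    total = comm_time + comp_time
    if total == 0:
        # No work measured; treating this as fully compute is an assumption.
        return 1.0
    return comp_time / total
```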

GPU Utilization

Two key metrics:

  • GPU Utilization as a percentage of total utilization
    • Done to understand how efficient the strategy is
    • Will be measured using the same type of plot as with communication vs. computation
  • Absolute GPU usage in Watts and/or watt-hours
    • Done to measure environmental impact
    • Will be measured as an absolute number and compared between different configurations to give a more holistic picture of how well strategies scale.
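Assuming utilization and power samples are polled at a fixed interval (e.g. from nvidia-smi or pynvml), aggregating them into the two metrics above could look like this sketch (helper name and return format are hypothetical):

```python
def summarize_gpu_samples(samples: list, interval_s: float) -> dict:
    """Aggregate (utilization %, power W) samples taken every `interval_s`
    seconds into average utilization and total energy in watt-hours."""
    if not samples:
        return {"avg_util_pct": 0.0, "energy_wh": 0.0}
    utils = [u for u, _ in samples]
    watts = [w for _, w in samples]
    avg_util = sum(utils) / len(utils)
    # Energy = sum over samples of (power * sample duration), in Wh.
    energy_wh = sum(w * interval_s for w in watts) / 3600.0
    return {"avg_util_pct": avg_util, "energy_wh": energy_wh}
```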

@matbun
Collaborator

matbun commented Oct 8, 2024

I usually think of GPU utilization as the % of the GPU in use, as returned by nvidia-smi (or similar). It has no unit of measurement, but could be converted into FLOPS knowing the GPU's peak FLOPS.
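That conversion is a one-liner, though only a coarse approximation, since the utilization % reported by nvidia-smi measures how often kernels are resident rather than arithmetic throughput:

```python
def utilization_to_flops(util_pct: float, peak_flops: float) -> float:
    """Rough achieved FLOPS from a utilization percentage and the GPU's
    theoretical peak FLOPS (coarse approximation, see caveat above)."""
    return (util_pct / 100.0) * peak_flops
```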

On the other hand, what you measure with the profiler gives you a breakdown of the compute time (communication, ops, I/O), which is useful for studying scalability and finding bottlenecks (for when we'll want to take a more "active" attitude towards scalability).

Wall clock time gives a nice overview (e.g., avg epoch time).

Regarding the report, I would suggest having a look at the tensorboard integration: https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html
Maybe it is not useful for us, but worth trying.
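A minimal sketch of that integration, assuming PyTorch is installed (the wrapper function and the log directory name are my own choices, not from the tutorial):

```python
import torch
from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler


def run_profiled(model, batch, logdir="./tb_logs"):
    """Run one forward pass under the profiler; the trace handler writes
    files that TensorBoard's profiler plugin can display."""
    with profile(
        activities=[ProfilerActivity.CPU],
        on_trace_ready=tensorboard_trace_handler(logdir),
    ):
        return model(batch)
```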

Also, other interesting references:
