
Question about percentage of FP32 peak we get on H100 and A100 for miniBUDE #38

colleeneb opened this issue Sep 17, 2024 · 5 comments

@colleeneb (Contributor)
Hello,

We were comparing the percentage of FP32 peak we get on H100 and A100 for miniBUDE. With the big5 input (and similarly for the bm_long input) we've been seeing results like:

| GPU  | Theoretical FP32 peak (TFLOP/s) | miniBUDE (TFLOP/s) | Percent of peak |
|------|---------------------------------|--------------------|-----------------|
| H100 | 67                              | 20                 | 30%             |
| A100 | 19.5                            | 12                 | 62%             |

We were expecting similar percentages of peak on both the A100 and H100 if the input size was big enough to saturate the GPU. Our best guess right now is that the big5 / bm_long inputs aren't big enough to saturate the H100, as it's much larger than the A100. Does this match your understanding? If so, are there any suggestions for a bigger input?

Thanks!

@tom91136 (Member)

tom91136 commented Oct 6, 2024

Hi,

I'm currently investigating this.
Preliminary benchmarking suggests that occupancy is lower due to high register pressure.
Controlled spilling via `__launch_bounds__` gives some minor improvement, but throughput is still in the range of 21-22 TFLOP/s.

Now, my understanding is that the H100 has a higher SM count, but the register file size per SM is the same as the A100.
A brief look at the core frequency improvement from A100 to H100 gives 1665 MHz (SXM 96GB) / 1275 MHz (SXM 80GB) = 1.3, which seems to be in the ballpark if we factor in the additional SMs at a lower occupancy.
Memory improvements are negligible, as we're talking about single-digit MBs worth of input and KBs worth of writes at the end.

CC @tomdeakin @addy419

tom91136 reopened this Oct 6, 2024

@tom91136 (Member)

tom91136 commented Oct 6, 2024

Please do let me know if you're able to get much higher than the original figure.

@tom91136 (Member)

tom91136 commented Oct 6, 2024

For reference, see https://resources.nvidia.com/en-us-tensor-core, Table 4 on page 41: "Ratio of SM Registers to FP32 Cores".

@colleeneb (Contributor, Author)

Thanks a lot for taking a look!

If I understand correctly, the issue is that the H100 becomes register-bound due to fewer registers per FP32 core (Table 4 from the link), so occupancy is lower and thus we can't hit the same fraction of peak as before.

@tom91136 (Member)

Yes, NVIDIA doubled the FP32 unit count, but the register file size remained the same, which is where I suspect the bottleneck lies.
We're currently trying to set something up with NVIDIA to see if they can take a look.
