Question about percentage of FP32 peak we get on H100 and A100 for miniBUDE #38
Hello,

We were comparing the percentage of FP32 peak we get on H100 and A100 for miniBUDE. With the `big5` input (and similarly for the `bm_long` input) we've been seeing the H100 reach a noticeably lower fraction of its FP32 peak than the A100 does. We were expecting a similar percentage of peak on both A100 and H100 if the input size was big enough to saturate the GPU. Our best guess right now is that the `big5`/`bm_long` inputs aren't big enough to saturate the H100, as it's a much larger GPU than the A100. Does this match your understanding? If so, are there any suggestions for a bigger input?

Thanks!
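For reference, the comparison boils down to the usual percent-of-peak arithmetic. Below is a minimal sketch (not taken from miniBUDE) that assumes the published SXM specs for both parts and counts an FMA as 2 FLOPs; the measured GFLOP/s values are placeholders to be filled in from the benchmark's reported figures.

```cpp
// Minimal sketch of the percent-of-peak arithmetic (not part of miniBUDE).
// FP32 peak = SMs x FP32 lanes per SM x 2 FLOPs per FMA x boost clock.
// The measured GFLOP/s values are placeholders: fill them in from the
// GFLOP/s figure reported by the benchmark run.
#include <cstdio>

struct Gpu { const char *name; int sms; int fp32PerSm; double boostGhz; };

int main() {
  // Published SXM specs: A100 (108 SMs, 64 FP32 lanes/SM, ~1.41 GHz boost),
  // H100 SXM5 (132 SMs, 128 FP32 lanes/SM, ~1.98 GHz boost).
  const Gpu gpus[] = {{"A100 SXM", 108, 64, 1.41}, {"H100 SXM", 132, 128, 1.98}};
  const double measuredGflops[] = {0.0 /* placeholder */, 0.0 /* placeholder */};

  for (int i = 0; i < 2; ++i) {
    const double peakGflops = gpus[i].sms * gpus[i].fp32PerSm * 2.0 * gpus[i].boostGhz;
    std::printf("%s: FP32 peak %.0f GFLOP/s, achieved %.1f%% of peak\n",
                gpus[i].name, peakGflops, 100.0 * measuredGflops[i] / peakGflops);
  }
  return 0;
}
```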
Comments

Hi, I'm currently investigating this. Now, my understanding for H100 is that it has a higher SM count, but the register file size per SM is the same as on the A100. Please do let me know if you're able to get much higher than the original figure. For reference, see Table 4 on page 41 of https://resources.nvidia.com/en-us-tensor-core.

Thanks a lot for taking a look! If I understand correctly, the issue is that on H100 the kernel becomes register-bound because there are fewer registers per FP32 core (Table 4 from the link), so occupancy is lower and we can't hit the same percentage of peak as before.

Yes, NVIDIA doubled the FP32 unit count but the register file size remained the same, which is where I suspect the bottleneck lies.
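One way to check the register-pressure explanation directly is to ask the CUDA occupancy API how many blocks the register budget allows per SM. The sketch below is illustrative only: `kernel_under_test` and the 256-thread block size are hypothetical stand-ins for the real docking kernel and its actual launch configuration.

```cpp
// Minimal sketch (not part of miniBUDE): query register usage and the
// register-limited occupancy of a kernel via the CUDA occupancy API.
// `kernel_under_test` and the block size of 256 are stand-ins; point the
// calls at the real kernel and its actual launch configuration.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel_under_test() { /* stand-in for the real kernel */ }

int main() {
  const int blockSize = 256;  // assumed launch configuration

  cudaFuncAttributes attr{};
  cudaFuncGetAttributes(&attr, kernel_under_test);

  int blocksPerSm = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, kernel_under_test,
                                                blockSize, /*dynamicSMemSize=*/0);

  cudaDeviceProp prop{};
  cudaGetDeviceProperties(&prop, 0);

  std::printf("registers per thread   : %d\n", attr.numRegs);
  std::printf("register file per SM   : %d 32-bit registers\n", prop.regsPerMultiprocessor);
  std::printf("resident blocks per SM : %d\n", blocksPerSm);
  std::printf("occupancy              : %.0f%% of %d threads/SM\n",
              100.0 * blocksPerSm * blockSize / prop.maxThreadsPerMultiProcessor,
              prop.maxThreadsPerMultiProcessor);
  return 0;
}
```

If `numRegs * blockSize * blocksPerSm` is close to `regsPerMultiprocessor`, the kernel is register-limited, which would match the explanation above; Nsight Compute's occupancy section reports the same limiter.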