
Question about percentage of FP32 peak we get on H100 and A100 for miniBUDE #38

colleeneb opened this issue Sep 17, 2024 · 5 comments

@colleeneb (Contributor)
Hello,

We were comparing the percentage of FP32 peak we get on H100 and A100 for miniBUDE. With the big5 input (and similarly for the bm_long input) we've been seeing results like:

| GPU  | Theoretical FP32 peak (TFLOP/s) | miniBUDE (TFLOP/s) | Percent of peak |
|------|---------------------------------|--------------------|-----------------|
| H100 | 67                              | 20                 | 30%             |
| A100 | 19.5                            | 12                 | 62%             |

We were expecting similar percentages of peak on both the A100 and H100 if the input size was big enough to saturate the GPU. Our best guess right now is that the big5 / bm_long inputs aren't big enough to saturate the H100, as it's much larger than the A100. Does this match your understanding? If so, are there any suggestions for a bigger input?

Thanks!

@tom91136 (Member)

tom91136 commented Oct 6, 2024

Hi,

I'm currently investigating this.
Preliminary benchmarking suggests that occupancy is lower due to high register pressure.
Controlled spilling via `__launch_bounds__` gives some minor improvement, but throughput is still in the range of 21-22 TFLOP/s.

Now, my understanding is that the H100 has a higher SM count, but the register file size per SM is the same as the A100.
A brief look at the core frequency improvement from A100 to H100 gives 1665 MHz (SXM 96GB) / 1275 MHz (SXM 80GB) = 1.3, which seems to be in the ballpark if we factor in the additional SMs at a lower occupancy.
Memory improvements are negligible, as we're talking about single-digit MBs worth of input and KBs worth of writes at the end.

CC @tomdeakin @addy419

tom91136 reopened this Oct 6, 2024

@tom91136 (Member)

tom91136 commented Oct 6, 2024

Please do let me know if you're able to get much higher than the original figure.

@tom91136 (Member)

tom91136 commented Oct 6, 2024

For reference, see https://resources.nvidia.com/en-us-tensor-core, Table 4 on page 41: "Ratio of SM Registers to FP32 Cores".

@colleeneb (Contributor, Author)

Thanks a lot for taking a look!

If I understand correctly, the issue is that the H100 becomes register-bound due to fewer registers per FP32 core (Table 4 from the link), so occupancy is lower and thus we can't hit the same fraction of peak as before.

@tom91136 (Member)

Yes, NVIDIA doubled the FP32 unit count, but the register file size remained the same, which is where I suspect the bottleneck lies.
We're currently trying to set something up with NVIDIA to see if they can take a look.
