Support Nvidia Hopper GPUs #27

Draft · giordano wants to merge 1 commit into main from mg/hopper

Conversation

@giordano commented on Jan 13, 2024

This is an initial attempt to support Nvidia Hopper GPUs. I'm opening it as a draft because lots of things still don't work. For example, the theoretical peakflops for tensor cores is wrong: the formula used for the A100 doesn't seem to apply to Hopper. I tried to adapt it based on figures 10-11 of https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper, but on a GH200 I get:

julia> for tensorcores in (false, true), dtype in (Float64, Float32, Float16, Int8)
           dtype in (Int8, Float16) && !tensorcores || theoretical_peakflops_gpu(; dtype, tensorcores)
       end
Theoretical Peakflops (TFLOP/s):
 ├ tensorcores: false
 ├ dtype: Float64
 └ max: 33.5
Theoretical Peakflops (TFLOP/s):
 ├ tensorcores: false
 ├ dtype: Float32
 └ max: 66.9
Theoretical Peakflops (TFLOP/s):
 ├ tensorcores: true
 ├ dtype: Float64
 └ max: 66.9
Theoretical Peakflops (TFLOP/s):
 ├ tensorcores: true
 ├ dtype: Float32
 └ max: 535.3
Theoretical Peakflops (TFLOP/s):
 ├ tensorcores: true
 ├ dtype: Float16
 └ max: 1070.5
Theoretical Peakflops (TOP/s):
 ├ tensorcores: true
 ├ dtype: Int8
 └ max: 4282.1

The values for Float64 (with and without tensor cores) and Float32 (without tensor cores) are good, but all the other tensor-core peakflops are wrong according to the "H100 SXM5" column of table 2 in the document above: it should be 494.7 TFLOP/s for Float32, 989.4 for Float16, and 1978.9 for Int8. (https://resources.nvidia.com/en-us-grace-cpu/grace-hopper-superchip, which is specific to the GH200, agrees with those numbers, just with fewer significant digits, as they are rounded to integers.)
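
For reference, here is a rough sketch (not part of this PR) of how the dense tensor-core peaks in table 2 can be reconstructed from per-SM throughput. The SM/tensor-core counts and the per-tensor-core ops per cycle below are my assumptions from the whitepaper figures, and the clock is simply the value that makes the table-2 numbers come out exactly:

nsm, ntc_per_sm = 132, 4        # assumed SM and tensor-core counts for H100 SXM5
clock_ghz = 1.83                # clock that reproduces the table-2 values exactly
# assumed dense (FL)OPs per tensor core per cycle: 512 (TF32), 1024 (Float16), 2048 (Int8)
for (dtype, ops) in (:Float32 => 512, :Float16 => 1024, :Int8 => 2048)
    peak = nsm * ntc_per_sm * ops * clock_ghz / 1e3   # GHz / 1e3 → T(FL)OP/s
    println(dtype, " => ", round(peak; digits = 1))   # 494.7, 989.4, 1978.9
end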

if Symbol(dtype) == :Float16
# matrix dimensions 8x8x4, factor 2 for nflops in A*B+C
# see e.g. https://peerj.com/articles/cs-330.pdf
@giordano (Author):

Note: I replaced this link with the DOI because the link is now broken.
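
For context, this is the A100-style counting that the code comment above refers to, written out; the shapes are the ones named in the comment, and the A100 figures used for the sanity check are assumptions on my part:

flops_per_mma = 2 * 8 * 8 * 4    # 8x8x4 MMA, factor 2 for the multiply-add in A*B + C → 512 FLOPs
# Sanity check against the A100: assumed 108 SMs, 4 tensor cores per SM, 1.41 GHz,
# and one such MMA per tensor core per cycle.
peak_a100 = 108 * 4 * flops_per_mma * 1.41 / 1e3   # ≈ 311.9, close to the official 312 TFLOP/s FP16 dense peak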

@giordano force-pushed the mg/hopper branch 2 times, most recently from 266b15a to d567774 on January 13, 2024 at 18:09
@giordano (Author) commented:

Trying to measure peakflops, I get

julia> for tensorcores in (false, true), dtype in (Float64, Float32, Float16, Int8)
           (dtype in (Int8, Float16) && !tensorcores) || (dtype in (Float64, Float32) && tensorcores) || GPUInspector.peakflops_gpu(; dtype, tensorcores)
       end
Peakflops (TFLOP/s):
 ├ tensorcores: false
 ├ dtype: Float64
 └ max: 22.3
Peakflops (TFLOP/s):
 ├ tensorcores: false
 ├ dtype: Float32
 └ max: 32.3
Peakflops (TFLOP/s):
 ├ tensorcores: true
 ├ dtype: Float16
 └ max: 633.2
Peakflops (TOP/s):
 ├ tensorcores: true
 ├ dtype: Int8
 └ max: 940.9

These results are quite far from the theoretical peaks, about 50% lower. Is there anything to tweak in the kernels for a new architecture?
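
As an independent sanity check (not using GPUInspector's kernels), one could time a large Float16 GEMM through cuBLAS, which should go through the tensor cores; this is only a sketch, and the matrix size and repetition count below are arbitrary choices of mine:

using CUDA, LinearAlgebra

# Time a large Float16 GEMM via cuBLAS and convert the result to TFLOP/s.
function gemm_tflops(; n = 8192, nrep = 10, dtype = Float16)
    A = CuArray(rand(dtype, n, n))
    B = CuArray(rand(dtype, n, n))
    C = CUDA.zeros(dtype, n, n)
    mul!(C, A, B)                     # warm-up (compilation, cuBLAS heuristics)
    t = CUDA.@elapsed for _ in 1:nrep
        mul!(C, A, B)
    end
    return 2 * n^3 * nrep / t / 1e12  # 2n^3 FLOPs per GEMM → TFLOP/s
end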

@giordano (Author) left a comment:

Interestingly enough, all the theoretical tensor-core peakflops for Float32, Float16, and Int8 are off by about 8%:

julia> 535.3 / 494.7
1.0820699413786132

julia> 1070.5 / 989.4
1.0819688700222356

julia> (4282.1 / 2) / 1978.9
1.0819394613168933

but I have no clue where this factor comes from.
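
One purely numerical observation, which is speculation on my part rather than anything stated in the whitepaper: the factor is very close to the ratio of two clocks. A ~1.83 GHz clock reproduces the table-2 tensor numbers exactly with the per-SM throughputs sketched earlier, while ~1.98 GHz is roughly the H100 SXM boost clock:

julia> round(1980 / 1830; digits = 4)   # assumed ~1.98 GHz boost clock over the ~1.83 GHz clock that reproduces table 2
1.082

julia> 132 * 4 * 1024 * 1830 / 1e6   # SMs * TCs/SM * FP16 FLOPs/cycle/TC * MHz / 1e6 → TFLOP/s
989.42976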

elseif Symbol(dtype) == :Float64
max_peakflops *= 2 * 4 * 4 * 2
elseif Symbol(dtype) == :Int8
max_peakflops *= 2 * 2 * 32 * 8 * 4 # XXX: Wrong result!
@giordano (Author):

Maybe there's an extra factor of 2 in this formula, but I based this on the Int8 calculation below
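
Just to spell out the arithmetic of that suspicion with the numbers quoted above: removing one factor of 2 from the Int8 multiplier would halve the reported 4282.1 TOP/s, leaving exactly the same ~8% discrepancy as the other dtypes:

julia> 4282.1 / 2
2141.05

julia> 2141.05 / 1978.9   # same ~1.082 factor as Float32 and Float16 above
1.0819394613168933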
