Tracking perf optimization of HopperMatmulTest.HSH_NT_128BSwizzle for problem size (M=2048, N=2048, K=8192), CTA tile size (128, 256) #3279
Comments
Initial perf as measured in #3281:
nvFuser/cuBLAS = |
zasdfgbnm added a commit that referenced this issue Oct 26, 2024
This shape makes more sense: #3137 (comment), #3279

Perf:
```
 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     43.2           205150          1  205150.0  205150.0    205150    205150          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     18.5            87550          1   87550.0   87550.0     87550     87550          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT
```
nvFuser/cuBLAS = `42.7%`
There is a perf regression after the fix of elect-sync:

Perf:

| Time (%) | Total Time (ns) | Instances | Avg (ns) | Med (ns) | Min (ns) | Max (ns) | StdDev (ns) | Name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 47.8 | 247326 | 1 | 247326.0 | 247326.0 | 247326 | 247326 | 0.0 | `<unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…` |
| 17.0 | 88191 | 1 | 88191.0 | 88191.0 | 88191 | 88191 | 0.0 | `nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT` |

Perf nvFuser/cuBLAS:
After #3294:

Perf:

| Time (%) | Total Time (ns) | Instances | Avg (ns) | Med (ns) | Min (ns) | Max (ns) | StdDev (ns) | Name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 39.0 | 172735 | 1 | 172735.0 | 172735.0 | 172735 | 172735 | 0.0 | `<unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…` |
| 20.0 | 88768 | 1 | 88768.0 | 88768.0 | 88768 | 88768 | 0.0 | `nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT` |

Perf nvFuser/cuBLAS:
After #3314:

Perf:

| Time (%) | Total Time (ns) | Instances | Avg (ns) | Med (ns) | Min (ns) | Max (ns) | StdDev (ns) | Name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 36.0 | 151775 | 1 | 151775.0 | 151775.0 | 151775 | 151775 | 0.0 | `<unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…` |
| 20.7 | 87135 | 1 | 87135.0 | 87135.0 | 87135 | 87135 | 0.0 | `nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT` |

nvFuser/cuBLAS =
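For readers comparing the measurements in this thread, here is a small sketch of how the nvFuser/cuBLAS figure appears to be derived. It assumes the ratio is the cuBLAS kernel time divided by the nvFuser kernel time; that reading is an inference, not something the thread states, but it reproduces the quoted `42.7%` and `57.4%`. The row labels and the helper name `relative_perf` are made up for illustration.

```python
# Kernel times in nanoseconds, copied from the "Total Time (ns)" column of the
# profiler tables in this thread: (label, nvFuser kernel, cuBLAS kernel).
MEASUREMENTS = [
    ("commit of Oct 26, 2024", 205150, 87550),
    ("after elect-sync fix", 247326, 88191),
    ("after #3294", 172735, 88768),
    ("after #3314", 151775, 87135),
]


def relative_perf(nvfuser_ns: float, cublas_ns: float) -> float:
    """Return nvFuser throughput as a percentage of cuBLAS throughput.

    Assumption: for a fixed problem size, throughput is inversely
    proportional to kernel time, so the ratio is cublas_ns / nvfuser_ns.
    """
    return 100.0 * cublas_ns / nvfuser_ns


for label, nvfuser_ns, cublas_ns in MEASUREMENTS:
    print(f"{label:>24}: nvFuser/cuBLAS = {relative_perf(nvfuser_ns, cublas_ns):.1f}%")
```

Under that assumption, the first and last rows come out to 42.7% and 57.4%, matching the numbers quoted in the thread.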
A CTA tile size of (128, 256) can reach high math throughput relatively easily. The problem size is carefully selected to be one full wave. I believe this is a good incremental task.

Benchmark command:
Current perf on H200 on main, as in the latest comment:

Perf: nvFuser/cuBLAS = `57.4%`.
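The "one full wave" remark in the description can be checked with a little arithmetic. The sketch below counts the CTAs needed to tile the output and compares that with the number of SMs. Two things here are assumptions and not taken from the issue: that the H200 exposes 132 SMs, and that each SM runs one CTA of this tile size at a time. The helper names are illustrative.

```python
import math

# Problem and tile sizes from the issue title.
M, N, K = 2048, 2048, 8192
TILE_M, TILE_N = 128, 256

# Assumptions (not stated in the issue): SM count of the GPU and
# how many CTAs of this tile size are resident on one SM at a time.
NUM_SMS = 132
CTAS_PER_SM = 1


def cta_grid(m: int, n: int, tile_m: int, tile_n: int) -> tuple[int, int]:
    """Number of CTA tiles along M and N needed to cover the output matrix."""
    return math.ceil(m / tile_m), math.ceil(n / tile_n)


def waves(num_ctas: int, num_sms: int, ctas_per_sm: int = 1) -> float:
    """CTAs divided by the number of CTAs the GPU can run concurrently."""
    return num_ctas / (num_sms * ctas_per_sm)


grid_m, grid_n = cta_grid(M, N, TILE_M, TILE_N)
num_ctas = grid_m * grid_n
print(f"grid = {grid_m} x {grid_n} = {num_ctas} CTAs")
print(f"waves = {waves(num_ctas, NUM_SMS, CTAS_PER_SM):.2f}")
```

With these numbers the output needs 16 x 8 = 128 CTAs, which fits in a single wave on a 132-SM device and leaves only a few SMs idle.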