Tracking perf optimization of HopperMatmulTest.HSH_NT_128BSwizzle for problem size (M=2048, N=2048, K=8192), CTA tile size (128, 256) #3279

Open · zasdfgbnm opened this issue Oct 25, 2024 · 4 comments
Labels: H100, Perf (improve performance on H100), Matmuls

zasdfgbnm (Collaborator) commented Oct 25, 2024

The CTA tile size (128, 256) is a tile size that can achieve high math throughput relatively easily. The problem size is carefully selected so that the grid forms exactly one full wave. I believe this is a good incremental task.
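As a quick sanity check of the one-full-wave claim: the grid for this problem/tile combination is 16 × 8 = 128 CTAs, which fits in a single wave on a 132-SM H100/H200. The 132-SM count and one-CTA-per-SM occupancy are assumptions here; the snippet below is just illustrative arithmetic, not nvFuser code.

```python
# Rough wave-count arithmetic for the chosen problem/tile sizes (illustrative only).
# Assumes 132 SMs (H100/H200 SXM) and one CTA resident per SM.
import math

M, N = 2048, 2048
cta_m, cta_n = 128, 256   # which tile dim maps to M vs. N does not change the CTA count
num_sms = 132             # assumed SM count

ctas = math.ceil(M / cta_m) * math.ceil(N / cta_n)  # 16 * 8 = 128 CTAs
waves = ctas / num_sms                              # ~0.97, i.e. a single full wave

print(f"{ctas} CTAs on {num_sms} SMs -> {waves:.2f} waves")
```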

Benchmark command:

```
nsys nvprof ./bin/test_matmul --gtest_filter=HopperMatmulTest.HSH_NT_128BSwizzle
```

Current perf on H200 on main (same numbers as in the latest comment below):

Perf:

```
 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     36.0           151775          1  151775.0  151775.0    151775    151775          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     20.7            87135          1   87135.0   87135.0     87135     87135          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT
```

nvFuser/cuBLAS = 57.4%.
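For clarity, the ratio here and in the comments below is simply the cuBLAS kernel time divided by the nvFuser kernel time from the nsys "Total Time (ns)" column. A minimal sketch of that arithmetic:

```python
# nvFuser/cuBLAS ratio from the nsys "Total Time (ns)" column above.
nvfuser_ns = 151775  # nvfuser_none_f0_c0_r0_g0
cublas_ns = 87135    # nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT

ratio = cublas_ns / nvfuser_ns
print(f"nvFuser/cuBLAS = {ratio:.1%}")  # -> 57.4%
```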

zasdfgbnm (Collaborator, Author) commented:

Initial perf as measured in #3281:

```
 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     43.2           205150          1  205150.0  205150.0    205150    205150          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     18.5            87550          1   87550.0   87550.0     87550     87550          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT
```

nvFuser/cuBLAS = 42.7%

zasdfgbnm added a commit that referenced this issue Oct 26, 2024
This shape makes more sense:
#3137 (comment),
#3279

Perf:
```
 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name

 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     43.2           205150          1  205150.0  205150.0    205150    205150          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     18.5            87550          1   87550.0   87550.0     87550     87550          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT
```

nvFuser/cuBLAS = `42.7%`
zasdfgbnm (Collaborator, Author) commented:

There is a perf regression after the elect-sync fix (#3295):

Perf:

```
 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     47.8           247326          1  247326.0  247326.0    247326    247326          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     17.0            88191          1   88191.0   88191.0     88191     88191          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT
```

Perf nvFuser/cuBLAS: 35.6%

zasdfgbnm (Collaborator, Author) commented:

After #3294:

Perf:

```
 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     39.0           172735          1  172735.0  172735.0    172735    172735          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     20.0            88768          1   88768.0   88768.0     88768     88768          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT
```

Perf nvFuser/cuBLAS: 51.4%.

zasdfgbnm (Collaborator, Author) commented Oct 31, 2024

After #3314:

Perf:

```
 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     36.0           151775          1  151775.0  151775.0    151775    151775          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     20.7            87135          1   87135.0   87135.0     87135     87135          0.0  nvjet_hsh_256x128_64x4_1x2_h_bz_coopA_NTT
```

nvFuser/cuBLAS = 57.4%.
