Description
Running the CUDA benchmarks from the HPCBenchmarks.jl tests shows a significant performance drop when using KernelAbstractions.jl (KA) with a dynamic range definition. The tests below were performed on a GH200 using a local CUDA 12.4 install and Julia 1.10.2.
- Using a dynamic range (`ndrange` passed at launch), as implemented in the benchmark https://github.com/PTsolvers/HPCBenchmarks.jl/blob/a5985aaaf931efb0caf194d669e3bfcb90c5c08e/CUDA/diffusion_3d.jl#L39:
diffusion_kernel_ka!(CUDABackend(), 256)($A_new, $A, $h; ndrange=($n, $n, $n))
runs about 1.65× slower than plain CUDA.jl and up to about 1.9× slower than the reference CUDA C, i.e. a 40-50% performance drop:
[ Info: diffusion 3D
[ Info: N = 256
3-element BenchmarkTools.BenchmarkGroup:
tags: []
"julia" => Trial(104.865 μs)
"reference" => Trial(92.161 μs)
"julia-ka" => Trial(173.473 μs)
[ Info: N = 512
3-element BenchmarkTools.BenchmarkGroup:
tags: []
"julia" => Trial(771.301 μs)
"reference" => Trial(672.581 μs)
"julia-ka" => Trial(1.299 ms)
[ Info: N = 1024
3-element BenchmarkTools.BenchmarkGroup:
tags: []
"julia" => Trial(6.251 ms)
"reference" => Trial(5.833 ms)
"julia-ka" => Trial(10.285 ms)
- Modifying it to use a static range definition instead (`ndrange` passed when the kernel is instantiated):
diffusion_kernel_ka!(CUDABackend(), 256, ($n, $n, $n))($A_new, $A, $h)
brings the KA timings in line with plain CUDA.jl:
[ Info: diffusion 3D
[ Info: N = 256
3-element BenchmarkTools.BenchmarkGroup:
tags: []
"julia" => Trial(104.993 μs)
"reference" => Trial(92.416 μs)
"julia-ka" => Trial(103.649 μs)
[ Info: N = 512
3-element BenchmarkTools.BenchmarkGroup:
tags: []
"julia" => Trial(770.790 μs)
"reference" => Trial(672.037 μs)
"julia-ka" => Trial(769.701 μs)
[ Info: N = 1024
3-element BenchmarkTools.BenchmarkGroup:
tags: []
"julia" => Trial(6.250 ms)
"reference" => Trial(5.873 ms)
"julia-ka" => Trial(6.121 ms)
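For anyone who wants to compare the two launch forms without the full benchmark, here is a minimal sketch. It assumes a hypothetical stand-in kernel (`copy_kernel!`, not the benchmark's `diffusion_kernel_ka!`) and runs on the `CPU()` backend so that it works without a GPU; swapping in `CUDABackend()` and `CuArray`s should give the GPU variant. It only illustrates the two call styles and checks that they produce the same result, it does not reproduce the timings above.

```julia
using KernelAbstractions

# Hypothetical stand-in kernel (NOT the benchmark's diffusion kernel):
# copies A into B element-wise over a 3D index range.
@kernel function copy_kernel!(B, @Const(A))
    i, j, k = @index(Global, NTuple)
    @inbounds B[i, j, k] = A[i, j, k]
end

backend = CPU()   # assumption: replace with CUDABackend() (and `using CUDA`) on a GPU
n = 32
A = rand(Float64, n, n, n)
B = zeros(Float64, n, n, n)

# Dynamic range: only the workgroup size is fixed at instantiation,
# the ndrange is supplied as a keyword argument at launch time.
copy_kernel!(backend, 256)(B, A; ndrange=(n, n, n))
KernelAbstractions.synchronize(backend)
dynamic_ok = (B == A)

# Static range: workgroup size and ndrange are both fixed at instantiation,
# so the launch itself takes no ndrange keyword.
fill!(B, 0.0)
copy_kernel!(backend, 256, (n, n, n))(B, A)
KernelAbstractions.synchronize(backend)
static_ok = (B == A)

println("dynamic ndrange correct: ", dynamic_ok)
println("static ndrange correct:  ", static_ok)
```

Both forms compute the same result; the difference reported in this issue is purely in launch and indexing performance on the CUDA backend.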