Significant perf drop when using dynamic ranges in GPU kernel #470

Running the CUDA benchmarks from the [HPCBenchmarks.jl](https://github.com/PTsolvers/HPCBenchmarks.jl/tree/main/CUDA) tests shows a significant performance drop when using KernelAbstractions.jl (KA) with a dynamic range definition. The tests below were performed on a GH200 using a local CUDA 12.4 install and Julia 1.10.2.

- Using a dynamic range (`ndrange` passed at launch), as implemented in the benchmark https://github.com/PTsolvers/HPCBenchmarks.jl/blob/a5985aaaf931efb0caf194d669e3bfcb90c5c08e/CUDA/diffusion_3d.jl#L39:
```julia
diffusion_kernel_ka!(CUDABackend(), 256)($A_new, $A, $h; ndrange=($n, $n, $n))
```
takes roughly 1.65x as long as plain CUDA.jl (about a 40% drop in throughput), and is slower still relative to the reference CUDA C:
```
[ Info: diffusion 3D
[ Info: N = 256
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(104.865 μs)
  "reference" => Trial(92.161 μs)
  "julia-ka" => Trial(173.473 μs)
[ Info: N = 512
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(771.301 μs)
  "reference" => Trial(672.581 μs)
  "julia-ka" => Trial(1.299 ms)
[ Info: N = 1024
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(6.251 ms)
  "reference" => Trial(5.833 ms)
  "julia-ka" => Trial(10.285 ms)
```

- Modifying it to use a static range definition (range passed when instantiating the kernel):
```julia
diffusion_kernel_ka!(CUDABackend(), 256, ($n, $n, $n))($A_new, $A, $h)
```
brings the KA timings back on par with plain CUDA.jl:
```
[ Info: diffusion 3D
[ Info: N = 256
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(104.993 μs)
  "reference" => Trial(92.416 μs)
  "julia-ka" => Trial(103.649 μs)
[ Info: N = 512
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(770.790 μs)
  "reference" => Trial(672.037 μs)
  "julia-ka" => Trial(769.701 μs)
[ Info: N = 1024
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(6.250 ms)
  "reference" => Trial(5.873 ms)
  "julia-ka" => Trial(6.121 ms)
```
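
For reference, here is a minimal sketch of the two launch forms compared above, runnable on the CPU backend so it does not need a GPU. The kernel `scale_kernel!`, the size `n = 32`, and the workgroup size `32` are hypothetical stand-ins, not the benchmark's diffusion kernel or settings. It only illustrates where the range is supplied in each case and does not reproduce the timings.

```julia
using KernelAbstractions

# Hypothetical stand-in kernel (NOT the benchmark's diffusion kernel):
# writes h * A into B, one element per work-item.
@kernel function scale_kernel!(B, @Const(A), h)
    i, j, k = @index(Global, NTuple)
    @inbounds B[i, j, k] = h * A[i, j, k]
end

backend = CPU()   # assumption: CPU backend so the sketch runs anywhere
n = 32            # assumption: small size for illustration
h = 0.5
A = rand(Float64, n, n, n)
B_dynamic = zeros(Float64, n, n, n)
B_static  = zeros(Float64, n, n, n)

# Dynamic range: the kernel is instantiated with the workgroup size only,
# and `ndrange` is passed as a keyword at launch time.
scale_kernel!(backend, 32)(B_dynamic, A, h; ndrange=(n, n, n))

# Static range: the range is passed when instantiating the kernel,
# so it is part of the kernel's type and known at compile time.
scale_kernel!(backend, 32, (n, n, n))(B_static, A, h)

KernelAbstractions.synchronize(backend)
```

Both forms compute the same result. The difference is only whether the range is known when the kernel is compiled or supplied at launch.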
