Significant perf drop when using dynamic ranges in GPU kernel #470

Open
@luraess

Running the CUDA benchmarks from the HPCBenchmarks.jl tests shows a significant performance drop when using KernelAbstractions.jl (KA) with a dynamic range definition. The tests below were performed on a GH200 using a local CUDA 12.4 install and Julia 1.10.2.

diffusion_kernel_ka!(CUDABackend(), 256)($A_new, $A, $h; ndrange=($n, $n, $n))

returns a performance drop of nearly 50% compared to plain CUDA.jl and the reference CUDA C implementation:

[ Info: diffusion 3D
[ Info: N = 256
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(104.865 μs)
  "reference" => Trial(92.161 μs)
  "julia-ka" => Trial(173.473 μs)
[ Info: N = 512
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(771.301 μs)
  "reference" => Trial(672.581 μs)
  "julia-ka" => Trial(1.299 ms)
[ Info: N = 1024
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(6.251 ms)
  "reference" => Trial(5.833 ms)
  "julia-ka" => Trial(10.285 ms)
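
To put a number on the drop, the slowdown factors can be computed directly from the timings above. This is only a small arithmetic sketch: the `timings` table and the `slowdown` helper are names introduced here for illustration, and the values are copied by hand from the output (converted to microseconds).

```julia
# Times in microseconds, copied from the benchmark output above
# (1.299 ms -> 1299.0 μs, 6.251 ms -> 6251.0 μs, and so on).
timings = Dict(
    256  => (julia = 104.865, reference = 92.161,  ka = 173.473),
    512  => (julia = 771.301, reference = 672.581, ka = 1299.0),
    1024 => (julia = 6251.0,  reference = 5833.0,  ka = 10285.0),
)

# Ratio of the KA time to each baseline; values above 1 mean KA is slower.
slowdown(t) = (vs_julia = t.ka / t.julia, vs_reference = t.ka / t.reference)

for n in sort(collect(keys(timings)))
    s = slowdown(timings[n])
    println("N = $n: ",
            round(s.vs_julia; digits = 2), "x vs CUDA.jl, ",
            round(s.vs_reference; digits = 2), "x vs CUDA C")
end
```

With these numbers the dynamic-range KA kernel takes between roughly 1.6x and 1.9x as long as the two baselines, which corresponds to a throughput loss of roughly 40% to 48%.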
In contrast, modifying the call to use a static range definition:
diffusion_kernel_ka!(CUDABackend(), 256, ($n, $n, $n))($A_new, $A, $h)

returns

[ Info: diffusion 3D
[ Info: N = 256
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(104.993 μs)
  "reference" => Trial(92.416 μs)
  "julia-ka" => Trial(103.649 μs)
[ Info: N = 512
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(770.790 μs)
  "reference" => Trial(672.037 μs)
  "julia-ka" => Trial(769.701 μs)
[ Info: N = 1024
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "julia" => Trial(6.250 ms)
  "reference" => Trial(5.873 ms)
  "julia-ka" => Trial(6.121 ms)
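
For readers who want to try the two launch styles without a GPU, here is a minimal sketch on the KernelAbstractions CPU backend. The kernel `diffusion_sketch!` is a hypothetical stand-in written for this example, not the actual `diffusion_kernel_ka!` from HPCBenchmarks.jl, and the array size, workgroup size, and `h` are arbitrary. It only illustrates the API difference: in the dynamic form `ndrange` is passed at launch time, while in the static form it is given when the kernel is instantiated and, as far as I understand, becomes part of the kernel's type. It does not reproduce the timings, and it makes no claim about the root cause of the slowdown.

```julia
using KernelAbstractions

# Hypothetical 7-point stencil update, standing in for the benchmark kernel.
@kernel function diffusion_sketch!(A_new, A, h)
    I = @index(Global, Cartesian)
    i, j, k = Tuple(I)
    nx, ny, nz = size(A)
    # Update interior points only; boundary values are left untouched.
    if 1 < i < nx && 1 < j < ny && 1 < k < nz
        @inbounds A_new[i, j, k] = A[i, j, k] + h * (
            A[i - 1, j, k] + A[i + 1, j, k] +
            A[i, j - 1, k] + A[i, j + 1, k] +
            A[i, j, k - 1] + A[i, j, k + 1] - 6 * A[i, j, k])
    end
end

backend = CPU()   # swap in CUDABackend() (from CUDA.jl) with GPU arrays to run on a GPU
n = 32
h = 0.1
A = rand(Float64, n, n, n)
A_dyn = copy(A)
A_sta = copy(A)

# Dynamic range: only the workgroup size is fixed at instantiation,
# ndrange is supplied as a keyword argument at launch.
diffusion_sketch!(backend, 32)(A_dyn, A, h; ndrange = (n, n, n))

# Static range: workgroup size and ndrange are both fixed at instantiation,
# so the launch itself takes no ndrange keyword.
diffusion_sketch!(backend, 32, (n, n, n))(A_sta, A, h)

KernelAbstractions.synchronize(backend)
println("results match: ", A_dyn == A_sta)
```

Both forms should compute the same result; only the way the iteration range reaches the kernel differs, which is the variable that changes between the two benchmark runs above.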
