
Improve CUDA performance for multi-dimensional arrays #48

Open
charleskawczynski opened this issue Oct 10, 2024 · 2 comments

charleskawczynski commented Oct 10, 2024

The baseline performance of the multidimensional array kernels is not good. Based on our CUDA benchmarks:

N reads-writes: 8, N-reps: 1,  Float_type = Float32, Device_bandwidth_GBs=732
┌───────────────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬─────────────────────┐
│ funcs                             │ time per call                     │ bw %    │ achieved bw │ problem size        │
├───────────────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼─────────────────────┤
│ perf_kernel_shared_reads_unfused! │ 35 milliseconds, 32 microseconds  │ 4.70676 │ 34.4535     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_shared_reads_fused!   │ 37 milliseconds, 933 microseconds │ 4.34687 │ 31.8191     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_shared_reads_unfused! │ 6 milliseconds, 464 microseconds  │ 25.5066 │ 186.708     │ (40500000,)         │
│ perf_kernel_shared_reads_fused!   │ 3 milliseconds, 57 microseconds   │ 53.9236 │ 394.721     │ (40500000,)         │
└───────────────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴─────────────────────┘
 98.213180 seconds (41.61 M allocations: 5.144 GiB, 10.82% gc time, 36.05% compilation time: 1% of which was recompilation)
N reads-writes: 8, N-reps: 1,  Float_type = Float32, Device_bandwidth_GBs=732
┌──────────────────────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬─────────────────────┐
│ funcs                                    │ time per call                     │ bw %    │ achieved bw │ problem size        │
├──────────────────────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼─────────────────────┤
│ perf_kernel_shared_reads_writes_unfused! │ 24 milliseconds, 180 microseconds │ 6.81902 │ 49.9152     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_shared_reads_writes_fused!   │ 15 milliseconds, 248 microseconds │ 10.8137 │ 79.156      │ (50, 5, 5, 6, 5400) │
│ perf_kernel_shared_reads_writes_unfused! │ 3 milliseconds, 895 microseconds  │ 42.325  │ 309.819     │ (40500000,)         │
│ perf_kernel_shared_reads_writes_fused!   │ 2 milliseconds, 673 microseconds  │ 61.6844 │ 451.53      │ (40500000,)         │
└──────────────────────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴─────────────────────┘
 56.390611 seconds (3.79 M allocations: 2.642 GiB, 15.67% gc time, 6.64% compilation time)
N reads-writes: 8, N-reps: 1,  Float_type = Float32, Device_bandwidth_GBs=732
┌─────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬─────────────────────┐
│ funcs                   │ time per call                     │ bw %    │ achieved bw │ problem size        │
├─────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼─────────────────────┤
│ perf_kernel_unfused!    │ 34 milliseconds, 948 microseconds │ 4.71814 │ 34.5368     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_fused!      │ 37 milliseconds, 955 microseconds │ 4.34428 │ 31.8001     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_hard_coded! │ 3 milliseconds, 291 microseconds  │ 50.1021 │ 366.747     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_unfused!    │ 6 milliseconds, 463 microseconds  │ 25.5097 │ 186.731     │ (40500000,)         │
│ perf_kernel_fused!      │ 3 milliseconds, 58 microseconds   │ 53.9181 │ 394.68      │ (40500000,)         │
│ perf_kernel_hard_coded! │ 3 milliseconds, 339 microseconds  │ 49.3735 │ 361.414     │ (40500000,)         │
└─────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴─────────────────────┘
 81.495402 seconds (3.83 M allocations: 2.629 GiB, 15.54% gc time, 5.22% compilation time)
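
For reference, the bw % and achieved bw columns appear to be consistent with the following estimate (a sketch, assuming GiB-based bandwidth measured against the 732 GB/s device peak and the 8 reads-writes per element reported in the header):

```julia
# Sketch: reproduce the "achieved bw" and "bw %" columns from the tables above.
# Assumes bandwidth is reported in GiB/s (2^30 bytes) against the 732 device peak.
nbytes(sz; n_reads_writes = 8, FT = Float32) = n_reads_writes * sizeof(FT) * prod(sz)

achieved_bw(time_s, sz) = nbytes(sz) / 2^30 / time_s               # GiB/s
bw_percent(time_s, sz; device_bw = 732) = 100 * achieved_bw(time_s, sz) / device_bw

achieved_bw(35.032e-3, (50, 5, 5, 6, 5400))   # ≈ 34.45, matching the first row
bw_percent(35.032e-3, (50, 5, 5, 6, 5400))    # ≈ 4.71
```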

Fusing the multi-dimensional array kernels does not improve over the unfused kernels, unlike our vector-fused kernels. Could be JuliaGPU/Metal.jl#101?
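
For context, the distinction the benchmarks draw is roughly the following (a minimal CUDA.jl sketch with made-up kernel bodies, not the actual benchmark code): the unfused variant launches one kernel per output and re-reads the shared inputs, while the fused variant performs all writes from a single kernel.

```julia
using CUDA

# Unfused: two broadcast kernel launches; x1 and x2 are each read twice.
function unfused!(y1, y2, x1, x2)
    @. y1 = x1 + x2
    @. y2 = x1 - x2
    return nothing
end

# Fused: one hand-written kernel that shares the reads and performs both writes.
function fused_kernel!(y1, y2, x1, x2)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y1)
        @inbounds begin
            a, b = x1[i], x2[i]
            y1[i] = a + b
            y2[i] = a - b
        end
    end
    return nothing
end

fused!(y1, y2, x1, x2) =
    @cuda threads=256 blocks=cld(length(y1), 256) fused_kernel!(y1, y2, x1, x2)
```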

@charleskawczynski

Also, perhaps related (or tangential): we should see how things look when unrolling via Base.Cartesian.@nexprs.
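
A minimal sketch of the kind of unrolling meant here (the function and the hard-coded extent are hypothetical; they only illustrate the macro):

```julia
using Base.Cartesian: @nexprs

# Sketch: unroll a short, statically known inner dimension with @nexprs so the
# loop body becomes straight-line code instead of a short inner loop.
function unrolled_axpy!(y, x, a)   # assumes size(x, 1) == 4
    @inbounds for j in axes(x, 2)
        @nexprs 4 i -> (y[i, j] = a * x[i, j] + y[i, j])
        # expands to:
        #   y[1, j] = a * x[1, j] + y[1, j]
        #   ...
        #   y[4, j] = a * x[4, j] + y[4, j]
    end
    return y
end
```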

charleskawczynski commented Oct 18, 2024

This can easily be fixed by using a tuned multi-dimensional launch configuration (which is what we do in ClimaCore); however, fusion still doesn't occur unless we "force" linear indexing (see the sketch after the list below). This imposes a few problems on us:

  • We basically need to hack our own broadcasted object that supports linear indexing, like JuliaLang/julia#30973 ("Workaround #28126, support SIMDing broadcast in more cases").
  • Broadcasted index support seems to have changed in Julia 1.11 (see ClimaCore.jl#1923, "#1920 broke CI on Julia 1.11"); we need to open an issue in JuliaLang to better understand the new required interface.
  • Computing linear index offsets can only be done efficiently if we have datalayouts where the field index is (first or) last.
  • Using linear indexing is perfectly fine for pointwise kernels, but it could get pretty complicated for stencil kernels.
  • I'm not sure what the implications are for our stencil kernels if we switch to field-ending datalayouts.
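
A rough sketch of what "tuned launch configuration plus forced linear indexing" could look like with CUDA.jl's occupancy API (the kernel and helper names are hypothetical; ClimaCore's actual implementation differs):

```julia
using CUDA

# Hypothetical pointwise kernel that "forces" linear indexing: each thread
# computes one linear index and evaluates the (fused) body at that index,
# avoiding multi-dimensional index arithmetic in the hot loop.
function linear_pointwise_kernel!(y, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = 2 * x[i]   # stand-in for the fused broadcast body
    end
    return nothing
end

# Tune the launch configuration with the occupancy API instead of hard-coding
# threads/blocks.
function launch_tuned!(y, x)
    kernel = @cuda launch=false linear_pointwise_kernel!(y, x)
    config = launch_configuration(kernel.fun)
    threads = min(length(y), config.threads)
    blocks = cld(length(y), threads)
    kernel(y, x; threads, blocks)
    return y
end
```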
