
Improve CUDA performance for multi-dimensional arrays #48

Open
charleskawczynski opened this issue Oct 10, 2024 · 2 comments

charleskawczynski commented Oct 10, 2024

The baseline performance of the multidimensional array kernels is not good. Based on our CUDA benchmarks:

N reads-writes: 8, N-reps: 1,  Float_type = Float32, Device_bandwidth_GBs=732
┌───────────────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬─────────────────────┐
│ funcs                             │ time per call                     │ bw %    │ achieved bw │ problem size        │
├───────────────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼─────────────────────┤
│ perf_kernel_shared_reads_unfused! │ 35 milliseconds, 32 microseconds  │ 4.70676 │ 34.4535     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_shared_reads_fused!   │ 37 milliseconds, 933 microseconds │ 4.34687 │ 31.8191     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_shared_reads_unfused! │ 6 milliseconds, 464 microseconds  │ 25.5066 │ 186.708     │ (40500000,)         │
│ perf_kernel_shared_reads_fused!   │ 3 milliseconds, 57 microseconds   │ 53.9236 │ 394.721     │ (40500000,)         │
└───────────────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴─────────────────────┘
 98.213180 seconds (41.61 M allocations: 5.144 GiB, 10.82% gc time, 36.05% compilation time: 1% of which was recompilation)
N reads-writes: 8, N-reps: 1,  Float_type = Float32, Device_bandwidth_GBs=732
┌──────────────────────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬─────────────────────┐
│ funcs                                    │ time per call                     │ bw %    │ achieved bw │ problem size        │
├──────────────────────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼─────────────────────┤
│ perf_kernel_shared_reads_writes_unfused! │ 24 milliseconds, 180 microseconds │ 6.81902 │ 49.9152     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_shared_reads_writes_fused!   │ 15 milliseconds, 248 microseconds │ 10.8137 │ 79.156      │ (50, 5, 5, 6, 5400) │
│ perf_kernel_shared_reads_writes_unfused! │ 3 milliseconds, 895 microseconds  │ 42.325  │ 309.819     │ (40500000,)         │
│ perf_kernel_shared_reads_writes_fused!   │ 2 milliseconds, 673 microseconds  │ 61.6844 │ 451.53      │ (40500000,)         │
└──────────────────────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴─────────────────────┘
 56.390611 seconds (3.79 M allocations: 2.642 GiB, 15.67% gc time, 6.64% compilation time)
N reads-writes: 8, N-reps: 1,  Float_type = Float32, Device_bandwidth_GBs=732
┌─────────────────────────┬───────────────────────────────────┬─────────┬─────────────┬─────────────────────┐
│ funcs                   │ time per call                     │ bw %    │ achieved bw │ problem size        │
├─────────────────────────┼───────────────────────────────────┼─────────┼─────────────┼─────────────────────┤
│ perf_kernel_unfused!    │ 34 milliseconds, 948 microseconds │ 4.71814 │ 34.5368     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_fused!      │ 37 milliseconds, 955 microseconds │ 4.34428 │ 31.8001     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_hard_coded! │ 3 milliseconds, 291 microseconds  │ 50.1021 │ 366.747     │ (50, 5, 5, 6, 5400) │
│ perf_kernel_unfused!    │ 6 milliseconds, 463 microseconds  │ 25.5097 │ 186.731     │ (40500000,)         │
│ perf_kernel_fused!      │ 3 milliseconds, 58 microseconds   │ 53.9181 │ 394.68      │ (40500000,)         │
│ perf_kernel_hard_coded! │ 3 milliseconds, 339 microseconds  │ 49.3735 │ 361.414     │ (40500000,)         │
└─────────────────────────┴───────────────────────────────────┴─────────┴─────────────┴─────────────────────┘
 81.495402 seconds (3.83 M allocations: 2.629 GiB, 15.54% gc time, 5.22% compilation time)
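
For reference, the bw % and achieved bw columns appear to be consistent with the following estimate (a sketch, assuming GiB-based bandwidth measured against the 732 GB/s device peak and the 8 reads-writes per element reported in the header):

```julia
# Sketch: reproduce the "achieved bw" and "bw %" columns from the tables above.
# Assumes bandwidth is reported in GiB/s (2^30 bytes) against the 732 device peak.
nbytes(sz; n_reads_writes = 8, FT = Float32) = n_reads_writes * sizeof(FT) * prod(sz)

achieved_bw(time_s, sz) = nbytes(sz) / 2^30 / time_s               # GiB/s
bw_percent(time_s, sz; device_bw = 732) = 100 * achieved_bw(time_s, sz) / device_bw

achieved_bw(35.032e-3, (50, 5, 5, 6, 5400))   # ≈ 34.45, matching the first row
bw_percent(35.032e-3, (50, 5, 5, 6, 5400))    # ≈ 4.71
```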

Fusing the multi-dimensional array kernels does not improve over the unfused kernels, unlike our vector-fused kernels. Could be JuliaGPU/Metal.jl#101?
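
For context, the distinction the benchmarks draw is roughly the following (a minimal CUDA.jl sketch with made-up kernel bodies, not the actual benchmark code): the unfused variant launches one kernel per output and re-reads the shared inputs, while the fused variant performs all writes from a single kernel.

```julia
using CUDA

# Unfused: two broadcast kernel launches; x1 and x2 are each read twice.
function unfused!(y1, y2, x1, x2)
    @. y1 = x1 + x2
    @. y2 = x1 - x2
    return nothing
end

# Fused: one hand-written kernel that shares the reads and performs both writes.
function fused_kernel!(y1, y2, x1, x2)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y1)
        @inbounds begin
            a, b = x1[i], x2[i]
            y1[i] = a + b
            y2[i] = a - b
        end
    end
    return nothing
end

fused!(y1, y2, x1, x2) =
    @cuda threads=256 blocks=cld(length(y1), 256) fused_kernel!(y1, y2, x1, x2)
```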

@charleskawczynski

Also, perhaps related (or tangential): we should see how things look when unrolling via Base.Cartesian.@nexprs.
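
A minimal sketch of the kind of unrolling meant here (the function and the hard-coded extent are hypothetical; they only illustrate the macro):

```julia
using Base.Cartesian: @nexprs

# Sketch: unroll a short, statically known inner dimension with @nexprs so the
# loop body becomes straight-line code instead of a short inner loop.
function unrolled_axpy!(y, x, a)   # assumes size(x, 1) == 4
    @inbounds for j in axes(x, 2)
        @nexprs 4 i -> (y[i, j] = a * x[i, j] + y[i, j])
        # expands to:
        #   y[1, j] = a * x[1, j] + y[1, j]
        #   ...
        #   y[4, j] = a * x[4, j] + y[4, j]
    end
    return y
end
```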

charleskawczynski commented Oct 18, 2024

This can easily be fixed by using a tuned multi-dimensional launch configuration (which is what we do in ClimaCore); however, fusion still doesn't occur unless we "force" linear indexing (see the sketch after the list below). This imposes a few problems on us:

  • We basically need to hack our own broadcasted object that supports linear indexing, like JuliaLang/julia#30973 ("Workaround #28126, support SIMDing broadcast in more cases").
  • Broadcasted index support seems to have changed in Julia 1.11 (see ClimaCore.jl#1923, "#1920 broke CI on Julia 1.11"); we need to open an issue in JuliaLang to better understand the new required interface.
  • Computing linear index offsets can only be done efficiently if we have datalayouts where the field index is (first or) last.
  • Using linear indexing is perfectly fine for pointwise kernels, but it could get pretty complicated for stencil kernels.
  • I'm not sure what the implications are for our stencil kernels if we switch to field-ending datalayouts.
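
A rough sketch of what "tuned launch configuration plus forced linear indexing" could look like with CUDA.jl's occupancy API (the kernel and helper names are hypothetical; ClimaCore's actual implementation differs):

```julia
using CUDA

# Hypothetical pointwise kernel that "forces" linear indexing: each thread
# computes one linear index and evaluates the (fused) body at that index,
# avoiding multi-dimensional index arithmetic in the hot loop.
function linear_pointwise_kernel!(y, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = 2 * x[i]   # stand-in for the fused broadcast body
    end
    return nothing
end

# Tune the launch configuration with the occupancy API instead of hard-coding
# threads/blocks.
function launch_tuned!(y, x)
    kernel = @cuda launch=false linear_pointwise_kernel!(y, x)
    config = launch_configuration(kernel.fun)
    threads = min(length(y), config.threads)
    blocks = cld(length(y), threads)
    kernel(y, x; threads, blocks)
    return y
end
```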
