Use ClimaCartesianIndices #2304
Conversation
force-pushed from 8c80b51 to a29c41e
force-pushed from 1b0e649 to 036606a
Would it be possible to run ClimaAtmos CI with this change and measure the effect on SYPD for GPU runs? The simple array example you've linked looks promising, but it would be good to measure a real-world example.
Yes, absolutely. I have a few fixes pending. Assuming it's strictly better, does this look alright? I'm hopeful because the simple example improves: CUDA cannot hide the latency of the expensive integer division when there are few loads/operations. So I think it should be cheaper in all cases.
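To illustrate the integer-division point: turning a flat thread index back into a multi-dimensional index costs one `divrem` per dimension, and hardware integer division is slow. Below is a minimal CPU-side sketch (a hypothetical helper, not ClimaCartesianIndices' actual implementation) of the idea of baking the dims into the type via `Val`, so the divisors become compile-time constants that the compiler can strength-reduce to multiply/shift sequences:

```julia
# Dynamic dims: each divrem is a genuine hardware integer division.
function to_cartesian(linear::Int, dims::NTuple{3, Int})
    q1, r1 = divrem(linear - 1, dims[1])  # column-major: i varies fastest
    q2, r2 = divrem(q1, dims[2])
    return (r1 + 1, r2 + 1, q2 + 1)       # (i, j, k)
end

# Static dims: after inlining, the divisors are compile-time constants, so
# LLVM can replace the divisions with cheaper multiply/shift instructions.
@inline function to_cartesian(linear::Int, ::Val{dims}) where {dims}
    q1, r1 = divrem(linear - 1, dims[1])
    q2, r2 = divrem(q1, dims[2])
    return (r1 + 1, r2 + 1, q2 + 1)
end

@assert to_cartesian(37, (4, 4, 63)) == to_cartesian(37, Val((4, 4, 63)))
```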
My only concern is that
force-pushed from 3f5bc86 to d5bb492
The `copyto!` benchmarks show:

# (previous) Multi-dimensional launch configuration (this is not robust to resolution changes)
N reads-writes: 2, N-reps: 10000, Float_type = Float64, Device_bandwidth_GBs=2039
┌───────┬──────────────────────────────────┬────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call │ bw % │ achieved bw │ problem size │
├───────┼──────────────────────────────────┼────────────┼─────────────┼─────────────────────┤
│ DataF │ 12 microseconds, 890 nanoseconds │ 5.66957e-5 │ 0.00115602 │ (1, 1, 1, 1, 1) │
│ IJFH │ 14 microseconds, 780 nanoseconds │ 4.27211 │ 87.1083 │ (4, 4, 1, 1, 5400) │
│ IJHF │ 13 microseconds, 570 nanoseconds │ 4.65304 │ 94.8755 │ (4, 4, 1, 1, 5400) │
│ IFH │ 13 microseconds, 740 nanoseconds │ 1.14887 │ 23.4254 │ (4, 1, 1, 1, 5400) │
│ IHF │ 13 microseconds, 569 nanoseconds │ 1.16335 │ 23.7206 │ (4, 1, 1, 1, 5400) │
│ VF │ 12 microseconds, 900 nanoseconds │ 0.00356906 │ 0.0727731 │ (1, 1, 1, 63, 1) │
│ VIJFH │ 63 microseconds, 401 nanoseconds │ 62.7434 │ 1279.34 │ (4, 4, 1, 63, 5400) │
│ VIJHF │ 62 microseconds, 789 nanoseconds │ 63.3539 │ 1291.79 │ (4, 4, 1, 63, 5400) │
│ VIFH │ 20 microseconds, 230 nanoseconds │ 49.1612 │ 1002.4 │ (4, 1, 1, 63, 5400) │
│ VIHF │ 18 microseconds, 390 nanoseconds │ 54.0774 │ 1102.64 │ (4, 1, 1, 63, 5400) │
└───────┴──────────────────────────────────┴────────────┴─────────────┴─────────────────────┘

# CartesianIndices (main branch)
N reads-writes: 2, N-reps: 10000, Float_type = Float64, Device_bandwidth_GBs=2039
┌───────┬───────────────────────────────────┬────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call │ bw % │ achieved bw │ problem size │
├───────┼───────────────────────────────────┼────────────┼─────────────┼─────────────────────┤
│ DataF │ 14 microseconds, 751 nanoseconds │ 4.95463e-5 │ 0.00101025 │ (1, 1, 1, 1, 1) │
│ IJFH │ 16 microseconds, 860 nanoseconds │ 3.74506 │ 76.3618 │ (4, 4, 1, 1, 5400) │
│ IJHF │ 13 microseconds, 679 nanoseconds │ 4.61596 │ 94.1195 │ (4, 4, 1, 1, 5400) │
│ IFH │ 15 microseconds, 241 nanoseconds │ 1.03579 │ 21.1198 │ (4, 1, 1, 1, 5400) │
│ IHF │ 13 microseconds, 451 nanoseconds │ 1.17364 │ 23.9305 │ (4, 1, 1, 1, 5400) │
│ VF │ 12 microseconds, 791 nanoseconds │ 0.00359975 │ 0.073399 │ (1, 1, 1, 63, 1) │
│ VIJFH │ 117 microseconds, 230 nanoseconds │ 33.933 │ 691.894 │ (4, 4, 1, 63, 5400) │
│ VIJHF │ 62 microseconds, 601 nanoseconds │ 63.5452 │ 1295.69 │ (4, 4, 1, 63, 5400) │
│ VIFH │ 36 microseconds, 520 nanoseconds │ 27.2312 │ 555.244 │ (4, 1, 1, 63, 5400) │
│ VIHF │ 18 microseconds, 70 nanoseconds │ 55.0381 │ 1122.23 │ (4, 1, 1, 63, 5400) │
└───────┴───────────────────────────────────┴────────────┴─────────────┴─────────────────────┘

# FastCartesianIndices (this branch)
N reads-writes: 2, N-reps: 10000, Float_type = Float64, Device_bandwidth_GBs=2039
┌───────┬──────────────────────────────────┬────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call │ bw % │ achieved bw │ problem size │
├───────┼──────────────────────────────────┼────────────┼─────────────┼─────────────────────┤
│ DataF │ 15 microseconds, 41 nanoseconds │ 4.85909e-5 │ 0.000990769 │ (1, 1, 1, 1, 1) │
│ IJFH │ 16 microseconds, 580 nanoseconds │ 3.80831 │ 77.6514 │ (4, 4, 1, 1, 5400) │
│ IJHF │ 13 microseconds, 661 nanoseconds │ 4.62238 │ 94.2504 │ (4, 4, 1, 1, 5400) │
│ IFH │ 15 microseconds, 659 nanoseconds │ 1.00807 │ 20.5546 │ (4, 1, 1, 1, 5400) │
│ IHF │ 13 microseconds, 570 nanoseconds │ 1.16326 │ 23.7189 │ (4, 1, 1, 1, 5400) │
│ VF │ 13 microseconds, 481 nanoseconds │ 0.00341549 │ 0.0696419 │ (1, 1, 1, 63, 1) │
│ VIJFH │ 87 microseconds, 241 nanoseconds │ 45.5976 │ 929.734 │ (4, 4, 1, 63, 5400) │
│ VIJHF │ 63 microseconds, 19 nanoseconds │ 63.1227 │ 1287.07 │ (4, 4, 1, 63, 5400) │
│ VIFH │ 27 microseconds, 370 nanoseconds │ 36.3348 │ 740.866 │ (4, 1, 1, 63, 5400) │
│ VIHF │ 18 microseconds, 179 nanoseconds │ 54.705 │ 1115.44 │ (4, 1, 1, 63, 5400) │
└───────┴──────────────────────────────────┴────────────┴─────────────┴─────────────────────┘

So, we recover the 25% performance loss from our revert away from multi-dimensional indices, which agrees with what I saw in the ClimaCartesianIndices docs. We could get another ~15-20% if we passed the indices through. I'll try running ClimaAtmos next and compare against the main branch.
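For context, the benchmarked pattern is roughly the following (a simplified, hypothetical sketch, not the actual ClimaCore benchmark harness): a flat kernel launch in which each thread decodes its linear index through the index object, so the decoding cost is exactly what differs between `CartesianIndices` and `FastCartesianIndices`:

```julia
using CUDA

# Hypothetical sketch of the timed copyto! pattern: each thread turns its
# flat index into a Cartesian index via `inds`, then does one load + store.
function copyto_kernel!(dst, src, inds)
    i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    if i <= length(inds)
        @inbounds I = inds[i]    # linear -> Cartesian decoding happens here
        @inbounds dst[I] = src[I]
    end
    return nothing
end

dst = CUDA.zeros(Float64, 4, 4, 63, 5400)
src = CUDA.rand(Float64, 4, 4, 63, 5400)
inds = CartesianIndices(src)  # swap in a FastCartesianIndices to compare
@cuda threads = 256 blocks = cld(length(inds), 256) copyto_kernel!(dst, src, inds)
```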
Agreed. I changed it to a type assert and always use
force-pushed from c544621 to c866b9f
Let's just make this a direct dependency.
force-pushed from 347b75e to 23cef18
Apply formatter
Print more type info in JET tests
Update deps
Fixes
force-pushed from 23cef18 to 4835656
I think these two builds are fair to compare (here I'm comparing the dry baro wave):
We should probably be doing this comparison on an A100 (though I posted microbenchmarks on an A100 in the ClimaCartesianIndices docs), but it's at least in agreement with the microbenchmarks and shows that there is some benefit in real-world cases.
Here are builds on an A100:
- ClimaCore 0.14.30, 1 GPU dry baro: 3.583
- ClimaCore 0.14.31, 1 GPU dry baro: 3.461
- ClimaCartesianIndices, 1 GPU dry baro: 3.434
A100 estimates are much more reliable, I agree. E.g., on the P100 central queue, the linked branch above gives 68.48 SYPD for the dry baro wave problem, vs. 83.905 SYPD in the latest RRTMGP interface update, PR 3786 (both use RRTMGP 0.21.2 and ClimaCore 0.14.31).
https://gist.github.com/akshaysridhar/fca4d50242a8bcc65faa8289b03cdc8f shows the target-gpu timings across 4 older commits between 0.14.30 and 0.14.31.
0.14.31 also included upgraded dependencies for a bunch of other packages, including SciMLBase, GPUCompiler, and a Tracy extension. Are you sure this is a fair comparison? The Tracy extension, which @kmdeck and I observed showing up in the Nsight report, may very well have added overhead to those jobs. Did anyone look into this? Reporting SYPD across versions can help identify whether regressions occur, but it doesn't explain why. Has anyone looked at the Nsight reports? The build @Sbozzolo posted seems to have a bunch of canceled jobs, so it's not clear to me where those numbers are coming from. I'd suggest we report links to specific jobs with Nsight reports, so that we can identify what exactly has slowed down.
I think it is a reasonable comparison (w.r.t. additional noise from unrelated commits, I mean): the Buildkite IDs I've reported are a sampling of commits between the two releases between which the degradation was noticed. I'm not aware of the Tracy extension, but we can take a look at this shortly. It's possible that identifying issues from these updated dependencies does indeed help regain some performance, but this list simply modifies the ClimaCore versions against
Yeah, disabling shmem will improve performance for low-vertical-resolution jobs, but it will degrade performance for high-resolution jobs. The performance regression came from switching from a multi-dimensional launch configuration to a linear one plus CartesianIndices. The shmem difference is orthogonal to the 0.14.30 → 0.14.31 change.
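To make the launch-configuration distinction concrete, here is an illustrative sketch (toy kernels, not ClimaCore's actual ones). The multi-dimensional form needs no index decoding, but it ties the launch geometry to the problem size, which is the robustness issue noted in the table labels above:

```julia
using CUDA

# Multi-dimensional launch: thread/block coordinates map directly onto array
# indices, so no linear-index decoding (and no integer division) is needed.
function kernel_multidim!(dst, src)
    i, j = threadIdx().x, threadIdx().y
    h = blockIdx().x
    @inbounds dst[i, j, h] = src[i, j, h]
    return nothing
end

# Linear launch: one flat index per thread, which must be decoded into
# (i, j, h); this decode is where CartesianIndices pays integer divisions.
function kernel_linear!(dst, src)
    lin = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    if lin <= length(dst)
        I = CartesianIndices(dst)[lin]
        @inbounds dst[I] = src[I]
    end
    return nothing
end

dst = CUDA.zeros(4, 4, 5400); src = CUDA.rand(4, 4, 5400)
@cuda threads = (4, 4) blocks = 5400 kernel_multidim!(dst, src)
@cuda threads = 256 blocks = cld(length(dst), 256) kernel_linear!(dst, src)
```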
Noted, thanks. The strong-scaling
It's really difficult to reason about the results of the gist.
And yeah, 20 levels is low resolution.
Could you share builds where you see a performance increase due to shared memory in an AMIP/Aquaplanet setup? What we observed is that disabling shared memory makes everything that is not baro wave/Held-Suarez faster. We see this across multiple pipelines: the nightly AMIP (39 vertical elements, 5% faster without shared memory), the benchmark AMIP in ClimaCoupler (63 elements, 30% faster without shared memory), and the gpu_aquaplanet case in the atmos target-gpu pipeline (63 elements, 20-30% faster without shared memory).
I'm more interested to know of any kernels that are slower with shmem.
This PR adds the use of ClimaCartesianIndices.jl, which can yield better GPU performance for simple `copyto!` kernels. This PR should move us from `perf_cart_index!(X, Y, CartesianIndices(...))`-like performance to `perf_cart_index!(X, Y, fast_ci(...))`-like performance for most datalayouts (all but those that end with fields, which exhibit `perf_linear_index!`-like performance).
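As a rough illustration of the three regimes named above, here are plain-loop CPU stand-ins (illustrative only; the real `perf_*` functions are GPU kernels from the ClimaCartesianIndices docs, where each thread random-accesses the index object from its flat thread index, which is where the decoding cost shows up):

```julia
# CPU stand-ins for the benchmark functions named above (illustrative only).
function perf_linear_index!(X, Y, inds)
    for i in inds
        @inbounds X[i] = Y[i]   # flat integer indexing: no decoding at all
    end
    return X
end

function perf_cart_index!(X, Y, cis)
    for lin in 1:length(cis)
        @inbounds I = cis[lin]  # on GPU, this decode is the costly step
        @inbounds X[I] = Y[I]
    end
    return X
end

X, Y = zeros(4, 4, 63), rand(4, 4, 63)
perf_linear_index!(X, Y, eachindex(Y))       # fastest regime
perf_cart_index!(X, Y, CartesianIndices(Y))  # Base: dynamic divisors
# With this PR, `fast_ci(...)` would instead build a FastCartesianIndices
# (name taken from the benchmark labels above), whose compile-time-constant
# dims make the decode much cheaper.
```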