Use ClimaCartesianIndices #2304

Open · wants to merge 1 commit into base: main from ck/clima_cartesian_indices

Conversation

@charleskawczynski (Member)

This PR adds ClimaCartesianIndices.jl, which can yield better GPU performance for simple copyto! kernels.

This PR should move us from perf_cart_index!(X, Y, CartesianIndices(...))-like performance to perf_cart_index!(X, Y, fast_ci(...))-like performance for most datalayouts (all but those whose field dimension comes last, which already exhibit perf_linear_index!-like performance).
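
For reference, here is a minimal sketch (using CUDA.jl) of the two indexing styles being compared. The kernel names are taken from the example referenced above; the bodies and the usage snippet are illustrative assumptions rather than the actual benchmark code:

```julia
using CUDA

# Illustrative sketch only: the kernel names come from the example referenced
# above, but these bodies are assumptions, not the actual benchmark code.
function perf_cart_index!(X, Y, indices)
    i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    if i <= length(indices)
        I = indices[i]            # linear -> Cartesian conversion (integer divisions)
        @inbounds X[I] = Y[I]
    end
    return nothing
end

function perf_linear_index!(X, Y, n)
    i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    if i <= n
        @inbounds X[i] = Y[i]     # plain linear indexing, no conversion
    end
    return nothing
end

# Hypothetical usage:
# X = CUDA.zeros(4, 4, 1, 63, 5400); Y = CUDA.rand(4, 4, 1, 63, 5400)
# ci = CartesianIndices(Y)   # or the faster index object from ClimaCartesianIndices.jl
# @cuda threads = 256 blocks = cld(length(ci), 256) perf_cart_index!(X, Y, ci)
```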

@charleskawczynski force-pushed the ck/clima_cartesian_indices branch from 8c80b51 to a29c41e on April 21, 2025 15:59
@charleskawczynski force-pushed the ck/clima_cartesian_indices branch 2 times, most recently from 1b0e649 to 036606a, on April 21, 2025 18:37
@dennisYatunin (Member)

Would it be possible to run ClimaAtmos CI with this change and measure the effect on SYPD for GPU runs? The simple array example you've linked looks promising but it would be good to measure a real-world example.

@charleskawczynski (Member Author) commented Apr 21, 2025

> Would it be possible to run ClimaAtmos CI with this change and measure the effect on SYPD for GPU runs? The simple array example you've linked looks promising but it would be good to measure a real-world example.

Yes, absolutely. I have a few fixes pending. Assuming it’s strictly better, does this look alright?

I will say that I'm hopeful: the simple example improves because CUDA cannot hide the latency of the expensive integer division when there are few loads / operations. So, I think it should be cheaper in all cases.
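
To illustrate the integer-division cost (a plain-Julia sketch of the conversion, not the actual GPU code path): every lookup CartesianIndices(dims)[i] peels off one coordinate per dimension, and each peel costs an integer division.

```julia
# One divrem per dimension; in a simple copy kernel there is almost no other
# work for the GPU to overlap with this latency.
function linear_to_cartesian(dims::NTuple{N, Int}, i::Int) where {N}
    rest = i - 1
    coords = Int[]
    for d in 1:N
        q, r = divrem(rest, dims[d])   # integer division for dimension d
        push!(coords, r + 1)
        rest = q
    end
    return CartesianIndex(coords...)
end

dims = (4, 4, 1, 63, 5400)
@assert linear_to_cartesian(dims, 123_456) == CartesianIndices(dims)[123_456]
```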

@dennisYatunin (Member)

My only concern is that length(CI) > typemax(Int32) might be type-unstable and trigger allocations. If you think that's fine then I'm okay with these changes.

@charleskawczynski force-pushed the ck/clima_cartesian_indices branch from 3f5bc86 to d5bb492 on April 22, 2025 13:29
@charleskawczynski (Member Author)

The copyto! benchmarks show:

# (previous) Multi-dimensional launch configuration (this is not robust to resolution changes)
N reads-writes: 2, N-reps: 10000,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────┬──────────────────────────────────┬────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call                    │ bw %       │ achieved bw │ problem size        │
├───────┼──────────────────────────────────┼────────────┼─────────────┼─────────────────────┤
│ DataF │ 12 microseconds, 890 nanoseconds │ 5.66957e-5 │ 0.00115602  │ (1, 1, 1, 1, 1)     │
│ IJFH  │ 14 microseconds, 780 nanoseconds │ 4.27211    │ 87.1083     │ (4, 4, 1, 1, 5400)  │
│ IJHF  │ 13 microseconds, 570 nanoseconds │ 4.65304    │ 94.8755     │ (4, 4, 1, 1, 5400)  │
│ IFH   │ 13 microseconds, 740 nanoseconds │ 1.14887    │ 23.4254     │ (4, 1, 1, 1, 5400)  │
│ IHF   │ 13 microseconds, 569 nanoseconds │ 1.16335    │ 23.7206     │ (4, 1, 1, 1, 5400)  │
│ VF    │ 12 microseconds, 900 nanoseconds │ 0.00356906 │ 0.0727731   │ (1, 1, 1, 63, 1)    │
│ VIJFH │ 63 microseconds, 401 nanoseconds │ 62.7434    │ 1279.34     │ (4, 4, 1, 63, 5400) │
│ VIJHF │ 62 microseconds, 789 nanoseconds │ 63.3539    │ 1291.79     │ (4, 4, 1, 63, 5400) │
│ VIFH  │ 20 microseconds, 230 nanoseconds │ 49.1612    │ 1002.4      │ (4, 1, 1, 63, 5400) │
│ VIHF  │ 18 microseconds, 390 nanoseconds │ 54.0774    │ 1102.64     │ (4, 1, 1, 63, 5400) │
└───────┴──────────────────────────────────┴────────────┴─────────────┴─────────────────────┘
CartesianIndices (main branch)
N reads-writes: 2, N-reps: 10000,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────┬───────────────────────────────────┬────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call                     │ bw %       │ achieved bw │ problem size        │
├───────┼───────────────────────────────────┼────────────┼─────────────┼─────────────────────┤
│ DataF │ 14 microseconds, 751 nanoseconds  │ 4.95463e-5 │ 0.00101025  │ (1, 1, 1, 1, 1)     │
│ IJFH  │ 16 microseconds, 860 nanoseconds  │ 3.74506    │ 76.3618     │ (4, 4, 1, 1, 5400)  │
│ IJHF  │ 13 microseconds, 679 nanoseconds  │ 4.61596    │ 94.1195     │ (4, 4, 1, 1, 5400)  │
│ IFH   │ 15 microseconds, 241 nanoseconds  │ 1.03579    │ 21.1198     │ (4, 1, 1, 1, 5400)  │
│ IHF   │ 13 microseconds, 451 nanoseconds  │ 1.17364    │ 23.9305     │ (4, 1, 1, 1, 5400)  │
│ VF    │ 12 microseconds, 791 nanoseconds  │ 0.00359975 │ 0.073399    │ (1, 1, 1, 63, 1)    │
│ VIJFH │ 117 microseconds, 230 nanoseconds │ 33.933     │ 691.894     │ (4, 4, 1, 63, 5400) │
│ VIJHF │ 62 microseconds, 601 nanoseconds  │ 63.5452    │ 1295.69     │ (4, 4, 1, 63, 5400) │
│ VIFH  │ 36 microseconds, 520 nanoseconds  │ 27.2312    │ 555.244     │ (4, 1, 1, 63, 5400) │
│ VIHF  │ 18 microseconds, 70 nanoseconds   │ 55.0381    │ 1122.23     │ (4, 1, 1, 63, 5400) │
└───────┴───────────────────────────────────┴────────────┴─────────────┴─────────────────────┘
FastCartesianIndices (this branch)
N reads-writes: 2, N-reps: 10000,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────┬──────────────────────────────────┬────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call                    │ bw %       │ achieved bw │ problem size        │
├───────┼──────────────────────────────────┼────────────┼─────────────┼─────────────────────┤
│ DataF │ 15 microseconds, 41 nanoseconds  │ 4.85909e-5 │ 0.000990769 │ (1, 1, 1, 1, 1)     │
│ IJFH  │ 16 microseconds, 580 nanoseconds │ 3.80831    │ 77.6514     │ (4, 4, 1, 1, 5400)  │
│ IJHF  │ 13 microseconds, 661 nanoseconds │ 4.62238    │ 94.2504     │ (4, 4, 1, 1, 5400)  │
│ IFH   │ 15 microseconds, 659 nanoseconds │ 1.00807    │ 20.5546     │ (4, 1, 1, 1, 5400)  │
│ IHF   │ 13 microseconds, 570 nanoseconds │ 1.16326    │ 23.7189     │ (4, 1, 1, 1, 5400)  │
│ VF    │ 13 microseconds, 481 nanoseconds │ 0.00341549 │ 0.0696419   │ (1, 1, 1, 63, 1)    │
│ VIJFH │ 87 microseconds, 241 nanoseconds │ 45.5976    │ 929.734     │ (4, 4, 1, 63, 5400) │
│ VIJHF │ 63 microseconds, 19 nanoseconds  │ 63.1227    │ 1287.07     │ (4, 4, 1, 63, 5400) │
│ VIFH  │ 27 microseconds, 370 nanoseconds │ 36.3348    │ 740.866     │ (4, 1, 1, 63, 5400) │
│ VIHF  │ 18 microseconds, 179 nanoseconds │ 54.705     │ 1115.44     │ (4, 1, 1, 63, 5400) │
└───────┴──────────────────────────────────┴────────────┴─────────────┴─────────────────────┘

So, we recover the ~25% performance loss incurred when we reverted from the multi-dimensional launch configuration, which agrees with what I saw in the ClimaCartesianIndices docs. We could get another ~15-20% if we passed the indices through with Val, but we don't have Nh in the type space, so that would be type-unstable / a dynamic allocation.
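
For reference, here is a tiny sketch of why the Val route would be type-unstable (Nh below is just an illustrative runtime value, not ClimaCore code):

```julia
# `Val` only helps when the wrapped value is known at compile time; wrapping a
# runtime number forces dynamic dispatch (and an allocation) at the call site.
f(::Val{N}) where {N} = N + 1

g_static() = f(Val(5400))    # literal: fully inferable, no allocation
g_dynamic(Nh) = f(Val(Nh))   # runtime Nh: Val{Nh} is not inferable

Nh = 5400
# @code_warntype g_dynamic(Nh)  # shows the abstract return type / dynamic call
```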

I'll try running ClimaAtmos next, and compare against the main branch.

@charleskawczynski (Member Author)

> My only concern is that length(CI) > typemax(Int32) might be type-unstable and trigger allocations. If you think that's fine then I'm okay with these changes.

Agreed. I changed it to a type assert and always use FastCartesianIndices. I don't think we'll need more than 2 billion points per process (I'm not sure that we can even make a space that big).
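
A minimal sketch of the guard described here, assuming a FastCartesianIndices constructor that accepts a dims tuple (the helper name is illustrative, not the ClimaCore implementation):

```julia
using ClimaCartesianIndices  # assumed to export FastCartesianIndices

# Instead of branching on the length (which would make the index type depend on
# a runtime value), assert that the index space fits in Int32 and always build
# the fast indices.
function checked_fast_indices(dims::Dims)
    n = prod(dims)
    n <= Int(typemax(Int32)) ||
        error("index space has $n points, which exceeds typemax(Int32)")
    return FastCartesianIndices(dims)
end
```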

@charleskawczynski force-pushed the ck/clima_cartesian_indices branch 4 times, most recently from c544621 to c866b9f, on April 26, 2025 17:27
@charleskawczynski (Member Author)

Let's just make this a direct dependency.

@charleskawczynski force-pushed the ck/clima_cartesian_indices branch from 347b75e to 23cef18 on April 28, 2025 18:05
Apply formatter

Print more type info in JET tests

Update deps

Fixes
@charleskawczynski (Member Author)

I think these two builds are fair to compare (here I'm comparing the dry baro wave):

* https://buildkite.com/clima/climaatmos-ci/builds/24041#01967d72-2013-4b68-916f-0b19dccbbf0a (main branch, sypd=68.483, P100)

* https://buildkite.com/clima/climaatmos-ci/builds/24043#01967db1-4b79-4c59-aad5-794c3183d862 (this branch, sypd=78.787, P100)

We should probably be doing a comparison on the A100 (though I posted microbenchmarks on an A100 in the ClimaCartesianIndices docs), but it's at least in agreement with microbenchmarks and shows that there is some benefit for real-world cases.

cc @dennisYatunin, @Sbozzolo

@Sbozzolo (Member) commented Apr 29, 2025

Here's a build on A100 with ClimaCore 0.14.30 (the numbers below are SYPD):
https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/418

1 GPU dry baro: 3.583
4 GPU dry baro: 10.728
1 GPU Moist HS: 1.856
4 GPU Moist HS: 5.616
1 GPU Aquaplanet: 0.694
2 GPU Aquaplanet: 1.358

Here's one with 0.14.31
https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/419

1 GPU dry baro: 3.461
4 GPU dry baro: 10.564
1 GPU Moist HS: 1.801
4 GPU Moist HS: 5.745
1 GPU Aquaplanet: 0.503
2 GPU Aquaplanet: 1.005

Here's one with ClimaCartesianIndices:
https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/420

1 GPU dry baro: 3.434
4 GPU dry baro: 10.646
1 GPU Moist HS: 1.755
4 GPU Moist HS: 5.519
1 GPU Aquaplanet: 0.504
2 GPU Aquaplanet: 1.005

@akshaysridhar (Member) commented Apr 30, 2025

> I think these two builds are fair to compare (here I'm comparing the dry baro wave):
>
> * https://buildkite.com/clima/climaatmos-ci/builds/24041#01967d72-2013-4b68-916f-0b19dccbbf0a (main branch, sypd=68.483, P100)
>
> * https://buildkite.com/clima/climaatmos-ci/builds/24043#01967db1-4b79-4c59-aad5-794c3183d862 (this branch, sypd=78.787, P100)
>
> We should probably be doing a comparison on the A100 (though I posted microbenchmarks on an A100 in the ClimaCartesianIndices docs), but it's at least in agreement with microbenchmarks and shows that there is some benefit for real-world cases.
>
> cc @dennisYatunin, @Sbozzolo

I agree that A100 estimates are much more reliable. For example, on the P100 central queue, the build linked above gives 68.48 SYPD for the dry baroclinic wave problem, vs 83.905 SYPD in the latest RRTMGP interface update, PR 3786 (both use RRTMGP 0.21.2 and ClimaCore 0.14.31).

@akshaysridhar (Member)

https://gist.github.com/akshaysridhar/fca4d50242a8bcc65faa8289b03cdc8f shows the target-gpu timings across 4 older commits between 0.14.30 and 0.14.31.

@charleskawczynski (Member Author)

> Here's a build on A100 with ClimaCore 0.14.30 https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/418

The 0.14.31 release also included upgraded dependencies for a bunch of other packages, including SciMLBase, GPUCompiler, and a Tracy extension. Are you sure that this is a fair comparison? The Tracy extension, which @kmdeck and I observed showing up in the Nsight report, may very well have added overhead to those jobs. Did anyone look into this?

Reporting SYPD across versions can help identify whether regressions occur, but it doesn't explain why. Has anyone looked at the Nsight reports? The build @Sbozzolo posted seems to have a bunch of canceled jobs, so it's not clear to me where those numbers are coming from. I'd suggest we report links to specific jobs with Nsight reports, so that we can identify what exactly has slowed down.

@akshaysridhar (Member) commented May 13, 2025

> Here's a build on A100 with ClimaCore 0.14.30 https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/418
>
> The 0.14.31 also included upgraded dependencies for a bunch of other packages, including SciMLBase, GPUCompiler and a Tracy extension. Are you sure that this is a fair comparison? The Tracy extension, which @kmdeck and I observed showed up in the Nsight report, very well may have added overhead to those jobs. Did anyone look into this?
>
> Reporting SYPD across versions can help identify if regressions occur, but it doesn't explain why. Has anyone looked at the Nsight reports? The build @Sbozzolo posted seems to have a bunch of canceled jobs, so it's not clear to me where those numbers are coming from. I'd suggest we report links to specific jobs with Nsight reports, so that we can identify what exactly has slowed down.

I think it is a reasonable comparison (with respect to additional noise from unrelated commits, I mean): the Buildkite IDs I've reported are a sampling of commits between the two releases between which the degradation was noticed. I'm not aware of the Tracy extension, but we can take a look at this shortly. It's possible that identifying issues from these updated dependencies does indeed help regain some performance, but this list simply varies the ClimaCore version against a fixed [b2c96348] ClimaAtmos v0.30.0. It seems that, with these updated dependencies listed above, disabling fd shmem helped recover a significant portion of the SYPD degradation. (It's worth re-running an updated version of this branch against ClimaAtmos main in any case for reference; I'll start that job shortly.)

@charleskawczynski (Member Author)

> It seems that, with these updated dependencies listed above, disabling fd shmem helped recover a significant portion of the SYPD degradation. (It's worth re-running an updated version of this branch against ClimaAtmos main in any case for reference; I'll start that job shortly.)

Yeah, disabling shmem will improve performance for low vertical resolution jobs, but it will degrade performance for high-resolution jobs. The performance regression came from switching from a multi-dimensional launch configuration to a linear one + CartesianIndices. The shmem difference is orthogonal to versions 0.14.30 and 0.14.31.

@akshaysridhar (Member) commented May 13, 2025

> It seems that, with these updated dependencies listed above, disabling fd shmem helped recover a significant portion of the SYPD degradation. (It's worth re-running an updated version of this branch against ClimaAtmos main in any case for reference; I'll start that job shortly.)
>
> Yeah, disabling shmem will improve performance for low vertical resolution jobs, but it will degrade performance for high-resolution jobs. The performance regression came from switching from a multi-dimensional launch configuration to a linear one + CartesianIndices. The shmem difference is orthogonal to versions 0.14.30 and 0.14.31.

Noted, thanks. The strong-scaling DYAMOND jobs in the linked gist above all use 63 layers (64 levels), so comparisons across those jobs are reasonable. What's considered low vertical resolution here: something like fewer than ~20 vertical elements?

@charleskawczynski (Member Author)

It's really difficult to reason about the results of the gist.

@charleskawczynski (Member Author)

And yeah, 20 levels is low resolution.

@Sbozzolo (Member) commented May 13, 2025

> It seems that, with these updated dependencies listed above, disabling fd shmem helped recover a significant portion of the SYPD degradation. (It's worth re-running an updated version of this branch against ClimaAtmos main in any case for reference; I'll start that job shortly.)
>
> Yeah, disabling shmem will improve performance for low vertical resolution jobs, but it will degrade performance for high-resolution jobs. The performance regression came from switching from a multi-dimensional launch configuration to a linear one + CartesianIndices. The shmem difference is orthogonal to versions 0.14.30 and 0.14.31.

Could you share builds where you see a performance increase due to shared memory in an AMIP/Aquaplanet setup?

What we observed is that disabling shared memory makes everything that is not baro wave/Held-Suarez faster. We see this across multiple pipelines: the nightly AMIP (39 vertical elements, 5% faster without shared memory), the benchmark AMIP in ClimaCoupler (63 elements, 30% faster without shared memory), and the gpu_aquaplanet case in the atmos target-gpu pipeline (63 elements, 20-30% faster without shared memory).

@charleskawczynski (Member Author)

> Could you share builds where you see a performance increase due to shared memory in an AMIP/Aquaplanet setup?
>
> What we observed is that disabling shared memory makes everything that is not baro wave/Held-Suarez faster. We see this across multiple pipelines: the nightly AMIP (39 vertical elements, 5% faster without shared memory), the benchmark AMIP in ClimaCoupler (63 elements, 30% faster without shared memory), and the gpu_aquaplanet case in the atmos target-gpu pipeline (63 elements, 20-30% faster without shared memory).

I'm more interested in knowing which specific kernels, if any, are slower with shmem.
