Use ClimaCartesianIndices #2304

Open · wants to merge 1 commit into base: main from ck/clima_cartesian_indices

Conversation

@charleskawczynski (Member)

This PR adds ClimaCartesianIndices.jl, which can yield better GPU performance for simple copyto! kernels.

This PR should move us from perf_cart_index!(X, Y, CartesianIndices(...))-like performance to perf_cart_index!(X, Y, fast_ci(...))-like performance for most datalayouts (all but those whose field dimension comes last, which already exhibit perf_linear_index!-like performance).
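
For reference, here is a minimal sketch (using CUDA.jl) of the two indexing styles being compared. The kernel names are taken from the example referenced above; the bodies and the usage snippet are illustrative assumptions rather than the actual benchmark code:

```julia
using CUDA

# Illustrative sketch only: the kernel names come from the example referenced
# above, but these bodies are assumptions, not the actual benchmark code.
function perf_cart_index!(X, Y, indices)
    i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    if i <= length(indices)
        I = indices[i]            # linear -> Cartesian conversion (integer divisions)
        @inbounds X[I] = Y[I]
    end
    return nothing
end

function perf_linear_index!(X, Y, n)
    i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    if i <= n
        @inbounds X[i] = Y[i]     # plain linear indexing, no conversion
    end
    return nothing
end

# Hypothetical usage:
# X = CUDA.zeros(4, 4, 1, 63, 5400); Y = CUDA.rand(4, 4, 1, 63, 5400)
# ci = CartesianIndices(Y)   # or the faster index object from ClimaCartesianIndices.jl
# @cuda threads = 256 blocks = cld(length(ci), 256) perf_cart_index!(X, Y, ci)
```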

@charleskawczynski force-pushed the ck/clima_cartesian_indices branch from 8c80b51 to a29c41e on April 21, 2025 15:59
@charleskawczynski force-pushed the ck/clima_cartesian_indices branch 2 times, most recently from 1b0e649 to 036606a, on April 21, 2025 18:37
@dennisYatunin (Member)

Would it be possible to run ClimaAtmos CI with this change and measure the effect on SYPD for GPU runs? The simple array example you've linked looks promising but it would be good to measure a real-world example.

@charleskawczynski (Member Author) commented Apr 21, 2025

> Would it be possible to run ClimaAtmos CI with this change and measure the effect on SYPD for GPU runs? The simple array example you've linked looks promising but it would be good to measure a real-world example.

Yes, absolutely. I have a few fixes pending. Assuming it’s strictly better, does this look alright?

I will say that I'm hopeful: the simple example improves because CUDA cannot hide the latency of the expensive integer division when there are few loads / operations. So, I think it should be cheaper in all cases.
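
To illustrate the integer-division cost (a plain-Julia sketch of the conversion, not the actual GPU code path): every lookup CartesianIndices(dims)[i] peels off one coordinate per dimension, and each peel costs an integer division.

```julia
# One divrem per dimension; in a simple copy kernel there is almost no other
# work for the GPU to overlap with this latency.
function linear_to_cartesian(dims::NTuple{N, Int}, i::Int) where {N}
    rest = i - 1
    coords = Int[]
    for d in 1:N
        q, r = divrem(rest, dims[d])   # integer division for dimension d
        push!(coords, r + 1)
        rest = q
    end
    return CartesianIndex(coords...)
end

dims = (4, 4, 1, 63, 5400)
@assert linear_to_cartesian(dims, 123_456) == CartesianIndices(dims)[123_456]
```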

@dennisYatunin (Member)

My only concern is that length(CI) > typemax(Int32) might be type-unstable and trigger allocations. If you think that's fine then I'm okay with these changes.

@charleskawczynski force-pushed the ck/clima_cartesian_indices branch from 3f5bc86 to d5bb492 on April 22, 2025 13:29
@charleskawczynski (Member Author)

The copyto! benchmarks show:

# (previous) Multi-dimensional launch configuration (this is not robust to resolution changes)
N reads-writes: 2, N-reps: 10000,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────┬──────────────────────────────────┬────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call                    │ bw %       │ achieved bw │ problem size        │
├───────┼──────────────────────────────────┼────────────┼─────────────┼─────────────────────┤
│ DataF │ 12 microseconds, 890 nanoseconds │ 5.66957e-5 │ 0.00115602  │ (1, 1, 1, 1, 1)     │
│ IJFH  │ 14 microseconds, 780 nanoseconds │ 4.27211    │ 87.1083     │ (4, 4, 1, 1, 5400)  │
│ IJHF  │ 13 microseconds, 570 nanoseconds │ 4.65304    │ 94.8755     │ (4, 4, 1, 1, 5400)  │
│ IFH   │ 13 microseconds, 740 nanoseconds │ 1.14887    │ 23.4254     │ (4, 1, 1, 1, 5400)  │
│ IHF   │ 13 microseconds, 569 nanoseconds │ 1.16335    │ 23.7206     │ (4, 1, 1, 1, 5400)  │
│ VF    │ 12 microseconds, 900 nanoseconds │ 0.00356906 │ 0.0727731   │ (1, 1, 1, 63, 1)    │
│ VIJFH │ 63 microseconds, 401 nanoseconds │ 62.7434    │ 1279.34     │ (4, 4, 1, 63, 5400) │
│ VIJHF │ 62 microseconds, 789 nanoseconds │ 63.3539    │ 1291.79     │ (4, 4, 1, 63, 5400) │
│ VIFH  │ 20 microseconds, 230 nanoseconds │ 49.1612    │ 1002.4      │ (4, 1, 1, 63, 5400) │
│ VIHF  │ 18 microseconds, 390 nanoseconds │ 54.0774    │ 1102.64     │ (4, 1, 1, 63, 5400) │
└───────┴──────────────────────────────────┴────────────┴─────────────┴─────────────────────┘
CartesianIndices (main branch)
N reads-writes: 2, N-reps: 10000,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────┬───────────────────────────────────┬────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call                     │ bw %       │ achieved bw │ problem size        │
├───────┼───────────────────────────────────┼────────────┼─────────────┼─────────────────────┤
│ DataF │ 14 microseconds, 751 nanoseconds  │ 4.95463e-5 │ 0.00101025  │ (1, 1, 1, 1, 1)     │
│ IJFH  │ 16 microseconds, 860 nanoseconds  │ 3.74506    │ 76.3618     │ (4, 4, 1, 1, 5400)  │
│ IJHF  │ 13 microseconds, 679 nanoseconds  │ 4.61596    │ 94.1195     │ (4, 4, 1, 1, 5400)  │
│ IFH   │ 15 microseconds, 241 nanoseconds  │ 1.03579    │ 21.1198     │ (4, 1, 1, 1, 5400)  │
│ IHF   │ 13 microseconds, 451 nanoseconds  │ 1.17364    │ 23.9305     │ (4, 1, 1, 1, 5400)  │
│ VF    │ 12 microseconds, 791 nanoseconds  │ 0.00359975 │ 0.073399    │ (1, 1, 1, 63, 1)    │
│ VIJFH │ 117 microseconds, 230 nanoseconds │ 33.933     │ 691.894     │ (4, 4, 1, 63, 5400) │
│ VIJHF │ 62 microseconds, 601 nanoseconds  │ 63.5452    │ 1295.69     │ (4, 4, 1, 63, 5400) │
│ VIFH  │ 36 microseconds, 520 nanoseconds  │ 27.2312    │ 555.244     │ (4, 1, 1, 63, 5400) │
│ VIHF  │ 18 microseconds, 70 nanoseconds   │ 55.0381    │ 1122.23     │ (4, 1, 1, 63, 5400) │
└───────┴───────────────────────────────────┴────────────┴─────────────┴─────────────────────┘
FastCartesianIndices (this branch)
N reads-writes: 2, N-reps: 10000,  Float_type = Float64, Device_bandwidth_GBs=2039
┌───────┬──────────────────────────────────┬────────────┬─────────────┬─────────────────────┐
│ funcs │ time per call                    │ bw %       │ achieved bw │ problem size        │
├───────┼──────────────────────────────────┼────────────┼─────────────┼─────────────────────┤
│ DataF │ 15 microseconds, 41 nanoseconds  │ 4.85909e-5 │ 0.000990769 │ (1, 1, 1, 1, 1)     │
│ IJFH  │ 16 microseconds, 580 nanoseconds │ 3.80831    │ 77.6514     │ (4, 4, 1, 1, 5400)  │
│ IJHF  │ 13 microseconds, 661 nanoseconds │ 4.62238    │ 94.2504     │ (4, 4, 1, 1, 5400)  │
│ IFH   │ 15 microseconds, 659 nanoseconds │ 1.00807    │ 20.5546     │ (4, 1, 1, 1, 5400)  │
│ IHF   │ 13 microseconds, 570 nanoseconds │ 1.16326    │ 23.7189     │ (4, 1, 1, 1, 5400)  │
│ VF    │ 13 microseconds, 481 nanoseconds │ 0.00341549 │ 0.0696419   │ (1, 1, 1, 63, 1)    │
│ VIJFH │ 87 microseconds, 241 nanoseconds │ 45.5976    │ 929.734     │ (4, 4, 1, 63, 5400) │
│ VIJHF │ 63 microseconds, 19 nanoseconds  │ 63.1227    │ 1287.07     │ (4, 4, 1, 63, 5400) │
│ VIFH  │ 27 microseconds, 370 nanoseconds │ 36.3348    │ 740.866     │ (4, 1, 1, 63, 5400) │
│ VIHF  │ 18 microseconds, 179 nanoseconds │ 54.705     │ 1115.44     │ (4, 1, 1, 63, 5400) │
└───────┴──────────────────────────────────┴────────────┴─────────────┴─────────────────────┘

So, we recover the ~25% performance loss incurred when we reverted from the multi-dimensional launch configuration, which agrees with what I saw in the ClimaCartesianIndices docs. We could get another ~15-20% if we passed the indices through with Val, but we don't have Nh in the type space, so that would be type-unstable / a dynamic allocation.
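
For reference, here is a tiny sketch of why the Val route would be type-unstable (Nh below is just an illustrative runtime value, not ClimaCore code):

```julia
# `Val` only helps when the wrapped value is known at compile time; wrapping a
# runtime number forces dynamic dispatch (and an allocation) at the call site.
f(::Val{N}) where {N} = N + 1

g_static() = f(Val(5400))    # literal: fully inferable, no allocation
g_dynamic(Nh) = f(Val(Nh))   # runtime Nh: Val{Nh} is not inferable

Nh = 5400
# @code_warntype g_dynamic(Nh)  # shows the abstract return type / dynamic call
```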

I'll try running ClimaAtmos next, and compare against the main branch.

@charleskawczynski (Member Author)

> My only concern is that length(CI) > typemax(Int32) might be type-unstable and trigger allocations. If you think that's fine then I'm okay with these changes.

Agreed. I changed it to a type assert and always use FastCartesianIndices. I don't think we'll need more than 2 billion points per process (I'm not sure that we can even make a space that big).
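
A minimal sketch of the guard described here, assuming a FastCartesianIndices constructor that accepts a dims tuple (the helper name is illustrative, not the ClimaCore implementation):

```julia
using ClimaCartesianIndices  # assumed to export FastCartesianIndices

# Instead of branching on the length (which would make the index type depend on
# a runtime value), assert that the index space fits in Int32 and always build
# the fast indices.
function checked_fast_indices(dims::Dims)
    n = prod(dims)
    n <= Int(typemax(Int32)) ||
        error("index space has $n points, which exceeds typemax(Int32)")
    return FastCartesianIndices(dims)
end
```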

@charleskawczynski force-pushed the ck/clima_cartesian_indices branch 4 times, most recently from c544621 to c866b9f, on April 26, 2025 17:27
@charleskawczynski (Member Author)

Let's just make this a direct dependency.

@charleskawczynski force-pushed the ck/clima_cartesian_indices branch from 347b75e to 23cef18 on April 28, 2025 18:05
Apply formatter

Print more type info in JET tests

Update deps

Fixes
@charleskawczynski (Member Author)

I think these two builds are fair to compare (here I'm comparing the dry baro wave):

* https://buildkite.com/clima/climaatmos-ci/builds/24041#01967d72-2013-4b68-916f-0b19dccbbf0a (main branch, sypd=68.483, P100)

* https://buildkite.com/clima/climaatmos-ci/builds/24043#01967db1-4b79-4c59-aad5-794c3183d862 (this branch, sypd=78.787, P100)

We should probably be doing a comparison on the A100 (though I posted microbenchmarks on an A100 in the ClimaCartesianIndices docs), but it's at least in agreement with microbenchmarks and shows that there is some benefit for real-world cases.

cc @dennisYatunin, @Sbozzolo

@Sbozzolo (Member) commented Apr 29, 2025

Here's a build on A100 with ClimaCore 0.14.30 (the numbers below are SYPD):
https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/418

1 GPU dry baro: 3.583
4 GPU dry baro: 10.728
1 GPU Moist HS: 1.856
4 GPU Moist HS: 5.616
1 GPU Aquaplanet: 0.694
2 GPU Aquaplanet: 1.358

Here's one with 0.14.31
https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/419

1 GPU dry baro: 3.461
4 GPU dry baro: 10.564
1 GPU Moist HS: 1.801
4 GPU Moist HS: 5.745
1 GPU Aquaplanet: 0.503
2 GPU Aquaplanet: 1.005

Here's one with ClimaCartesianIndices:
https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/420

1 GPU dry baro: 3.434
4 GPU dry baro: 10.646
1 GPU Moist HS: 1.755
4 GPU Moist HS: 5.519
1 GPU Aquaplanet: 0.504
2 GPU Aquaplanet: 1.005

@akshaysridhar (Member) commented Apr 30, 2025

> I think these two builds are fair to compare (here I'm comparing the dry baro wave):
>
> * https://buildkite.com/clima/climaatmos-ci/builds/24041#01967d72-2013-4b68-916f-0b19dccbbf0a (main branch, sypd=68.483, P100)
>
> * https://buildkite.com/clima/climaatmos-ci/builds/24043#01967db1-4b79-4c59-aad5-794c3183d862 (this branch, sypd=78.787, P100)
>
> We should probably be doing a comparison on the A100 (though I posted microbenchmarks on an A100 in the ClimaCartesianIndices docs), but it's at least in agreement with microbenchmarks and shows that there is some benefit for real-world cases.
>
> cc @dennisYatunin, @Sbozzolo

I agree that A100 estimates are much more reliable. For example, on the P100 central queue, the build linked above gives 68.48 SYPD for the dry baroclinic wave problem, vs 83.905 SYPD in the latest RRTMGP interface update, PR 3786 (both use RRTMGP 0.21.2 and ClimaCore 0.14.31).

@akshaysridhar (Member)

https://gist.github.com/akshaysridhar/fca4d50242a8bcc65faa8289b03cdc8f shows the target-gpu timings across 4 older commits between 0.14.30 and 0.14.31.

@charleskawczynski (Member Author)

> Here's a build on A100 with ClimaCore 0.14.30 https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/418

The 0.14.31 release also included upgraded dependencies for a bunch of other packages, including SciMLBase, GPUCompiler, and a Tracy extension. Are you sure that this is a fair comparison? The Tracy extension, which @kmdeck and I observed showing up in the Nsight report, may very well have added overhead to those jobs. Did anyone look into this?

Reporting SYPD across versions can help identify whether regressions occur, but it doesn't explain why. Has anyone looked at the Nsight reports? The build @Sbozzolo posted seems to have a bunch of canceled jobs, so it's not clear to me where those numbers are coming from. I'd suggest we report links to specific jobs with Nsight reports, so that we can identify what exactly has slowed down.

@akshaysridhar (Member) commented May 13, 2025

> Here's a build on A100 with ClimaCore 0.14.30 https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/418
>
> The 0.14.31 also included upgraded dependencies for a bunch of other packages, including SciMLBase, GPUCompiler and a Tracy extension. Are you sure that this is a fair comparison? The Tracy extension, which @kmdeck and I observed showed up in the Nsight report, very well may have added overhead to those jobs. Did anyone look into this?
>
> Reporting SYPD across versions can help identify if regressions occur, but it doesn't explain why. Has anyone looked at the Nsight reports? The build @Sbozzolo posted seems to have a bunch of canceled jobs, so it's not clear to me where those numbers are coming from. I'd suggest we report links to specific jobs with Nsight reports, so that we can identify what exactly has slowed down.

I think it is a reasonable comparison (with respect to additional noise from unrelated commits, I mean): the Buildkite IDs I've reported are a sampling of commits between the two releases between which the degradation was noticed. I'm not aware of the Tracy extension, but we can take a look at this shortly. It's possible that identifying issues from these updated dependencies does indeed help regain some performance, but this list simply varies the ClimaCore version against a fixed [b2c96348] ClimaAtmos v0.30.0. It seems that, with these updated dependencies listed above, disabling fd shmem helped recover a significant portion of the SYPD degradation. (It's worth re-running an updated version of this branch against ClimaAtmos main in any case for reference; I'll start that job shortly.)

@charleskawczynski (Member Author)

> It seems that, with these updated dependencies listed above, disabling fd shmem helped recover a significant portion of the SYPD degradation. (It's worth re-running an updated version of this branch against ClimaAtmos main in any case for reference; I'll start that job shortly.)

Yeah, disabling shmem will improve performance for low vertical resolution jobs, but it will degrade performance for high-resolution jobs. The performance regression came from switching from a multi-dimensional launch configuration to a linear one + CartesianIndices. The shmem difference is orthogonal to versions 0.14.30 and 0.14.31.

@akshaysridhar (Member) commented May 13, 2025

> It seems that, with these updated dependencies listed above, disabling fd shmem helped recover a significant portion of the SYPD degradation. (It's worth re-running an updated version of this branch against ClimaAtmos main in any case for reference; I'll start that job shortly.)
>
> Yeah, disabling shmem will improve performance for low vertical resolution jobs, but it will degrade performance for high-resolution jobs. The performance regression came from switching from a multi-dimensional launch configuration to a linear one + CartesianIndices. The shmem difference is orthogonal to versions 0.14.30 and 0.14.31.

Noted, thanks. The strong-scaling DYAMOND jobs in the linked gist above all use 63 layers (64 levels), so comparisons across those jobs are reasonable. What's considered low vertical resolution here: something like fewer than ~20 vertical elements?

@charleskawczynski (Member Author)

It's really difficult to reason about the results of the gist.

@charleskawczynski (Member Author)

And yeah, 20 levels is low resolution.

@Sbozzolo (Member) commented May 13, 2025

> It seems that, with these updated dependencies listed above, disabling fd shmem helped recover a significant portion of the SYPD degradation. (It's worth re-running an updated version of this branch against ClimaAtmos main in any case for reference; I'll start that job shortly.)
>
> Yeah, disabling shmem will improve performance for low vertical resolution jobs, but it will degrade performance for high-resolution jobs. The performance regression came from switching from a multi-dimensional launch configuration to a linear one + CartesianIndices. The shmem difference is orthogonal to versions 0.14.30 and 0.14.31.

Could you share builds where you see a performance increase due to shared memory in an AMIP/Aquaplanet setup?

What we observed is that disabling shared memory makes everything that is not baro wave/Held-Suarez faster. We see this across multiple pipelines: the nightly AMIP (39 vertical elements, 5% faster without shared memory), the benchmark AMIP in ClimaCoupler (63 elements, 30% faster without shared memory), and the gpu_aquaplanet case in the atmos target-gpu pipeline (63 elements, 20-30% faster without shared memory).

@charleskawczynski (Member Author)

> Could you share builds where you see a performance increase due to shared memory in an AMIP/Aquaplanet setup?
>
> What we observed is that disabling shared memory makes everything that is not baro wave/Held-Suarez faster. We see this across multiple pipelines: the nightly AMIP (39 vertical elements, 5% faster without shared memory), the benchmark AMIP in ClimaCoupler (63 elements, 30% faster without shared memory), and the gpu_aquaplanet case in the atmos target-gpu pipeline (63 elements, 20-30% faster without shared memory).

I'm more interested in knowing which specific kernels, if any, are slower with shmem.
