Simplify CUDA remapping #2156
base: main
Conversation
The CUDA kernels for remapping worked in spite of a bug: the kernels used a full 3D launch grid to distribute the work, but only a 1D grid was being launched. This commit changes three things:
- it removes the additional two dimensions from the kernels. This is the same behavior as in `main`, except that all the dead code is removed
- it reorders some for loops in the kernel so that the outermost loop is over the fields (in practice, this should not have consequences downstream because we are not using this feature yet)
- it removes a duplicated `env` in a buildkite step
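The bug described above can be illustrated with a small, hedged sketch (plain Python, hypothetical helper names; not the actual ClimaCore kernel): with a 1D launch, the unused grid dimensions have extent 1, so the strided loops over them degenerate to plain loops and still visit every index, which is why the kernels produced correct results anyway.

```python
# Hedged sketch of the bug (plain Python, hypothetical helper; not the actual
# ClimaCore kernel): the kernels strided part of the work over extra grid
# dimensions, but the host launched a 1D grid, so those dimensions always had
# extent 1.  A strided loop with start 1 and stride 1 visits every index,
# which is why the kernels still produced correct results: the extra
# dimensions were dead code, not a correctness bug.

def covered(start, stride, n):
    """Return the 1-based indices (as in Julia) visited by one thread of a
    grid-stride loop with the given start index and stride."""
    return set(range(start, n + 1, stride))

num_vert = 10

# Intended 3D launch: e.g. 4 threads along y, each covering a strided subset;
# together they cover all vertical levels.
union_3d = set()
for ty in range(1, 5):          # threadIdx().y in 1:4
    union_3d |= covered(ty, 4, num_vert)

# Actual 1D launch: a single thread along y (start 1, stride 1) covers
# everything by itself, so the result is the same.
union_1d = covered(1, 1, num_vert)

assert union_3d == union_1d == set(range(1, num_vert + 1))
```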
for k in 1:num_fields
    for i in hindex:totalThreadsX:num_horiz
        h = local_horiz_indices[i]
        for j in 1:num_vert
            v_lo, v_hi = vert_bounding_indices[j]
            A, B = vert_interpolation_weights[j]
            out[i, j, k] = 0
            for t in 1:Nq, s in 1:Nq
Changing the loop order here could have a notable impact on performance. I'm fine with first simplifying the ranges and then maybe updating the threading pattern / parallelism.
I changed the order precisely because I expect it to be faster: retrieving values from different fields is not the most efficient memory access pattern, so I moved the field index to the outermost loop.
In my tests, this does not degrade performance (it improves it very slightly).
But I can remove this change from this PR if you prefer.
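The argument above can be checked with a hedged sketch (plain Python model, hypothetical names mirroring the Julia snippet; not the actual kernel): hoisting the field index `k` to the outermost position changes only the order in which memory is touched, not the result, since each output value is accumulated in exactly the same sequence.

```python
# Hedged sketch (not the actual ClimaCore kernel): a pure-Python model of the
# remapping loop nest.  `src[k][i][j][t][s]` stands in for the per-field data;
# num_fields, num_horiz, num_vert, Nq mirror the names in the Julia snippet.

def remap_field_innermost(src, weights, num_fields, num_horiz, num_vert, Nq):
    """Original order: horizontal index outermost, field index innermost."""
    out = {}
    for i in range(num_horiz):
        for j in range(num_vert):
            for k in range(num_fields):
                acc = 0.0
                for t in range(Nq):
                    for s in range(Nq):
                        acc += weights[t][s] * src[k][i][j][t][s]
                out[(i, j, k)] = acc
    return out

def remap_field_outermost(src, weights, num_fields, num_horiz, num_vert, Nq):
    """Reordered: field index k outermost, so each thread finishes one
    field's data before touching the next field."""
    out = {}
    for k in range(num_fields):
        for i in range(num_horiz):
            for j in range(num_vert):
                acc = 0.0
                for t in range(Nq):
                    for s in range(Nq):
                        acc += weights[t][s] * src[k][i][j][t][s]
                out[(i, j, k)] = acc
    return out
```

Because the per-element accumulation order (over `t`, `s`) is identical, the two versions produce bit-identical floating-point results; only the access pattern across fields differs.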
Ok, that makes sense. I still think that the best path forward is to improve the parallelism.
I opened #2159 to give an example of how I think this should be refactored and how we should improve the parallelism.
I didn't unify the indices because doing what I did in this PR was an easy fix, whereas thinking about how to do indices properly would have required more work.
Also, it's not immediately clear to me that #2159 will be better: every thread has to read the various arrays that are passed in, whereas this PR avoids some of those reads because some values are reused (e.g., `local_horiz_indices[i]`). The parallelization is over horizontal points, and for a typical output this is on the order of 10^4-10^5 points, which should be plenty of threads to expose enough work to the GPU.
But if you want to go ahead and finish #2159, I'd be happy with it.
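To put the thread-count argument in numbers, here is a hedged back-of-the-envelope sketch (hypothetical block size; not ClimaCore's actual launch code): parallelizing over ~10^4-10^5 horizontal points already yields hundreds of blocks at a typical block size, which is enough independent work to occupy a GPU.

```python
# Hedged back-of-the-envelope sketch (hypothetical block size of 256; not
# ClimaCore's actual launch code): parallelism over horizontal points alone
# already produces a healthy number of thread blocks for a typical output.

import math

def launch_config(num_horiz, threads_per_block=256):
    """1D launch covering one thread per horizontal point: enough blocks so
    that blocks * threads_per_block >= num_horiz."""
    blocks = math.ceil(num_horiz / threads_per_block)
    return blocks, threads_per_block

blocks, tpb = launch_config(50_000)
# 50_000 horizontal points at 256 threads/block -> 196 blocks
```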