
Simplify CUDA remapping #2156

Open · Sbozzolo wants to merge 1 commit into `main`
Conversation

Sbozzolo (Member):

The CUDA kernels for remapping worked in spite of a bug: the kernels computed indices as if a full 3D launch grid were distributing the work, but only a 1D grid was actually being launched.

This commit changes three things:

  • it removes the two extra dimensions from the kernels. The behavior is the same as in `main`, just with the dead code removed (see the sketch below)
  • it reorders some `for` loops in the kernel so that the outermost loop runs over the fields (in practice, this should have no downstream consequences because we are not using this feature yet)
  • it removes a duplicated `env` in a Buildkite step
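
For illustration, here is a minimal standalone sketch of the mismatch (not the actual ClimaCore kernel; `remap_kernel_3d!`, the output array, and the launch sizes are made up): the kernel computes y/z thread indices, but the 1D launch gives them no extent, so they are always 1 and the extra dimensions are dead code.

```julia
using CUDA

# Hypothetical kernel, not the ClimaCore one: it derives a 3D thread index,
# even though the launch below only spans the x dimension.
function remap_kernel_3d!(out)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    j = threadIdx().y + (blockIdx().y - 1) * blockDim().y  # always 1 under a 1D launch
    k = threadIdx().z + (blockIdx().z - 1) * blockDim().z  # always 1 under a 1D launch
    if i <= length(out)
        out[i] = i + j + k  # j and k never vary, so this part of the logic is dead
    end
    return nothing
end

out = CUDA.zeros(Int, 256)
# 1D launch: `threads` and `blocks` are scalars, so there is no y/z extent.
@cuda threads = 64 blocks = 4 remap_kernel_3d!(out)
```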

Comment on lines +79 to +86
```julia
for k in 1:num_fields
    for i in hindex:totalThreadsX:num_horiz
        h = local_horiz_indices[i]
        for j in 1:num_vert
            v_lo, v_hi = vert_bounding_indices[j]
            A, B = vert_interpolation_weights[j]
            out[i, j, k] = 0
            for t in 1:Nq, s in 1:Nq
```
Member:

Changing the loop order here could have a notable impact on performance. I'm fine with first simplifying the ranges and then maybe just updating the threading pattern / parallelism.

Member Author:

I changed the order precisely because I expect it to be faster: retrieving values from different fields is not the most efficient memory access pattern, so I moved the field index to the outermost loop.
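
Roughly, the change is the following (a plain-Julia sketch on the CPU; `field_arrays` and the sizes are made up, not the actual ClimaCore data structures or kernel code):

```julia
# Made-up sizes and data, for illustration only.
num_horiz, num_vert, num_fields = 4, 3, 2
field_arrays = [rand(num_horiz, num_vert) for _ in 1:num_fields]
out = zeros(num_horiz, num_vert, num_fields)

# Before (schematically): the field index `k` sat in an inner loop, so
# consecutive reads jumped between the arrays of different fields.
for i in 1:num_horiz, j in 1:num_vert, k in 1:num_fields
    out[i, j, k] = field_arrays[k][i, j]
end

# After: `k` is outermost, so each sweep streams through one field's data
# before moving on to the next field.
for k in 1:num_fields, i in 1:num_horiz, j in 1:num_vert
    out[i, j, k] = field_arrays[k][i, j]
end
```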

In my tests, this does not degrade performance (it even improves it very slightly).

But I can remove this change from this PR if you prefer.

Member:

Ok, that makes sense. I still think that the best path forward is to improve the parallelism.

Member:

I opened #2159 to give an example of how I think this should be refactored and of how we should improve the parallelism.

Member Author:

I didn't unify the indices because what I did in this PR was an easy fix, whereas working out how to handle the indices properly would have required more work.

Also, it's not immediately clear to me that #2159 will be better, because every thread has to read the various arrays that are passed in, whereas this PR avoids some of those reads by reusing values (e.g., `local_horiz_indices[i]`). The parallelization is over horizontal points, and for a typical output this is on the order of 10^4-10^5 points, which should be plenty of threads to expose enough work to the GPU.

But if you want to go ahead and finish #2159, I'd be happy with it.
