Simplify CUDA remapping #2156
base: main
Conversation
The CUDA kernels for remapping worked in spite of a bug: the kernels used a full 3D launch grid to distribute the work, but only a 1D grid was being launched. This commit changes three things:
- it removes the additional two dimensions from the kernels. This is the same behavior as in `main`, except that all the dead code is removed
- it reorders some for loops in the kernel so that the outermost loop is over the fields (in practice, this should not have consequences downstream because we are not using this feature yet)
- it removes a duplicated `env` in a buildkite step
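The bug described above can be illustrated with a small, hedged sketch (plain Python, hypothetical helper names; not the actual ClimaCore kernel): with a 1D launch, the unused grid dimensions have extent 1, so the strided loops over them degenerate to plain loops and still visit every index, which is why the kernels produced correct results anyway.

```python
# Hedged sketch of the bug (plain Python, hypothetical helper; not the actual
# ClimaCore kernel): the kernels strided part of the work over extra grid
# dimensions, but the host launched a 1D grid, so those dimensions always had
# extent 1.  A strided loop with start 1 and stride 1 visits every index,
# which is why the kernels still produced correct results: the extra
# dimensions were dead code, not a correctness bug.

def covered(start, stride, n):
    """Return the 1-based indices (as in Julia) visited by one thread of a
    grid-stride loop with the given start index and stride."""
    return set(range(start, n + 1, stride))

num_vert = 10

# Intended 3D launch: e.g. 4 threads along y, each covering a strided subset;
# together they cover all vertical levels.
union_3d = set()
for ty in range(1, 5):          # threadIdx().y in 1:4
    union_3d |= covered(ty, 4, num_vert)

# Actual 1D launch: a single thread along y (start 1, stride 1) covers
# everything by itself, so the result is the same.
union_1d = covered(1, 1, num_vert)

assert union_3d == union_1d == set(range(1, num_vert + 1))
```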
for k in 1:num_fields
    for i in hindex:totalThreadsX:num_horiz
        h = local_horiz_indices[i]
        for j in 1:num_vert
            v_lo, v_hi = vert_bounding_indices[j]
            A, B = vert_interpolation_weights[j]
            out[i, j, k] = 0
            for t in 1:Nq, s in 1:Nq
Changing the loop order here could have a notable impact on performance. I'm fine with first simplifying the ranges and then maybe updating the threading pattern / parallelism.
I changed the order precisely because I expect it to be faster: retrieving values from different fields is not the most efficient memory access pattern, so I moved the field index to the outermost loop.
In my tests, this does not degrade performance (it improves it very slightly).
But I can remove this change from this PR if you prefer.
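The argument above can be checked with a hedged sketch (plain Python model, hypothetical names mirroring the Julia snippet; not the actual kernel): hoisting the field index `k` to the outermost position changes only the order in which memory is touched, not the result, since each output value is accumulated in exactly the same sequence.

```python
# Hedged sketch (not the actual ClimaCore kernel): a pure-Python model of the
# remapping loop nest.  `src[k][i][j][t][s]` stands in for the per-field data;
# num_fields, num_horiz, num_vert, Nq mirror the names in the Julia snippet.

def remap_field_innermost(src, weights, num_fields, num_horiz, num_vert, Nq):
    """Original order: horizontal index outermost, field index innermost."""
    out = {}
    for i in range(num_horiz):
        for j in range(num_vert):
            for k in range(num_fields):
                acc = 0.0
                for t in range(Nq):
                    for s in range(Nq):
                        acc += weights[t][s] * src[k][i][j][t][s]
                out[(i, j, k)] = acc
    return out

def remap_field_outermost(src, weights, num_fields, num_horiz, num_vert, Nq):
    """Reordered: field index k outermost, so each thread finishes one
    field's data before touching the next field."""
    out = {}
    for k in range(num_fields):
        for i in range(num_horiz):
            for j in range(num_vert):
                acc = 0.0
                for t in range(Nq):
                    for s in range(Nq):
                        acc += weights[t][s] * src[k][i][j][t][s]
                out[(i, j, k)] = acc
    return out
```

Because the per-element accumulation order (over `t`, `s`) is identical, the two versions produce bit-identical floating-point results; only the access pattern across fields differs.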
Ok, that makes sense. I still think that the best path forward is to improve the parallelism.
I opened #2159 to give an example of how I think this should be refactored and how we should improve the parallelism.
I didn't unify the indices because doing what I did in this PR was an easy fix, whereas thinking about how to do indices properly would have required more work.
Also, it's not immediately clear to me that #2159 will be better: every thread has to read the various arrays that are passed in, whereas this PR avoids some of those reads because some values are reused (e.g., `local_horiz_indices[i]`). The parallelization is over horizontal points, and for a typical output this is on the order of 10^4-10^5 points, which should be plenty of threads to expose enough work to the GPU.
But if you want to go ahead and finish #2159, I'd be happy with it.
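To put the thread-count argument in numbers, here is a hedged back-of-the-envelope sketch (hypothetical block size; not ClimaCore's actual launch code): parallelizing over ~10^4-10^5 horizontal points already yields hundreds of blocks at a typical block size, which is enough independent work to occupy a GPU.

```python
# Hedged back-of-the-envelope sketch (hypothetical block size of 256; not
# ClimaCore's actual launch code): parallelism over horizontal points alone
# already produces a healthy number of thread blocks for a typical output.

import math

def launch_config(num_horiz, threads_per_block=256):
    """1D launch covering one thread per horizontal point: enough blocks so
    that blocks * threads_per_block >= num_horiz."""
    blocks = math.ceil(num_horiz / threads_per_block)
    return blocks, threads_per_block

blocks, tpb = launch_config(50_000)
# 50_000 horizontal points at 256 threads/block -> 196 blocks
```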