Tuning reorder123x321_kernel #14
For this particular kernel the order of dimensions may matter, because the 123→321 reorder swaps the fastest- and slowest-varying dimensions, so reads and writes cannot both be coalesced when going directly from global to global memory. In this case the best memory bandwidth could be achieved by reading into shared memory, reordering there, and then writing to the output.
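Just to make the idea concrete, here is a minimal sketch of such a shared-memory tiled reorder, assuming a plain 123→321 permutation of a 3D array of doubles; the tile size, element type, and argument names are my assumptions and not taken from the actual kernel:

```cuda
#define TILE_DIM 32

// Sketch only: A has shape (d1, d2, d3) with dimension 1 fastest-varying,
// B has shape (d3, d2, d1) with dimension 3 fastest-varying, i.e.
// B[k + d3*(j + d2*i)] = A[i + d1*(j + d2*k)].
__global__ void reorder123x321_shared_sketch(const double *A, double *B,
                                             int d1, int d2, int d3)
{
    // +1 padding avoids shared memory bank conflicts on the transposed read
    __shared__ double tile[TILE_DIM][TILE_DIM + 1];

    const int j = blockIdx.z;  // middle dimension, one slice per block

    // Read phase: consecutive threads read consecutive i -> coalesced loads from A
    int i = blockIdx.x * TILE_DIM + threadIdx.x;
    int k = blockIdx.y * TILE_DIM + threadIdx.y;
    if (i < d1 && k < d3)
        tile[threadIdx.y][threadIdx.x] = A[i + (size_t)d1 * (j + (size_t)d2 * k)];

    __syncthreads();

    // Write phase: consecutive threads write consecutive k -> coalesced stores to B
    i = blockIdx.x * TILE_DIM + threadIdx.y;
    k = blockIdx.y * TILE_DIM + threadIdx.x;
    if (i < d1 && k < d3)
        B[k + (size_t)d3 * (j + (size_t)d2 * i)] = tile[threadIdx.x][threadIdx.y];
}

// Possible launch configuration:
//   dim3 block(TILE_DIM, TILE_DIM);
//   dim3 grid((d1 + TILE_DIM - 1) / TILE_DIM, (d3 + TILE_DIM - 1) / TILE_DIM, d2);
```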
Yes, I agree. So far I have only done a simple block-size tuning, which already makes the kernel almost a factor of 2 faster: 951ebd3
Do you want to try shared memory, or should I do it?
I did actually write a first (dirty) implementation (in my test branch). I will test it tomorrow and let you know.
Yes, please, go ahead :-)
Implemented the shared memory version. Performance is the same as the standard (tuned) version, most probably because the cache behavior is already very good. So it does not make sense to use this implementation in practice right now. However, if the data size increases in the future, shared memory may be able to outperform the cache.
I was thinking that we could use the CUB library to make a fast version of this kernel: https://nvlabs.github.io/cub/classcub_1_1_block_exchange.html CUB has all kinds of optimizations built in to avoid shared memory bank conflicts across devices, data types, and block sizes, and it uses warp shuffle instructions where possible. If you can map the use case we have here onto such a block-wide primitive in CUB, it is probably the best-performing option, and you can still tune things like items per thread and thread block dimensions.
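For reference, a rough sketch of how cub::BlockExchange could be wired into such a kernel, just to show the plumbing (shared temp storage, items per thread, striped load/store). The rank computation that would encode the actual 123→321 permutation is not worked out here; d_ranks is a hypothetical precomputed table, and all names and parameters are assumptions, not code from this repository:

```cuda
#include <cub/cub.cuh>

// Sketch only: permute one tile of BLOCK_THREADS * ITEMS_PER_THREAD elements
// per thread block using cub::BlockExchange.
template <typename T, int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void reorder_blockexchange_sketch(const T *in, T *out, const int *d_ranks)
{
    typedef cub::BlockExchange<T, BLOCK_THREADS, ITEMS_PER_THREAD> BlockExchangeT;
    __shared__ typename BlockExchangeT::TempStorage temp_storage;

    const int block_offset = blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD;

    // Coalesced (striped) load of one tile of the input
    T items[ITEMS_PER_THREAD];
    cub::LoadDirectStriped<BLOCK_THREADS>(threadIdx.x, in + block_offset, items);

    // Destination ranks of the items inside the tile; for the real kernel these
    // would encode the 123 -> 321 permutation (here read from a precomputed table)
    int ranks[ITEMS_PER_THREAD];
    cub::LoadDirectStriped<BLOCK_THREADS>(threadIdx.x, d_ranks + block_offset, ranks);

    // Block-wide exchange into the permuted order; CUB handles bank conflicts
    // internally and uses warp shuffles where it can
    BlockExchangeT(temp_storage).ScatterToStriped(items, ranks);

    // Coalesced (striped) store of the permuted tile
    cub::StoreDirectStriped<BLOCK_THREADS>(threadIdx.x, out + block_offset, items);
}
```

BLOCK_THREADS and ITEMS_PER_THREAD then map directly onto the tunable parameters mentioned above.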
I started on this kernel...