Tuning reorder123x321_kernel #14
For this particular kernel the order of dimensions may matter, because the 123→321 reorder swaps the fastest- and slowest-varying dimensions, so reads and writes cannot both be coalesced when going directly from global to global memory. In this case the best memory bandwidth could be achieved by reading into shared memory, reordering there, and then writing to the output.
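Just to make the idea concrete, here is a minimal sketch of such a shared-memory tiled reorder, assuming a plain 123→321 permutation of a 3D array of doubles; the tile size, element type, and argument names are my assumptions and not taken from the actual kernel:

```cuda
#define TILE_DIM 32

// Sketch only: A has shape (d1, d2, d3) with dimension 1 fastest-varying,
// B has shape (d3, d2, d1) with dimension 3 fastest-varying, i.e.
// B[k + d3*(j + d2*i)] = A[i + d1*(j + d2*k)].
__global__ void reorder123x321_shared_sketch(const double *A, double *B,
                                             int d1, int d2, int d3)
{
    // +1 padding avoids shared memory bank conflicts on the transposed read
    __shared__ double tile[TILE_DIM][TILE_DIM + 1];

    const int j = blockIdx.z;  // middle dimension, one slice per block

    // Read phase: consecutive threads read consecutive i -> coalesced loads from A
    int i = blockIdx.x * TILE_DIM + threadIdx.x;
    int k = blockIdx.y * TILE_DIM + threadIdx.y;
    if (i < d1 && k < d3)
        tile[threadIdx.y][threadIdx.x] = A[i + (size_t)d1 * (j + (size_t)d2 * k)];

    __syncthreads();

    // Write phase: consecutive threads write consecutive k -> coalesced stores to B
    i = blockIdx.x * TILE_DIM + threadIdx.y;
    k = blockIdx.y * TILE_DIM + threadIdx.x;
    if (i < d1 && k < d3)
        B[k + (size_t)d3 * (j + (size_t)d2 * i)] = tile[threadIdx.x][threadIdx.y];
}

// Possible launch configuration:
//   dim3 block(TILE_DIM, TILE_DIM);
//   dim3 grid((d1 + TILE_DIM - 1) / TILE_DIM, (d3 + TILE_DIM - 1) / TILE_DIM, d2);
```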
Yes, I agree. So far I have only done a simple block-size tuning, which already makes the kernel almost a factor of 2 faster: 951ebd3
Do you want to try shared memory, or should I do it?
I did actually write a first (dirty) implementation (in my test branch). I will test it tomorrow and let you know.
Yes, please, go ahead :-)
Implemented the shared memory version. Performance is the same as the standard (tuned) version, most probably because the cache behavior is already very good. So it does not make sense to use this implementation in practice right now. However, if the data size increases in the future, shared memory may be able to outperform the cache.
I was thinking that we could use the CUB library to make a fast version of this kernel: https://nvlabs.github.io/cub/classcub_1_1_block_exchange.html CUB has all kinds of optimizations built in to avoid shared memory bank conflicts across devices, data types, and block sizes, and it uses warp shuffle instructions where possible. If you can map the use case we have here onto such a block-wide primitive in CUB, it is probably the best-performing option, and you can still tune things like items per thread and thread block dimensions.
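For reference, a rough sketch of how cub::BlockExchange could be wired into such a kernel, just to show the plumbing (shared temp storage, items per thread, striped load/store). The rank computation that would encode the actual 123→321 permutation is not worked out here; d_ranks is a hypothetical precomputed table, and all names and parameters are assumptions, not code from this repository:

```cuda
#include <cub/cub.cuh>

// Sketch only: permute one tile of BLOCK_THREADS * ITEMS_PER_THREAD elements
// per thread block using cub::BlockExchange.
template <typename T, int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void reorder_blockexchange_sketch(const T *in, T *out, const int *d_ranks)
{
    typedef cub::BlockExchange<T, BLOCK_THREADS, ITEMS_PER_THREAD> BlockExchangeT;
    __shared__ typename BlockExchangeT::TempStorage temp_storage;

    const int block_offset = blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD;

    // Coalesced (striped) load of one tile of the input
    T items[ITEMS_PER_THREAD];
    cub::LoadDirectStriped<BLOCK_THREADS>(threadIdx.x, in + block_offset, items);

    // Destination ranks of the items inside the tile; for the real kernel these
    // would encode the 123 -> 321 permutation (here read from a precomputed table)
    int ranks[ITEMS_PER_THREAD];
    cub::LoadDirectStriped<BLOCK_THREADS>(threadIdx.x, d_ranks + block_offset, ranks);

    // Block-wide exchange into the permuted order; CUB handles bank conflicts
    // internally and uses warp shuffles where it can
    BlockExchangeT(temp_storage).ScatterToStriped(items, ranks);

    // Coalesced (striped) store of the permuted tile
    cub::StoreDirectStriped<BLOCK_THREADS>(threadIdx.x, out + block_offset, items);
}
```

BLOCK_THREADS and ITEMS_PER_THREAD then map directly onto the tunable parameters mentioned above.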
I started on this kernel...