Add scalar reduction codegen schedule #1284
base: main
Conversation
force-pushed from 5009a3f to 6fc8883
LGTM
force-pushed from aee2026 to 81aa949
* shm[tid] += inputs[j] + inputs[j + block_size];
* }
* __syncthreads();
* for (int stride = block_size / 2; stride > 0; stride /= 2) {
The warpReduce logic is missing here.
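For reference, a sketch of the warpReduce step from the cited reduction.pdf, assuming a warp size of 32, a float accumulator, and a block size of at least 64 (so `shm` has at least 64 entries); the `shm`/`tid` names follow the snippet above:

```cuda
// Warp-level tail of the tree reduction: no __syncthreads() between
// steps; `volatile` stops the compiler from caching shared-memory
// values in registers across the unrolled updates.
__device__ void warpReduce(volatile float* shm, int tid) {
  shm[tid] += shm[tid + 32];
  shm[tid] += shm[tid + 16];
  shm[tid] += shm[tid + 8];
  shm[tid] += shm[tid + 4];
  shm[tid] += shm[tid + 2];
  shm[tid] += shm[tid + 1];
}
```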
}
{
SmallVector<Value, 4> init_values = {};
for (int stride = 128; stride > 16; stride /= 2) {
Since warp_size = 32, it is better to set the stop condition to stride > 32.
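A sketch of the generated-code shape this suggestion implies (names are illustrative; `warpReduce` as sketched earlier in the thread):

```cuda
// Tree-reduce across the block until two warps' worth of partial
// sums remain, then let the first warp finish without barriers.
for (int stride = block_size / 2; stride > 32; stride /= 2) {
  if (tid < stride) shm[tid] += shm[tid + stride];
  __syncthreads();
}
if (tid < 32) warpReduce(shm, tid);
```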
b.create<memref::LoadOp>(loc, shared_mem_map[root_op], strid_tid);
Value sum = accum_factory[idx](shm_val_1, shm_val_2);
b.create<memref::StoreOp>(loc, sum, shared_mem_map[root_op], tid);
b.create<gpu::BarrierOp>(loc);
The BarrierOp is not necessary here; threads in a warp are synchronized all the time.
I have rewritten the warp-reduction section with shuffle instructions and will update this PR later.
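A minimal sketch of a shuffle-based warp reduction, assuming a full 32-lane mask and a float value (this is not necessarily how the updated PR implements it):

```cuda
// Each step halves the number of live partial sums inside the warp;
// __shfl_down_sync reads `val` from the lane `offset` positions higher,
// so no shared memory and no barriers are needed for the warp tail.
__device__ float warpReduceSum(float val) {
  for (int offset = 16; offset > 0; offset /= 2)
    val += __shfl_down_sync(0xffffffff, val, offset);
  return val;  // lane 0 ends up holding the warp-wide sum
}
```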
/*hasElseRegion*/ false);
b.setInsertionPointToStart(&if_tid_valid_op.getThenRegion().front());
SmallVector<Value, 4> yield_values;
for (int stride = 16; stride > 0; stride /= 2) {
Start with stride = 32
Add a scalar-reduction codegen template; the algorithm comes from https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
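For context, the overall schedule from the cited PDF boils down to the pattern below. This is a hedged sketch, not the exact generated code; it assumes a power-of-two block size of at least 64 and uses the `warpReduce` helper sketched earlier in the thread:

```cuda
__global__ void reduce_sum(const float* inputs, float* out, int n) {
  extern __shared__ float shm[];
  const int tid = threadIdx.x;
  const int block_size = blockDim.x;

  // Each thread first accumulates a grid-stride slice of the input.
  float acc = 0.0f;
  for (int j = blockIdx.x * block_size + tid; j < n;
       j += gridDim.x * block_size)
    acc += inputs[j];
  shm[tid] = acc;
  __syncthreads();

  // Tree reduction in shared memory down to 64 partial sums.
  for (int stride = block_size / 2; stride > 32; stride /= 2) {
    if (tid < stride) shm[tid] += shm[tid + stride];
    __syncthreads();
  }
  // Barrier-free warp tail; thread 0 writes the per-block result.
  if (tid < 32) warpReduce(shm, tid);
  if (tid == 0) out[blockIdx.x] = shm[0];
}
```

A launch would pass the dynamic shared-memory size explicitly, e.g. `reduce_sum<<<grid, block, block * sizeof(float)>>>(inputs, out, n)`, followed by a second pass (or a second kernel) over the per-block results.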