Bitonic sort exceeds launch resources #2331
There seems to be some confusion about the two kernels' launch configurations in the current implementation; however, naively fixing that introduces test failures:
diff --git a/src/sorting.jl b/src/sorting.jl
index 7dd563831..70cd72e29 100644
--- a/src/sorting.jl
+++ b/src/sorting.jl
@@ -908,10 +908,12 @@ function bitonic_sort!(c; by = identity, lt = isless, rev = false, dims=1)
# compile kernels (using Int32 for indexing, if possible, yielding a 70% speedup)
I = c_len <= typemax(Int32) ? Int32 : Int
+
args1 = (c, I(c_len), one(I), one(I), one(I), by, lt, Val(rev), Val(dims))
kernel1 = @cuda launch=false comparator_small_kernel(args1...)
-
config1 = launch_configuration(kernel1.fun, shmem = threads -> bitonic_shmem(c, threads))
+ threads1 = config1.threads
+
args2 = (c, I(c_len), one(I), one(I), by, lt, Val(rev), Val(dims))
kernel2 = @cuda launch=false comparator_kernel(args2...)
config2 = launch_configuration(kernel2.fun, shmem = threads -> bitonic_shmem(c, threads))
@@ -940,11 +942,11 @@ function bitonic_sort!(c; by = identity, lt = isless, rev = false, dims=1)
pseudo_block_length = 1 << abs(j_final + 1 - j)
# N_pseudo_blocks = how many pseudo-blocks are in this layer of the network
N_pseudo_blocks = nextpow(2, c_len) ÷ pseudo_block_length
- pseudo_blocks_per_block = threads2 ÷ pseudo_block_length
+ pseudo_blocks_per_block = threads1 ÷ pseudo_block_length
# grid dimensions
N_blocks = max(1, N_pseudo_blocks ÷ pseudo_blocks_per_block), other_block_dims...
- block_size = pseudo_block_length, threads2 ÷ pseudo_block_length
+ block_size = pseudo_block_length, threads1 ÷ pseudo_block_length
kernel1(args1...; blocks=N_blocks, threads=block_size,
shmem=bitonic_shmem(c, block_size))
break
@xaellison Since you most recently looked at this code, do you have the time to take a look?
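The block-size arithmetic touched by the diff can be tried out on the CPU. The following is a minimal sketch in plain Julia, not the code from `src/sorting.jl`: the helper `small_kernel_geometry` is hypothetical, it only handles the one-dimensional case (no `other_block_dims`), and the thread limits 768 and 1024 are made-up values standing in for `config1.threads` and `config2.threads`.

```julia
# Hypothetical helper (not part of CUDA.jl): reproduces the grid/block
# arithmetic from the diff for a one-dimensional input.
function small_kernel_geometry(c_len, threads, pseudo_block_length)
    # Assumption: this is only called when a pseudo-block fits in one block.
    @assert pseudo_block_length <= threads
    N_pseudo_blocks = nextpow(2, c_len) ÷ pseudo_block_length
    pseudo_blocks_per_block = threads ÷ pseudo_block_length
    N_blocks = max(1, N_pseudo_blocks ÷ pseudo_blocks_per_block)
    block_size = (pseudo_block_length, pseudo_blocks_per_block)
    return N_blocks, block_size
end

# Made-up limits: suppose kernel1 may use 768 threads and kernel2 1024.
threads1, threads2 = 768, 1024

# Same input length and pseudo-block length, different thread budgets.
_, size_from_threads2 = small_kernel_geometry(1000, threads2, 256)
_, size_from_threads1 = small_kernel_geometry(1000, threads1, 256)

println(prod(size_from_threads2))  # total threads when sized from kernel2's limit
println(prod(size_from_threads1))  # total threads when sized from kernel1's limit
```

Under these assumed limits, sizing the `kernel1` launch from `threads2` requests more threads per block than `kernel1` is allowed, while sizing it from `threads1` stays within the limit. Whether the real limits differ in this way depends on the device and the compiled kernels.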
In #2338 I padded the kernel1 block size to a power of 2, but that seems to lead to
@xaellison This is breaking lots of CI jobs; could you give this a quick look?
fix #2353
CI has recently been showing lots of sorting-related errors, which I presume can be traced back to #2308. With #2329, we can see what's up: