
Reduction


Previous: Thread and Block Scheduling

A reduction is the process of applying an associative function to all elements in a set, combining all of the values within the set into a single value. The function used inside a reduction can be any function that follows the mathematical principle of associativity, meaning that the order in which the operands are evaluated does not change the end result.
Addition is an example of an associative function, because (4 + 2) + 6 = 4 + (2 + 6), but subtraction is not associative, because (4 - 2) - 6 ≠ 4 - (2 - 6). Other associative functions include multiplication, min, max, and the boolean operators AND, OR, and XOR to name a few. You can also use custom functions in your reductions as long as they follow associativity.
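If you want to check this for yourself, the groupings can be evaluated directly. The short snippet below is plain host code written just for illustration; it evaluates the two groupings from the examples above and shows that only the associative operation gives the same answer both ways.

#include <stdio.h>

int main() {
    // Addition is associative: both groupings give the same result
    printf("(4 + 2) + 6 = %d\n", (4 + 2) + 6);   // 12
    printf("4 + (2 + 6) = %d\n", 4 + (2 + 6));   // 12

    // Subtraction is not: the two groupings disagree
    printf("(4 - 2) - 6 = %d\n", (4 - 2) - 6);   // -4
    printf("4 - (2 - 6) = %d\n", 4 - (2 - 6));   // 8

    return 0;
}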

How Reductions Work

At its most basic, a reduction is completed by a group of threads operating on a shared set of elements. At each step of the reduction, half of the threads apply the associative reduction function to their own element in the array and one other element in the array. The position of that other element is determined by the layout of the reduction, but regardless of the order in which the elements are combined, the number of active elements remaining is halved after each pass of the reduction algorithm.

The image above shows the typical reduction process as it applies to threads within a single block: each thread combines two elements, determined by its thread ID, into a single element that is stored in the position of the left operand of the reduction function. The process then repeats with half as many threads as before until only a single thread remains, at which point the result of the reduction is stored in element 0 of the array (or whatever data structure the reduction was performed on).

The number of elements between each thread's operands is known as the stride of the reduction. The stride is a function of both the array size and the number of passes that have already occurred. In the implementation shown above, which is the version used in this tutorial, the stride is equal to half the number of active elements in each stage, meaning that a 16-element array will have a stride of 8 in its first pass, then a stride of 4 in its second pass, and so on until the stride reaches 0, signaling that the reduction is complete.
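To make the stride pattern concrete, here is a small sequential sketch (written just for illustration, using a hypothetical 16-element array) that performs the same strided, pairwise combination on the host and prints the stride used in each pass.

#include <stdio.h>

int main() {
    // Hypothetical input: the numbers 1 through 16, which sum to 136
    int array[16] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
    int numElements = 16;

    // The stride starts at half the array size and halves after every pass: 8, 4, 2, 1
    for (int stride = numElements / 2; stride > 0; stride /= 2) {
        printf("stride = %d\n", stride);

        // Each "thread" i combines its own element with the one a stride away
        for (int i = 0; i < stride; i++)
            array[i] += array[i + stride];
    }

    printf("sum = %d\n", array[0]);   // prints 136
    return 0;
}

On the GPU, the inner loop is what gets replaced by a block of threads running in parallel, as shown in the next section.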

One-block Reduction Implementation

Now that we've covered the concept of a reduction, let's put the idea into runnable CUDA code by implementing an addition reduction kernel that reduces an array using a single block of threads.

// Reduce `array` to a single sum in array[0] using a single thread block.
// Assumes the kernel is launched with blockDim.x == arraySize / 2 and that
// arraySize is a power of two.
__global__ void block_add_reduce(int *array, int arraySize) {
    int id = threadIdx.x;

    // Start with a stride of half the array (one element per thread) and
    // halve it after every pass until it reaches 0
    for (int stride = blockDim.x; stride > 0; stride >>= 1) {
        // Only the first `stride` threads do work in this pass
        if (id < stride)
            array[id] += array[id + stride];

        // Wait for every thread in the block before starting the next pass
        __syncthreads();
    }
}

This is our entire block reduction kernel. If it seems simple, that's because it is. We set the initial stride to the number of threads in the block; because our driver code launches the kernel with a block half the size of the array, the stride also starts at half the number of elements in the array. On every iteration, each thread whose ID is less than the stride adds the element one stride away to its own element. The stride is then shifted right by one bit, which divides it by two.

The last statement in the loop, __syncthreads(), synchronizes all of the threads in a single block of a kernel. We do this because, unlike the threads within a warp, the warps that make up a block are not guaranteed to run in lockstep and will often overlap one another in execution, which can cause a race condition if not handled appropriately. By synchronizing at the end of each iteration, we ensure that no thread advances to the next iteration of the loop before all of the threads have completed the current one.
If you're a bit confused by this talk of race conditions and synchronization, fear not! It should become a lot clearer in the next article on the basics of asynchronous execution.

Another thing to note about this kernel is that it is somewhat fragile: it only works when the array size is a power of two, and because it is bound by the number of threads that can fit in a single block, it can only handle arrays up to twice the maximum block size. More robust reduction algorithms can and do exist, but there is much more to dig into in those versions than could appropriately be covered at this level.

The rest of the one-block reduction program is pretty simple, so we won't go into detail, but the full source can be found here if you want to run it for yourself.
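To give a sense of what that driver code might look like, here is a minimal host-side sketch. It is an illustrative version written for this article rather than the exact linked program; it assumes the block_add_reduce kernel above is defined in the same file, that the array has 16 elements, and that the kernel is launched with a single block of arraySize / 2 threads, as described earlier.

#include <cstdio>
#include <cuda_runtime.h>

// block_add_reduce (shown above) is assumed to be defined in this file

int main() {
    const int arraySize = 16;                  // must be a power of two
    int hostArray[arraySize];
    for (int i = 0; i < arraySize; i++)
        hostArray[i] = i + 1;                  // 1 + 2 + ... + 16 = 136

    // Allocate device memory and copy the input array to the GPU
    int *deviceArray;
    cudaMalloc(&deviceArray, arraySize * sizeof(int));
    cudaMemcpy(deviceArray, hostArray, arraySize * sizeof(int), cudaMemcpyHostToDevice);

    // Launch one block with half as many threads as there are elements
    block_add_reduce<<<1, arraySize / 2>>>(deviceArray, arraySize);

    // The reduced value ends up in element 0 of the array
    int result;
    cudaMemcpy(&result, deviceArray, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(deviceArray);

    printf("Sum: %d\n", result);
    return 0;
}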

Next: Matrix Multiplication