Basic Synchronization Methods
Previous: Performance Experiment: Multi-stream Parallelism
So far, we have not taken an in-depth look at what happens when we use functions like cudaStreamSynchronize and __syncthreads, nor fully explained why we used them where we did. In addition, there is another type of synchronization, device synchronization, which we have not yet seen at all but will soon become more familiar with.
In the reduction tutorial, we used a single block of threads to sum all of the elements of an array in parallel, and along the way it was briefly mentioned that we must use __syncthreads to ensure that one step is fully complete before the next begins.
While this is all true, it is not the full story. The purpose of __syncthreads is to create a barrier for all of the threads within the block from which it is called. Barriers synchronize groups of asynchronous workers: each worker waits at the barrier until every other worker in the group has arrived before continuing. In this case, the workers being synchronized are the warps within the block running the reduction, since instructions can only be issued at the warp level, and the group is the set of all warps in that block. When placed after each iteration of a reduction, __syncthreads ensures that all threads have finished processing that iteration, and that all intermediate sums from that level are complete, before the next level begins.
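To make this concrete, here is a minimal sketch of such a single-block reduction. The kernel name and parameters are illustrative, and it assumes the input length n is no larger than the block size and that the block size is a power of two.

```cuda
#include <cuda_runtime.h>

// Single-block reduction: one block of threads sums n elements of d_in into
// *d_out. Assumes n <= blockDim.x and that blockDim.x is a power of two.
__global__ void block_reduce(const float *d_in, float *d_out, int n)
{
    extern __shared__ float partial[];      // one slot per thread

    int tid = threadIdx.x;
    partial[tid] = (tid < n) ? d_in[tid] : 0.0f;
    __syncthreads();                        // all loads finished before summing

    // Each iteration halves the number of active threads.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                    // every warp in the block finishes
                                            // this level before the next begins
    }

    if (tid == 0)
        *d_out = partial[0];
}

// Example launch: one block of 256 threads and 256 floats of shared memory.
// block_reduce<<<1, 256, 256 * sizeof(float)>>>(d_in, d_out, n);
```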
This type of barrier differs from other forms of synchronization, such as stream synchronization, in that it only occurs inside a kernel. Blocks only exist over the lifespan of an individual kernel, so that is the only place where they can be synchronized.
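For contrast, a host-side stream synchronization looks roughly like the sketch below; the kernel and buffer names are placeholders. The barrier here is between the host thread and the work queued in one stream, not between threads inside a kernel.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel; stands in for any asynchronous device work.
__global__ void some_kernel(float *data) { /* ... */ }

void run(float *d_data, float *h_data, size_t bytes)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    some_kernel<<<64, 256, 0, stream>>>(d_data);        // asynchronous launch
    cudaMemcpyAsync(h_data, d_data, bytes,
                    cudaMemcpyDeviceToHost, stream);    // also asynchronous

    // Barrier between the host thread and this stream: the call returns only
    // after every operation queued in the stream has completed.
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
}
```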
It is also important to emphasize that __syncthreads can only synchronize a single block, not multiple blocks in a grid. Until recently, no explicit method existed to synchronize multiple blocks during a kernel's execution; the only way to synchronize blocks in the middle of a kernel was to divide the work into separate kernels and run them sequentially. CUDA 9.0 introduced Cooperative Groups, which allow multiple blocks to be synchronized during kernel execution, but that's a topic for another time.
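As a rough sketch of that pre-Cooperative-Groups approach (the kernel names here are hypothetical), splitting the work into two kernels launched into the same stream gives an implicit grid-wide synchronization point between them; the cudaDeviceSynchronize call at the end is the device synchronization mentioned earlier.

```cuda
#include <cuda_runtime.h>

// Placeholder kernels; phase_two must not start anywhere on the grid until
// phase_one has finished everywhere.
__global__ void phase_one(float *data) { /* ... */ }
__global__ void phase_two(float *data) { /* ... */ }

void run_phases(float *d_data, int num_blocks, int threads_per_block)
{
    // Kernels launched into the same (default) stream execute in order, so
    // the boundary between the two launches acts as a grid-wide barrier.
    phase_one<<<num_blocks, threads_per_block>>>(d_data);
    phase_two<<<num_blocks, threads_per_block>>>(d_data);

    // Device synchronization: the host blocks until all work on the GPU,
    // in every stream, has completed.
    cudaDeviceSynchronize();
}
```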