Basic Synchronization Methods
Previous: Performance Experiment: Multi-stream Parallelism
So far, we have not taken an in-depth look at what happens when we use functions like cudaStreamSynchronize and __syncthreads, nor have we fully explained why we used them where we did. In addition, there is another type of synchronization, device synchronization, which we have not yet seen at all but will soon become more familiar with.
In the reduction tutorial, we used a single block of threads to sum all of the elements of an array in parallel, and along the way, it was briefly mentioned that we must use __syncthreads in order to ensure that one step is fully complete before the next begins.
While this is all true, it is not the full story. The purpose of __syncthreads is to create a barrier for all of the threads within the block it is called from. Barriers synchronize groups of asynchronous workers: each worker waits at the barrier until every other worker in the group has arrived before continuing. In this case, the workers being synchronized are the warps within the block running the reduction, since instructions can only be issued at the warp level, and the group is the set of all warps within the block. When placed after each iteration of a reduction, __syncthreads ensures that all threads have finished processing that iteration, and that all intermediate sums are complete, before the next level of processing begins.
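To make the role of the barrier concrete, here is a minimal sketch of such a reduction. The kernel name and signature are illustrative rather than the exact code from the reduction tutorial, and it assumes the block size is a power of two and the array fits within a single block:

__global__ void reduceSum(float *data, int n)
{
    extern __shared__ float partial[];
    int tid = threadIdx.x;

    // Each thread loads one element into shared memory (0 if out of range)
    partial[tid] = (tid < n) ? data[tid] : 0.0f;
    __syncthreads();

    // Each iteration halves the number of active threads
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        // Barrier: no thread may begin the next iteration until
        // every warp in the block has finished this one
        __syncthreads();
    }

    if (tid == 0)
        data[0] = partial[0];
}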
This type of barrier differs from other forms of synchronization, such as stream synchronization, in that it only occurs inside a kernel. Blocks only exist over the lifespan of an individual kernel, so that is the only place where they can be synchronized.
It is also important to emphasize that __syncthreads can only synchronize a single block, not multiple blocks in a grid. Until recently, no explicit method existed to synchronize multiple blocks during a kernel's execution; the only way to synchronize blocks in the middle of a kernel was to divide the work into multiple separate kernels and run them sequentially. CUDA 9.0 introduced Cooperative Groups, which allow multiple blocks to be synchronized during kernel execution, but that's a topic for another time.
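To make that older workaround concrete, consider a computation with two phases, where every block of the second phase depends on results produced by every block of the first. The kernel names here are hypothetical:

// Kernels launched into the same stream run sequentially, so the
// boundary between them acts as a grid-wide barrier: every block
// of phaseOne finishes before any block of phaseTwo starts
phaseOne<<<gridSize, blockSize>>>(d_data);
phaseTwo<<<gridSize, blockSize>>>(d_data);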
Now that we have reviewed the GPU's internal synchronization methods, let's take a look at the options for external synchronization provided by the CUDA API.
The first one should already be somewhat familiar, given that we used it in both the CUDA Streams article and the multi-stream performance experiment: explicit stream synchronization.
The exact term used may differ from author to author, but for the purposes of our series, stream synchronization refers to the act of causing the default stream to wait until the work of one or more non-default streams has been completed.
Explicit stream synchronization refers to the use of cudaStreamSynchronize to cause stream synchronization, as opposed to implicit causes of synchronization such as synchronous memory transfers or kernel launches placed after asynchronous operations.
As in the case of block synchronization, stream synchronization creates a barrier at the point in the program that causes the synchronization, forcing the default stream to wait until the other stream has completed before continuing.
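In code, explicit stream synchronization might look like the following sketch, where myKernel, gridSize, blockSize, and d_data are placeholders:

cudaStream_t stream;
cudaStreamCreate(&stream);

// Work queued in a non-default stream runs asynchronously
// with respect to the host
myKernel<<<gridSize, blockSize, 0, stream>>>(d_data);

// Barrier: block here until everything queued in stream has
// completed, then let the default stream's work proceed
cudaStreamSynchronize(stream);

cudaStreamDestroy(stream);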
Note that throughout this section on stream synchronization we always mention the default stream as the stream that is waiting on other streams to complete. This is because the default stream is the only stream that can be made to synchronize with other streams in this way.
If two non-default streams need to be synchronized using explicit synchronization, then a call to cudaStreamSynchronize would need to be initiated for each stream to make sure they are synchronized before the default stream can continue and add new tasks to the streams.
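For example, with two hypothetical non-default streams streamA and streamB:

// One call per stream: the host waits on streamA, then on streamB
cudaStreamSynchronize(streamA);
cudaStreamSynchronize(streamB);
// Both streams are now idle, so any work the default stream
// issues from here on cannot overlap with the work above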
The CUDA runtime API also includes Events, which bring with them ways to indicate dependencies between two streams that do not require synchronization by the default stream in order to resolve. Events will be covered in detail in the next tutorial.
What should we do if we want all currently running streams to stop before entering a critical section in our code? One option would be to have the default stream synchronize with each stream sequentially until all streams reach the end of their execution, but there is another alternative: device synchronization.
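The sequential option might look like this, assuming a hypothetical array streams holding numStreams stream handles:

// Wait on each stream in turn; the host does not continue
// until every stream has drained
for (int i = 0; i < numStreams; i++)
    cudaStreamSynchronize(streams[i]);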
Device synchronization causes all of the streams currently running, including the default stream, to become synchronized before progressing. Because it synchronizes every thread across the GPU, device synchronization is expensive in terms of time, but it may carry less total overhead than several repeated stream synchronization calls. This is something we will look at in more detail in the synchronization methods performance experiment.
To initiate device synchronization, a single parameterless CUDA API function is used like so:
myKernel<<<gridSize, blockSize>>>();
// Waits on all running threads to complete
cudaDeviceSynchronize();
// Continue host execution...
Inserting that function call before a host operation ensures that every thread on the GPU has completed execution before the host continues. Because its use is so simple, we won't create a whole example just to show it off in this article, but it is part of the synchronization methods experiment mentioned earlier.