
CUDA Streams


Previous: Intro to Asynchronous Computing


In the last article, we went over the basic concept of asynchronous computing, most importantly that it allows the GPU to perform multiple tasks in parallel with one another.

In this article, we're going to take advantage of asynchronous processing to perform multiple matrix multiplication operations in parallel and observe the speedup we get from doing so.

Problem Setup

For this problem, we will use the same matrix multiplication kernel introduced in the matrix multiplication tutorial. In case you didn't read that tutorial or need a refresher, here are the basics of the algorithm we're going to be using (a sketch of both kernels follows the list):

  • The kernel is launched using 2D blocks and grids
  • Each thread corresponds to a single element in the output matrix
  • The x and y indices of each thread determine the row and column of the output element
  • That row and column in turn determine which input values to use
  • Each thread computes and stores the dot product of its particular row/column combination
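
For reference, here is a minimal sketch of what those kernels might look like. This is an illustration, not necessarily the exact code from the matrix multiplication tutorial; in particular, the BLOCK_LENGTH value and the mapping of the x index to rows are assumptions chosen to match the launch configurations used below.

#define BLOCK_LENGTH 16

// Sketch: each thread writes one element of a rows x cols matrix
__global__ void matrix_fill(float *M, int rows, int cols, float value) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols)
        M[row * cols + col] = value;
}

// Sketch: each thread computes one element of C = A x B, where A is
// rows x vectorLength and B is vectorLength x cols
__global__ void matrix_mul(float *A, float *B, float *C, int rows, int cols, int vectorLength) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols) {
        float dot = 0.0f;
        for (int i = 0; i < vectorLength; i++)
            dot += A[row * vectorLength + i] * B[i * cols + col];
        C[row * cols + col] = dot;
    }
}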

Synchronous Execution

Because the purpose of this tutorial is to highlight the advantages of asynchronous execution as well as how to achieve it, our example will include two separate matrix multiplication problems. The synchronous version is therefore very similar to the original matrix multiplication example, just with double the inputs, outputs, and kernel launches.

We will also add events, as we have done before, so that we can capture the amount of time the synchronous version of the function takes.

Let's also wrap the whole thing in a function that returns a float, so that we have an independent function that times itself and returns the time it takes, rather than trying to do all of that from the program driver directly.

float syncDoubleMatrix(int rows, int cols, int vectorLength) {

}

Inside of this function, we will first allocate two input matrices and an output matrix for each multiplication problem.

float syncDoubleMatrix(int rows, int cols, int vectorLength) {
    float *A, *B, *C; // A and B are inputs, C is output
    cudaMalloc(&A, sizeof(float)*rows*vectorLength);
    cudaMalloc(&B, sizeof(float)*cols*vectorLength);
    cudaMalloc(&C, sizeof(float)*rows*cols);

    float *A2, *B2, *C2; // A2 and B2 are inputs, C2 is output
    cudaMalloc(&A2, sizeof(float)*rows*vectorLength);
    cudaMalloc(&B2, sizeof(float)*cols*vectorLength);
    cudaMalloc(&C2, sizeof(float)*rows*cols);

Then we simply fill the input matrices and launch our two kernels as normal. Well, almost. Because we want to compare timings, we run the process twice: once before timing, to prevent cold-start issues from affecting our measured time, and once when we actually use the events to time the operation.
Additionally, we have added a call to cudaStreamSynchronize so that the CPU waits on the fill operations and the dry run of the matrix multiplication before starting the timer and running the real test, ensuring those operations don't affect our times. This will be explained in more depth soon.

    // Fill input matrices with 1s
    matrix_fill<<<dim3(rows / BLOCK_LENGTH + 1, vectorLength / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(A, rows, vectorLength, 1);
    matrix_fill<<<dim3(vectorLength / BLOCK_LENGTH + 1, cols / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(B, vectorLength, cols, 1);

    matrix_fill<<<dim3(rows / BLOCK_LENGTH + 1, vectorLength / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(A2, rows, vectorLength, 1);
    matrix_fill<<<dim3(vectorLength / BLOCK_LENGTH + 1, cols / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(B2, vectorLength, cols, 1);

    // Do a dry run to warm up the GPU
    matrix_mul<<<dim3(rows / BLOCK_LENGTH + 1, cols / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(A, B, C, rows, cols, vectorLength);
    matrix_mul<<<dim3(rows / BLOCK_LENGTH + 1, cols / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(A2, B2, C2, rows, cols, vectorLength);

    // Wait on fill and dry-run operations to complete before timing
    cudaStreamSynchronize(0);

    // Start the timer and do it for real
    matrix_mul<<<dim3(rows / BLOCK_LENGTH + 1, cols / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(A, B, C, rows, cols, vectorLength);
    matrix_mul<<<dim3(rows / BLOCK_LENGTH + 1, cols / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(A2, B2, C2, rows, cols, vectorLength);

    // Wait on timed operations to complete before stopping the timer
    cudaStreamSynchronize(0);
}

Asynchronous Execution Using Streams

So now we have the basic function that we want to make asynchronous, but how do we actually go about making sure that the second kernel runs in parallel with the first?

The answer to this lies in streams. In CUDA nomenclature, a stream is an independent line of execution: a queue of tasks that execute sequentially, in the order they were issued, independently of the tasks in other streams. By separating tasks into separate streams, we allow them to happen concurrently, which is how we enable our programs to become asynchronous. If one or more of our tasks do depend on other tasks, we can put those tasks in the same stream, because all of the tasks in any individual stream are still executed sequentially.

For this reason, a program is not asynchronous just because it uses a stream. In fact, we have already demonstrated this in all of our prior programs: though you likely did not know it at the time, all of those programs were running in stream 0, CUDA's default stream. Unless explicitly specified, all GPU operations in a CUDA program run on stream 0, which causes them to be executed sequentially. To take advantage of the asynchronous execution CUDA offers, we need to create a new stream to run our asynchronous tasks in.
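
To make this concrete, here is a minimal, self-contained sketch. The kernels produce, consume, and unrelated are hypothetical placeholders, not part of this tutorial's code: consume depends on produce's output, so the two share a stream and run in order, while unrelated is free to overlap in a second stream.

// Hypothetical kernels, used only to illustrate stream placement
__global__ void produce(float *buf)   { buf[threadIdx.x] = threadIdx.x; }
__global__ void consume(float *buf)   { buf[threadIdx.x] *= 2.0f; }
__global__ void unrelated(float *buf) { buf[threadIdx.x] = 1.0f; }

int main() {
    float *d_a, *d_b;
    cudaMalloc(&d_a, 32 * sizeof(float));
    cudaMalloc(&d_b, 32 * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Dependent tasks go in the same stream: within a stream,
    // tasks execute in the order they were issued
    produce<<<1, 32, 0, s1>>>(d_a);
    consume<<<1, 32, 0, s1>>>(d_a);

    // An independent task goes in its own stream so it can overlap with s1's work
    unrelated<<<1, 32, 0, s2>>>(d_b);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}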

Let's start by copying the contents of our syncDoubleMatrix function into a new asyncDoubleMatrix function:

float asyncDoubleMatrix(int rows, int cols, int vectorLength) {
    float *A, *B, *C; // A and B are inputs, C is output
    cudaMalloc(&A, sizeof(float)*rows*vectorLength);
    cudaMalloc(&B, sizeof(float)*cols*vectorLength);
    cudaMalloc(&C, sizeof(float)*rows*cols);

    float *A2, *B2, *C2; // A2 and B2 are inputs, C2 is output
    cudaMalloc(&A2, sizeof(float)*rows*vectorLength);
    cudaMalloc(&B2, sizeof(float)*cols*vectorLength);
    cudaMalloc(&C2, sizeof(float)*rows*cols);

    // Fill input matrices with 1s
    matrix_fill<<<dim3(rows / BLOCK_LENGTH + 1, vectorLength / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(A, rows, vectorLength, 1);
    matrix_fill<<<dim3(vectorLength / BLOCK_LENGTH + 1, cols / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(B, vectorLength, cols, 1);

    matrix_fill<<<dim3(rows / BLOCK_LENGTH + 1, vectorLength / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(A2, rows, vectorLength, 1);
    matrix_fill<<<dim3(vectorLength / BLOCK_LENGTH + 1, cols / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(B2, vectorLength, cols, 1);

    // Do a dry run to warm up the GPU
    matrix_mul<<<dim3(rows / BLOCK_LENGTH + 1, cols / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(A, B, C, rows, cols, vectorLength);
    matrix_mul<<<dim3(rows / BLOCK_LENGTH + 1, cols / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(A2, B2, C2, rows, cols, vectorLength);

    // Wait on previous operations to complete before starting the timer
    cudaStreamSynchronize(0);

    // Start the timer and do it for real
    matrix_mul<<<dim3(rows / BLOCK_LENGTH + 1, cols / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(A, B, C, rows, cols, vectorLength);
    matrix_mul<<<dim3(rows / BLOCK_LENGTH + 1, cols / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(A2, B2, C2, rows, cols, vectorLength);

    // Wait on timed operations to complete before stopping the timer
    cudaStreamSynchronize(0);
}

Now all we have to do is create a second stream for the second kernel to run on, and then tell it to actually run on that stream. We also have to do some cleanup afterwards, but we'll get to that later.

    cudaStream_t secondStream;
    cudaStreamCreate(&secondStream);

These two lines are the first steps to using streams in any CUDA program. Before a stream is valid, it must be declared as type cudaStream_t and then it must be initialized using cudaStreamCreate. Failing to initialize a stream before attempting to run a task on it will likely cause your program to crash.
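
Like most CUDA API calls, cudaStreamCreate also returns a cudaError_t, so one way to catch a failed stream creation early is a check along these lines (a sketch, assuming stdio.h is included; this check is not part of the tutorial's code):

    cudaStream_t secondStream;
    cudaError_t err = cudaStreamCreate(&secondStream);
    if (err != cudaSuccess) {
        // Report the failure and bail out of our float-returning function
        fprintf(stderr, "cudaStreamCreate failed: %s\n", cudaGetErrorString(err));
        return -1.0f;
    }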

With our stream initialized, we can make two other minor modifications in order to get our asynchronous function up and running.
In both launches of our secondary multiplication kernel, we need to add two more parameters within the triple angle brackets:

matrix_mul<<<dim3(rows / BLOCK_LENGTH + 1, cols / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH), 0, secondStream>>>(A2, B2, C2, rows, cols, vectorLength);

The third value, 0, sets the amount of shared memory we would like the kernel to use, in bytes. Since our matrix multiplication kernel does not use shared memory, we can set this to 0.

The fourth value is where we can specify the stream that we want the kernel to run on. By default, every kernel runs on stream 0, but you can designate the kernel to run on any other stream you have created simply by passing the intended stream as the fourth parameter to the kernel configuration, in this case secondStream.

Stream Cleanup

One other change we have to make is the addition of another call to cudaStreamSynchronize right after the first one:

// Wait for streams 
cudaStreamSynchronize(0);
cudaStreamSynchronize(secondStream);

By issuing a call to cudaStreamSynchronize with a stream of our choice (the stream passed as a parameter), we cause the CPU to wait for that stream to complete its current queue of tasks before continuing.

In the synchronous function, we synchronize the CPU with stream 0 because we want to ensure we record clean execution times, and because we want to ensure that everything completes before exiting the function. Thus, we synchronize with the default stream, which is the stream that our kernels are executing in, before letting the CPU act on its next instruction.

In the asynchronous function, we repeat this process with the second stream for the same reasons: to ensure that the previously issued tasks complete before proceeding to the timed portion, and to ensure that all of its tasks are complete before stopping the timer.

In both cases, the two entities being synchronized are the CPU and a GPU stream; there is no way to explicitly synchronize two GPU streams using the stream synchronize function. If stream synchronization is to be done using the cudaStreamSynchronize function, both streams must be synchronized with the host before continuing. Soon we will cover methods that are more efficient than stream synchronization which do allow for direct stream-to-stream synchronization.
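
As a preview of those methods: the CUDA runtime's cudaStreamWaitEvent call makes one stream wait on an event recorded in another stream, without blocking the CPU. A minimal sketch reusing the variables from our function (firstKernelDone is a placeholder name introduced here):

    cudaEvent_t firstKernelDone;
    cudaEventCreate(&firstKernelDone);

    matrix_mul<<<dim3(rows / BLOCK_LENGTH + 1, cols / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(A, B, C, rows, cols, vectorLength);
    cudaEventRecord(firstKernelDone, 0);                   // marker in the default stream
    cudaStreamWaitEvent(secondStream, firstKernelDone, 0); // secondStream waits on the marker; the CPU does not
    matrix_mul<<<dim3(rows / BLOCK_LENGTH + 1, cols / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH), 0, secondStream>>>(A2, B2, C2, rows, cols, vectorLength);

Note that this sketch deliberately serializes the two kernels, the opposite of what we want in this article; it is only meant to show the mechanism.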

Synchronizations can cause very large delays that eat GPU and CPU time if you are not careful, so you should try to use them as little as possible and to wait until the last viable moment to do so when necessary.

Finally, at the very end of our asynchronous function, we need to add one more API call. This time, we are going to be deallocating our stream:

cudaStreamDestroy(secondStream);

Just as you allocate and deallocate memory using malloc and free, streams are allocated via cudaStreamCreate and deallocated via cudaStreamDestroy. Your programs will run without deallocating streams, just as they will without deallocating memory, but that does not make it any less important to do so in either case.

Adding these four changes (stream creation & initialization, modified kernel configuration, stream synchronization, and stream deallocation), we have successfully converted our synchronous double matrix multiplier into an asynchronous one, which will hopefully be able to do the task twice as fast!

Timing GPU Functions Using Events

While we have mentioned timing these two matrix multiplication functions to see how they perform against one another, we have not yet actually implemented any measures to do so. C and C++ do have timing functions, but they measure time on the CPU, not the GPU; because kernel launches return control to the CPU immediately, a CPU timer can stop before the GPU work has actually finished, giving inaccurate readings.
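
To see why, consider this sketch (some_kernel is a placeholder, not part of the tutorial's code): because a kernel launch returns control to the CPU immediately, a CPU timer wrapped around the launch measures only the launch overhead, not the kernel's execution.

#include <chrono>
#include <cstdio>

__global__ void some_kernel(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = i * 2.0f;
}

int main() {
    const int N = 1 << 20;
    float *d_buf;
    cudaMalloc(&d_buf, N * sizeof(float));

    auto start = std::chrono::high_resolution_clock::now();
    some_kernel<<<(N + 255) / 256, 256>>>(d_buf, N);    // returns immediately
    auto stop = std::chrono::high_resolution_clock::now();

    // The kernel may still be running here, so this time is misleadingly small
    printf("CPU-timed launch: %f ms\n",
           std::chrono::duration<float, std::milli>(stop - start).count());

    cudaDeviceSynchronize();
    cudaFree(d_buf);
    return 0;
}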

Instead, we are going to use CUDA's Event objects to record the time it takes for both our synchronous and asynchronous functions to complete.

Events can be used in a few different contexts, which we will explore further on in this series, but in all of them they are used as markers for specific points in time where something interesting has happened. In addition to marking a specific point in time, each event can also be tied to a specific stream, which can be used to allow each stream to more easily sort through which events it cares about, and which ones it does not.

Here we are going to be using events solely to time the beginning and end of our matrix multiplication functions. After both of the events have been recorded, we can then calculate the amount of time that elapses between those two event markers to determine how long each function takes.

To see how events work, let's add them to the asyncDoubleMatrix function we just completed:

float asyncDoubleMatrix(int rows, int cols, int vectorLength) {
    float *A, *B, *C; // A and B are inputs, C is output
    cudaMalloc(&A, sizeof(float)*rows*vectorLength);
    cudaMalloc(&B, sizeof(float)*cols*vectorLength);
    cudaMalloc(&C, sizeof(float)*rows*cols);

    float *A2, *B2, *C2; // A2 and B2 are inputs, C2 is output
    cudaMalloc(&A2, sizeof(float)*rows*vectorLength);
    cudaMalloc(&B2, sizeof(float)*cols*vectorLength);
    cudaMalloc(&C2, sizeof(float)*rows*cols);

    cudaStream_t secondStream;
    cudaStreamCreate(&secondStream);

    cudaEvent_t clockStart, clockStop;
    cudaEventCreate(&clockStart);
    cudaEventCreate(&clockStop);

    // Fill input matrices with 1s
    matrix_fill<<<dim3(rows / BLOCK_LENGTH + 1, vectorLength / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(A, rows, vectorLength, 1);
    matrix_fill<<<dim3(vectorLength / BLOCK_LENGTH + 1, cols / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(B, vectorLength, cols, 1);

    matrix_fill<<<dim3(rows / BLOCK_LENGTH + 1, vectorLength / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(A2, rows, vectorLength, 1);
    matrix_fill<<<dim3(vectorLength / BLOCK_LENGTH + 1, cols / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(B2, vectorLength, cols, 1);

    // Do a dry run to warm up the GPU
    matrix_mul<<<dim3(rows / BLOCK_LENGTH + 1, cols / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(A, B, C, rows, cols, vectorLength);
    matrix_mul<<<dim3(rows / BLOCK_LENGTH + 1, cols / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH), 0, secondStream>>>(A2, B2, C2, rows, cols, vectorLength);

    // Wait on previous operations to complete before starting the timer
    cudaStreamSynchronize(0);
    cudaStreamSynchronize(secondStream);

    // Start the timer and do it for real
    cudaEventRecord(clockStart, 0);

        matrix_mul<<<dim3(rows / BLOCK_LENGTH + 1, cols / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(A, B, C, rows, cols, vectorLength);
        matrix_mul<<<dim3(rows / BLOCK_LENGTH + 1, cols / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH), 0, secondStream>>>(A2, B2, C2, rows, cols, vectorLength);

        // Wait on timed operations to complete before stopping the timer
        cudaStreamSynchronize(0);
        cudaStreamSynchronize(secondStream);

    cudaEventRecord(clockStop, 0);
    cudaEventSynchronize(clockStop);

    float timeElapsed;
    cudaEventElapsedTime(&timeElapsed, clockStart, clockStop);

    // Clean up the stream, events, and device memory before returning
    cudaStreamDestroy(secondStream);
    cudaEventDestroy(clockStart);
    cudaEventDestroy(clockStop);

    cudaFree(A); cudaFree(B); cudaFree(C);
    cudaFree(A2); cudaFree(B2); cudaFree(C2);

    return timeElapsed;
}

As with streams, the first thing we have to do is declare our cudaEvent_t objects and initialize them with cudaEventCreate, so that an actual event object is allocated for us to use any time we reference our events. In this case our events are named clockStart and clockStop because we will be using them to signal when we want to start and stop the timer for our matrix multiplications.

    cudaEvent_t clockStart, clockStop;
    cudaEventCreate(&clockStart);
    cudaEventCreate(&clockStop);

Next we mark the point in the execution of the stream that we want each event to be recorded as well as the stream we want each to be recorded under. In this case, we pass 0 as the second parameter to cudaEventRecord because we want both of the events to be recorded at points during the execution of the default stream.
If we wanted the events to be recorded in a different stream, we would pass that stream as the second parameter to the record function instead of 0.

    cudaEventRecord(clockStart, 0);

        matrix_mul<<<dim3(rows / BLOCK_LENGTH + 1, cols / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH)>>>(A, B, C, rows, cols, vectorLength);
        matrix_mul<<<dim3(rows / BLOCK_LENGTH + 1, cols / BLOCK_LENGTH + 1), dim3(BLOCK_LENGTH, BLOCK_LENGTH), 0, secondStream>>>(A2, B2, C2, rows, cols, vectorLength);

        cudaStreamSynchronize(0);
        cudaStreamSynchronize(secondStream);

    cudaEventRecord(clockStop, 0);

It's important to note that the recording of clockStop happens only after both streams have been synchronized, which forces us to wait until both streams have finished their work. If we were to instead record clockStop on secondStream before the synchronization, the time we record would likely be shorter than the true time it takes to perform both multiplications, since the only operation guaranteed to have completed would be the kernel in secondStream, not the one in the default stream. Similarly, recording clockStop on stream 0 before the synchronization would only ensure that the default stream's kernel has completed, and could capture the ending time while secondStream's kernel is still running. Only recording on stream 0 after both streams have been synchronized ensures that the time it takes for both multiplications to complete has been accounted for.

    cudaEventSynchronize(clockStop);

    float timeElapsed;
    cudaEventElapsedTime(&timeElapsed, clockStart, clockStop);

    cudaEventDestroy(clockStart);
    cudaEventDestroy(clockStop);

Just after we record clockStop, we also synchronize with the clockStop event. This ensures that everything preceding the event has occurred before we continue on. Due to the asynchronous nature of this function, if we did not take this precaution we would risk asking for the time elapsed between our events before the stop event has actually been recorded.

We can determine how much time has gone by between our two events (and thus how long our combined matrix multiplications have taken) using the cudaEventElapsedTime function. Like most CUDA API functions, its return value is reserved for an error code, so it passes results back through a pointer; this is why we declare the float timeElapsed and pass its address along with both of our events in order to get our completion time.

After we are done using our events to get the amount of time elapsed, we can deallocate them just as we did with the stream object, and then return timeElapsed from the function so that we can use it in our driver code; this is where the function's float return type finally pays off.

With that, all the heavy lifting is done. We can apply the exact same set of changes to the synchronous version of the function, and voilà! Both functions now return the amount of time they take to perform the double matrix multiplication, asynchronously and synchronously respectively.

We can use some fairly minimal driver code to choose reasonably large matrix dimensions, pass them to each function, and print the times for each:

int main(int argc, char *argv[]) {

    // Matrix dimensions; large enough to give the GPU a meaningful workload
    const int ROWS = 1024;
    const int COLS = 1024;
    const int VECTOR_LENGTH = 1024;

    float sync_time = syncDoubleMatrix(ROWS, COLS, VECTOR_LENGTH);
    printf("Synchronous completion time: %f\n", sync_time);

    float async_time = asyncDoubleMatrix(ROWS, COLS, VECTOR_LENGTH);
    printf("Asynchronous completion time: %f\n", async_time);

    return 0;
}

In my execution environment, the overall completion times (in milliseconds, which is what cudaEventElapsedTime reports) are fairly small, but the difference between synchronous and asynchronous execution is apparent:

Synchronous completion time: 4.624384
Asynchronous completion time: 4.051968

If you divide the asynchronous time by the synchronous time, you can see that the asynchronous version takes ~88% of the time that the synchronous version does, which is quite a significant improvement given the ease with which it can be implemented.

If you want to take a look at the full program for our synchronous vs asynchronous example, you can check it out here. Otherwise, you are ready to move on to the next article on asynchronous memory transfers.

Next: Asynchronous Memory Transfers