
Performance Experiment: Multi Stream Parallelism


Previous: Asynchronous Memory Transfers

In this experiment, our goal is to get an idea of how much performance we can gain by leveraging CUDA's asynchronous compute capabilities. We can do this by measuring the time taken to complete the same workload using a varying number of streams. The workload in question is a set of matrix multiplication operations which the GPU will balance as evenly as it can among an increasing number of streams.

The best part about this experiment is that it is built entirely from elements we have already seen at this point in our CUDA journey:

  • A matrix multiplication kernel
  • Asynchronous execution (streams)
  • Timing using events
  • Two-dimensional kernel configurations

Experiment

Let's jump straight into our main function:

int main(int argc, char *argv[]) {
    // Number of rows in output matrix
    const int ROWS = 32;

    // Number of columns in output matrix
    const int COLS = 16;

    // Length of aligned dimension in input matrices
    const int ALIGNED = 131072;

    // Length of each block axis - blocks contain 256 threads
    const int BLOCK_LENGTH = 16;

    // The max number of streams to run at once - and the number of matrices to multiply
    const int MAX_STREAMS = 16;

    // Number of times to repeat the full stream sweep
    const int NUM_TRIALS = 20;

This first section consists solely of constants that are referenced later on in the program, most of which are self-explanatory. The ALIGNED constant, however, might be perplexing - why is it so large, and what does it represent?

As in our synchronous matrix multiplication program, ROWS and COLS represent the dimensions of the output of the matrix multiplication; we need these values early on in order to configure the kernel correctly, since the kernel grid for the multiplication is based on them. ALIGNED represents the number of columns in the first input matrix, which must align with the number of rows in the second input matrix, hence the name.
As for its size, recall from the synchronous matrix multiplication program that the length of the aligned axis also controls the number of iterations inside the multiplication algorithm. By making ALIGNED large, we make each computation long enough to easily discern any differences caused by the use of additional streams. (With these values, each problem's two inputs occupy 16 MiB and 8 MiB of device memory, or about 384 MiB across all 16 problems.)
The other values are somewhat arbitrary and may need to be tweaked to match your device's capabilities. The GPU this program was tested on has about six thousand execution cores, so the values of ROWS, COLS, and BLOCK_LENGTH were chosen to stay within reason of that limit so that no time is spent waiting on GPU resources to become free.
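If you want to size these constants for your own GPU, you can query its compute resources at runtime. Below is a minimal standalone sketch (not part of the original program) that prints the multiprocessor count and per-SM thread capacity; the number of cores per multiprocessor depends on the architecture, so treat the output as a rough guide:

#include <cstdio>

int main() {
    // Query the properties of the first CUDA device
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // SM count and per-SM thread capacity roughly bound how much
    // concurrent work the device can hold at once
    printf("%s: %d SMs, up to %d resident threads per SM\n",
           prop.name, prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor);

    return 0;
}

With the constants settled, the program next builds the kernel grids and fills the input matrices: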

    // One kernel grid per input matrix, sized to cover every element
    dim3 grid_A = dim3(ALIGNED / BLOCK_LENGTH + 1, ROWS / BLOCK_LENGTH + 1);
    dim3 grid_B = dim3(COLS / BLOCK_LENGTH + 1, ALIGNED / BLOCK_LENGTH + 1);

    dim3 blockSize = dim3(BLOCK_LENGTH, BLOCK_LENGTH);

    // Initialize and fill two input and one output matrix per multiplication problem
    float *d_A[MAX_STREAMS], *d_B[MAX_STREAMS], *d_C[MAX_STREAMS];
    for (int i=0; i<MAX_STREAMS; i++) {
        cudaMalloc(&(d_A[i]), sizeof(float)*ROWS*ALIGNED);
        cudaMalloc(&(d_B[i]), sizeof(float)*ALIGNED*COLS);
        cudaMalloc(&(d_C[i]), sizeof(float)*ROWS*COLS);

        matrix_fill<<<grid_A, blockSize>>>(d_A[i], ROWS, ALIGNED, 1);
        matrix_fill<<<grid_B, blockSize>>>(d_B[i], ALIGNED, COLS, 1);
    }

Next we allocate a series of arrays (flattened matrices) so that there are enough separate input and output matrices for MAX_STREAMS (in this case, 16) different multiplication problems.
The grid calculation at the top of the code block should be fairly intuitive, but just in case it isn't: we are sizing the kernel grid to accommodate every element of matrices A and B. A has dimensions ROWS x ALIGNED, so we divide each dimension by BLOCK_LENGTH to determine how many blocks of that length are needed to cover all of its elements; the same is done for matrix B. The 1 is then added as a safety measure: if an input dimension isn't evenly divisible by BLOCK_LENGTH, integer division rounds the quotient down, and the leftover elements would be missed. The extra block ensures this doesn't happen.
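As an aside, a common CUDA idiom gets the same safety without ever launching an unneeded extra block: rounding up with integer arithmetic. A sketch of the equivalent grid calculation:

    // Ceiling division: (n + d - 1) / d rounds up instead of down
    #define DIV_UP(n, d) (((n) + (d) - 1) / (d))

    dim3 grid_A = dim3(DIV_UP(ALIGNED, BLOCK_LENGTH), DIV_UP(ROWS, BLOCK_LENGTH));
    dim3 grid_B = dim3(DIV_UP(COLS, BLOCK_LENGTH), DIV_UP(ALIGNED, BLOCK_LENGTH));

The conditional form used later in main, which adds 1 only when the remainder is nonzero, is equivalent.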

The grid is then used to configure two launches of the matrix_fill kernel each loop iteration, which fill both input matrices of each problem with 1s. The matrix_fill kernel is only slightly different from the one used earlier in the series, so we won't cover it in detail, but here it is for your convenience:

// Fill matrix of given size with value
__global__ void matrix_fill(float *matrix, int rows, int cols, float value) {
    // The row of the output cell
    int i = blockDim.y * blockIdx.y + threadIdx.y;
    if (i >= rows) return;

    // The column of the output cell
    int j = blockDim.x * blockIdx.x + threadIdx.x;
    if (j >= cols) return;

    matrix[cols*i + j] = value;
}
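
The matrix_mul kernel launched in the timing loop below is the one from the earlier synchronous matrix multiplication walkthrough. It isn't reproduced on this page, but a minimal naive row-by-column sketch consistent with the launch arguments used here would look something like this:

// Multiply A (rows x aligned) by B (aligned x cols) into C (rows x cols)
__global__ void matrix_mul(float *A, float *B, float *C, int rows, int cols, int aligned) {
    // The row of the output cell
    int i = blockDim.y * blockIdx.y + threadIdx.y;
    if (i >= rows) return;

    // The column of the output cell
    int j = blockDim.x * blockIdx.x + threadIdx.x;
    if (j >= cols) return;

    // Dot product of row i of A and column j of B
    float sum = 0.0f;
    for (int k = 0; k < aligned; k++)
        sum += A[aligned*i + k] * B[cols*k + j];

    C[cols*i + j] = sum;
}

Note the loop over aligned: this is the iteration that the large ALIGNED value deliberately stretches out to make each kernel long-running.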

Returning to main, we then create a pool of streams that we will use for each run of the experiment, as well as the events we will use to time each run, and then print what will become the header of our eventual CSV output file:

    // Stream pool to be used in all timing tests
    cudaStream_t streams[MAX_STREAMS];
    for (int s=0; s<MAX_STREAMS; s++)
        cudaStreamCreate(&streams[s]);

    cudaEvent_t clockStart, clockStop;
    cudaEventCreate(&clockStart);
    cudaEventCreate(&clockStop);

    // Print CSV-style header row
    for (int i=0; i<MAX_STREAMS; i++) {
        printf("%d Stream",i+1);
        if (i>0)
            printf("s");
        // Newline after last item
        if (MAX_STREAMS-i==1)
            printf("\n");
        // Otherwise comma
        else
            printf(",");
    }
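
For MAX_STREAMS = 16 this produces a single header row (abbreviated here):

    1 Stream,2 Streams,3 Streams,4 Streams,...,16 Streams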

Finally we arrive at the portion of main that actually does the work:

    // Perform each set of tests NUM_TRIALS times so we can get an average measure of performance
    for (int trial=0; trial<NUM_TRIALS; trial++){

        // Iterate through number of streams used to observe performance impact
        for (int streamsUsed = 1; streamsUsed <= MAX_STREAMS; streamsUsed++) {

            // Grid for the multiplication kernel, sized to the output matrix
            dim3 gridSize = dim3(COLS / BLOCK_LENGTH, ROWS / BLOCK_LENGTH);
            gridSize.x += (COLS % BLOCK_LENGTH) ? 1 : 0;
            gridSize.y += (ROWS % BLOCK_LENGTH) ? 1 : 0;

            cudaEventRecord(clockStart, 0);

                for (int i=0; i<MAX_STREAMS; i++)
                    // Split problems between active streams -> streams[i%streamsUsed]
                    matrix_mul<<<gridSize, blockSize, 0, streams[i%streamsUsed]>>>(d_A[i], d_B[i], d_C[i], ROWS, COLS, ALIGNED);

                for (int s=0; s<streamsUsed; s++)
                    cudaStreamSynchronize(streams[s]);

            cudaEventRecord(clockStop, 0);

            float timeElapsed;
            cudaEventSynchronize(clockStop);
            cudaEventElapsedTime(&timeElapsed, clockStart, clockStop);

            // Print results in CSV format
            printf("%.4f", timeElapsed);
            if (streamsUsed==MAX_STREAMS)
                printf("\n");
            else
                printf(",");
        }
    }

Working from the inside out, the code in the body of our nested for loop:

  • Determines the appropriate size for the matrix_mul kernel grid using the method from earlier
  • Starts the clock
  • Launches the matrix_mul kernel MAX_STREAMS (16) times on various streams (this will make sense in a moment)
  • Synchronizes all of the involved streams
  • Stops the clock
  • Reads the time the work took and prints it out in CSV format

This measurement is then repeated MAX_STREAMS times, with each new iteration adding one more stream to help carry the workload. The tasks are divided among the active streams by the fourth parameter of the matrix multiplication kernel configuration, streams[i%streamsUsed]. The modulo operator keeps the stream index between 0 and streamsUsed - 1, barring access to the rest of the pool, and also divides the kernels among the active streams as evenly as possible.
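As a concrete example, with streamsUsed = 3 the sixteen problems land on streams like so:

    problem i:   0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    i % 3:       0 1 2 0 1 2 0 1 2 0  1  2  0  1  2  0

so stream 0 ends up with six kernels while streams 1 and 2 get five each.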

The outermost loop then repeats the whole sweep NUM_TRIALS times so that we have a decent sample of test results to use when performing our analysis.
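One housekeeping note before moving on: the listing never releases its resources. That is harmless here, since the process exits as soon as the experiment finishes, but in a longer-lived application you would tear everything down explicitly. A sketch of what that would look like at the end of main:

    // Release the streams, events, and device allocations
    for (int s=0; s<MAX_STREAMS; s++)
        cudaStreamDestroy(streams[s]);

    cudaEventDestroy(clockStart);
    cudaEventDestroy(clockStop);

    for (int i=0; i<MAX_STREAMS; i++) {
        cudaFree(d_A[i]);
        cudaFree(d_B[i]);
        cudaFree(d_C[i]);
    }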

Analysis

By running the full async_matrix_mul program and saving the results to a .csv file, we have effectively created the dataset we need to begin learning how asynchronous parallelism can further impact GPU performance. With a few quick lines of Python we can turn this data into a bar chart and begin examining the effect of increasing the number of streams dedicated to the same task.

We can immediately see that even 2 streams are a significant improvement over performing all 16 multiplications on a single stream, though not quite the twofold speedup you might expect. Adding more streams continues to help until we reach seven, which curiously seems to have no impact at all. The reason is that, up to this point, every new stream reduced the maximum number of kernels any single stream had to run. Going from one stream to two cuts the worst-case load from 16 kernels to 8; three streams cut it to 6; and so on, until we get to 7. With 6 streams the busiest stream runs 3 kernels, but with 7 streams the busiest stream still runs 3, so we end up waiting on those 3-kernel streams either way and measure an almost identical time. The same thing happens from 9 through 15 streams, where we observe another plateau because in each of those cases at least one stream still has 2 kernels to run. We see a final improvement at 16 streams, where every stream has exactly one kernel to complete and none has to wait on a second.
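The pattern is easiest to see by tabulating the busiest stream's load, which is the ceiling of 16 divided by the number of active streams:

    Streams used:   1    2    3    4    5    6    7    8    9-15    16
    Max kernels:    16   8    6    4    4    3    3    2    2       1

The measured time tracks the bottom row rather than the stream count itself.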

You may have noticed that the results for 5 and 15 streams were somewhat glossed over; this was to avoid confusion. Using 5 or 15 streams runs into the same plateau effect discussed above, but in these special cases only the first stream has to perform an extra task. Because that stream receives its work first and is the only one with extra work, its kernels start slightly earlier, so the streams can be synchronized slightly sooner, producing the small performance bump visible in the chart.

Overall, there are three big takeaways from this data:

  1. Using asynchronous execution in your CUDA programs can greatly improve performance
  2. The speedup gained by using multiple streams usually diminishes as the number of streams increases
  3. The number of streams used should take into account the size of the problem they are being applied to

Next: Basic Synchronization Methods