Performance Experiment: On GPU vs Off GPU Bandwidth
[Previous: Using CUDA Memory](Using CUDA Memory)
In this experiment, we are going to use the skills we learned in the previous two articles to determine the effective bandwidth of several kinds of memory transfers that CUDA can perform: copying memory inside a kernel; using cudaMemcpy for paged CPU memory, pinned CPU memory, and GPU-GPU transfers; and copying data from a CPU location to a unified memory location.
The timing process we want to implement for measuring each of the different transfer types is as follows:
- Allocate a sizeable chunk of memory of the relevant types for the sender and receiver.
- Record the time that the transfer starts
- Transfer the data
- Record the time that the transfer ends
To keep our testing accurate, we should also repeat each timing test a number of times so that our results are representative of the true time each transfer takes.
Because the testing process is essentially the same for each memory transfer type, only a single transfer test will be discussed in detail, but you can look at the entire memory timing program here.
Let's look at the portions of code that time the GPU-CPU pinned memory transfer. First, here's our driver code:
int main(int argc, char *argv[])
{
    int num_mibibytes = 128;
    int repetitions = 10;
    const int MI_B = 1048576; // One MiB or 2^20 bytes

    // Determine number of integer elements for each transfer based on desired memory transfer amount
    long numElements = num_mibibytes * MI_B / sizeof(int);

    // Add CSV file headers
    printf("type,size,unit,numAccesses");
    for (int rep=1; rep <= repetitions; rep++)
        printf(",run%d",rep);
    printf("\n");

    // Other tests...

    // Test #4: Pinned GPU-CPU memcpy capability
    float *pinnedMemcpyTimes = pinnedDtoH(numElements, repetitions);
    printf("pinnedDtoH,%d,MB,2",num_mibibytes);
    for (int rep=0; rep < repetitions; rep++)
        printf(",%f",pinnedMemcpyTimes[rep]);
    printf("\n");
    free(pinnedMemcpyTimes);

    // Other tests...

    return 0;
}
There is nothing terribly interesting in the driver code: we set the amount of memory we want to transfer and use that value to determine how large the integer arrays used for the tests should be, create column headers so that our CSV output is more readable, and finally run the testing function and print the time elapsed for each of its runs.
One item of note is the numAccesses column in our CSV file. It is needed so that we can accurately calculate bandwidth based on the time the transfer takes, because the transfer actually constitutes two memory accesses - a read from the GPU and a write to the CPU - and we have to account for that.
Now have a look at the timing code itself:
float *pinnedDtoH(long numElements, int repetitions) {
    // Events used to time each transfer on the GPU's timeline
    cudaEvent_t clockStart, clockStop;
    cudaEventCreate(&clockStart);
    cudaEventCreate(&clockStop);

    // Device source buffer and pinned (page-locked) host destination buffer
    int *d_input, *h_output;
    cudaMalloc(&d_input, numElements*sizeof(int));
    cudaMallocHost(&h_output, numElements*sizeof(int));

    float trialTime;
    float *results = (float *) malloc(sizeof(float)*repetitions);

    // Initial run through to avoid any cold-start outliers
    cudaMemcpy(h_output, d_input, numElements*sizeof(int), cudaMemcpyDeviceToHost);

    for (int rep=0; rep < repetitions; rep++)
    {
        cudaEventRecord(clockStart, 0);
        cudaMemcpy(h_output, d_input, numElements*sizeof(int), cudaMemcpyDeviceToHost);
        cudaEventRecord(clockStop, 0);
        cudaEventSynchronize(clockStop);
        // Elapsed time between the two events, in milliseconds
        cudaEventElapsedTime(&trialTime, clockStart, clockStop);
        results[rep] = trialTime;
    }

    cudaEventDestroy(clockStart);
    cudaEventDestroy(clockStop);
    cudaFree(d_input);
    cudaFreeHost(h_output); // pinned memory must be released with cudaFreeHost
    return results;
}
The first thing you may notice is the function's return type - a float pointer. The function returns an array of floats holding the time taken by each pinned GPU-CPU transfer, so that the driver code can output this information in the proper format.
The two parameters are the number of elements to transfer, which the driver code calculates from the desired memory transfer size, and the number of times we wish to repeat the test, which directly controls the size of the output array.
Next we encounter a part of the CUDA programming model we haven't yet covered: events. We won't go into detail on events here because they will be covered in more detail later on, but essentially events serve as place markers along the timeline of execution for both the GPU and the CPU. We can use two such place markers to determine the time it takes for CUDA kernels and API calls to complete using the cudaEventElapsedTime function, which reports the elapsed time in milliseconds and is how we obtain the time for each transfer in each run of this function.
You may also have noticed that we perform one seemingly pointless memory transfer before we start the timing tests. Its purpose is to "warm up" the hardware prior to the first timed run, because the first execution of an operation typically pays one-time setup costs and would otherwise appear as an outlier in our results.
After compiling the program, we can run it to get our results printed to the screen:
type,size,unit,numAccesses,run1,run2,run3,run4,run5,run6,run7,run8,run9,run10
kernelCopy,128,MB,2,0.003200,0.002592,0.002560,0.002560,0.002560,0.002560,0.002528,0.002560,0.002592,0.002560
memcpyDtoD,128,MB,2,1.542560,1.546624,1.543520,1.541920,1.542176,1.540736,1.542944,1.541280,1.541280,1.541760
pagedDtoH,128,MB,2,100.505180,13.557280,13.628864,13.905536,13.623616,13.848544,13.616352,13.728224,13.783104,13.718432
pinnedDtoH,128,MB,2,11.680608,11.608160,11.739264,12.016480,11.996864,12.033216,11.740512,12.086304,12.004992,12.327136
unifiedMemcpy,128,MB,2,18.262560,18.268831,18.123360,17.547104,18.448736,18.231968,19.074432,18.911808,19.994112,17.838465
For convenience, this information has been saved to results.csv which is available in the folder for this experiment in the repository.
Now that we have our experiment data set, which contains the time (in milliseconds) for each type of memory transfer to complete, let's determine an approximate bandwidth for each of the transfer types.
For this purpose, a Jupyter notebook containing ready-to-run Python data visualization code has been provided which takes the run times and calculates bandwidth using the formula below before creating box and bar plots showing the bandwidths of each memory transfer type.
We can calculate bandwidth according to the formula bandwidth = memoryAmt * numAccesses / runTime, where memoryAmt is the amount transferred in bytes and runTime is the measured time converted to seconds, giving a result in bytes per second. We then translate the bandwidth into GiB/s by dividing by 2^30, which is the number of bytes in a GiB.
After this conversion, we get the following visualization for all of our bandwidths:
The red line in each box plot marks the median value for each transfer type, which is the summary statistic we will use as our bandwidth estimate. The mean could also be used, but it is sensitive to the strong outliers that are common in this kind of data, so the median is a more representative descriptor of each set of values.
The notebook also produces a bar chart containing four of the five median bandwidth values to make comparisons easier between the different types. The kernel copy method is not included because it would skew the chart and make the other four values unreadable.
We can glean a number of valuable insights regarding best practices for memory usage from this data.
Perhaps the most obvious thing we can learn is that copying data inside the GPU using a kernel is much more efficient than any of the CUDA API methods - by a factor of more than 500! This is largely due to two things:
- The GPU has a very wide memory bus - generally 512 bits or more - and a very high bus frequency, meaning that in-GPU I/O will almost always outpace out-of-GPU I/O
- The kernel transfer method relies on compute threads to perform the work of reading and writing memory, whereas API functions use a form of Direct Memory Access to transfer data. DMA is great for transferring data while a kernel is running, but the concurrent memory accesses enabled by the massive parallelism of a kernel can't be beat when you need a fast, in-GPU memory transfer (a sketch of such a copy kernel follows below)
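The kernelCopy test itself was not shown above, but a copy kernel of the kind described in the second bullet might look something like the following sketch (the kernel name, launch configuration, and variable names are illustrative, not necessarily those used in the repository):

// Each thread copies one element from the input array to the output array,
// so the copy is performed by the GPU's compute threads rather than by DMA.
__global__ void copyKernel(int *d_output, const int *d_input, long numElements)
{
    long i = blockIdx.x * (long)blockDim.x + threadIdx.x;
    if (i < numElements)
        d_output[i] = d_input[i];
}

// Timing follows the same event-based pattern as the pinned test, e.g.:
// cudaEventRecord(clockStart, 0);
// copyKernel<<<(numElements + 255) / 256, 256>>>(d_output, d_input, numElements);
// cudaEventRecord(clockStop, 0);
// cudaEventSynchronize(clockStop);
// cudaEventElapsedTime(&trialTime, clockStart, clockStop);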
Unsurprisingly, the GPU-GPU cudaMemcpy operation comes second in our bandwidth competition, primarily because the GPU's internal bus is faster and wider than the external bus; however, because it relies on DMA rather than compute threads, it lacks the punch that the kernel copy provides.
When it comes to CPU-GPU transfers, the clear winner is pinned host memory. Because pinned memory is page-locked to a fixed location in host memory, the GPU can DMA to and from it directly, whereas transfers from paged memory must be staged by the driver through an intermediate pinned buffer. This gives pinned memory a significant performance advantage over paged memory, though it comes at the cost of limiting the total amount of memory that can be allocated: unlike paged memory, which can essentially be as large as a machine's virtual memory allows, pinned memory has a relatively low capacity because it must stay resident in physical RAM.
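To make the pinned-versus-paged comparison concrete, here is a hedged sketch of how the paged variant of the same test would likely differ; only the host allocation and cleanup change (the function name pagedDtoH matches the CSV label, but the body shown here is illustrative rather than the repository's exact code):

float *pagedDtoH(long numElements, int repetitions) {
    cudaEvent_t clockStart, clockStop;
    cudaEventCreate(&clockStart);
    cudaEventCreate(&clockStop);

    int *d_input, *h_output;
    cudaMalloc(&d_input, numElements*sizeof(int));
    // Ordinary malloc gives pageable host memory, so the driver must stage
    // each transfer through an internal pinned buffer instead of DMAing directly.
    h_output = (int *) malloc(numElements*sizeof(int));

    float trialTime;
    float *results = (float *) malloc(sizeof(float)*repetitions);

    // Warm-up transfer to avoid cold-start outliers
    cudaMemcpy(h_output, d_input, numElements*sizeof(int), cudaMemcpyDeviceToHost);

    for (int rep=0; rep < repetitions; rep++)
    {
        cudaEventRecord(clockStart, 0);
        cudaMemcpy(h_output, d_input, numElements*sizeof(int), cudaMemcpyDeviceToHost);
        cudaEventRecord(clockStop, 0);
        cudaEventSynchronize(clockStop);
        cudaEventElapsedTime(&trialTime, clockStart, clockStop);
        results[rep] = trialTime;
    }

    cudaEventDestroy(clockStart);
    cudaEventDestroy(clockStop);
    cudaFree(d_input);
    free(h_output); // plain free() instead of cudaFreeHost()
    return results;
}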
Coming in last in regards to bandwidth is unified memory. Looking at its performance, you might think that it would be best to avoid unified memory altogether, but it is important to realize that these values only represent bandwidth for programs which have equal amounts of memory reads and writes, which is not the intended use of unified memory. Unified memory should be used in cases that require the GPU to read values from a memory location multiple times, with limited writes from the CPU (and potentially the GPU). Using unified memory in this way would result in a more performant program than one which used paged or pinned memory in its place because the memory physically resides on the GPU, which is much faster to access as was discussed earlier.
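As a hedged illustration of the read-heavy access pattern described above, a unified-memory buffer might be set up something like this (the kernel, sizes, and pass count are illustrative only):

// Illustrative kernel: the managed buffer is only read; results go to device memory.
__global__ void readHeavyKernel(const int *data, int *out, long numElements)
{
    long i = blockIdx.x * (long)blockDim.x + threadIdx.x;
    if (i < numElements)
        out[i] = data[i] + data[(i + numElements/2) % numElements];
}

int main()
{
    long numElements = 1 << 20; // illustrative size
    int *data, *out;

    // Unified (managed) allocation: addressable from both CPU and GPU
    cudaMallocManaged(&data, numElements*sizeof(int));
    cudaMalloc(&out, numElements*sizeof(int));

    // The CPU writes the values once...
    for (long i = 0; i < numElements; i++)
        data[i] = (int) i;

    // ...and the GPU reads them many times across repeated launches, so after
    // the first access the pages can stay resident in GPU memory.
    for (int pass = 0; pass < 100; pass++)
        readHeavyKernel<<<(numElements + 255) / 256, 256>>>(data, out, numElements);
    cudaDeviceSynchronize();

    cudaFree(out);
    cudaFree(data);
    return 0;
}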
You may have noticed that this article only explores the bandwidth of global GPU variables. This is because of the behavior of the CUDA compiler as it relates to kernel optimizations. Variables with a smaller scope, such as local and shared variables, are very difficult to time because the compiler is too efficient for its own good in this case - if we were to simply copy array values from one variable to another, as we have done in all of these tests, the copies would simply be removed from the optimized kernel because the compiler recognizes that they have no observable effect.
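A short sketch of the problem, under the assumption that the copy target is shared memory that is never read back (the kernel below is illustrative):

// A shared-memory "copy" like this is hard to time meaningfully: the shared
// array is never read back into global memory, so the writes have no
// observable effect and the compiler is free to eliminate them entirely.
__global__ void sharedCopyKernel(const int *d_input, long numElements)
{
    __shared__ int tile[256];
    long i = blockIdx.x * (long)blockDim.x + threadIdx.x;
    if (i < numElements)
        tile[threadIdx.x] = d_input[i]; // likely removed as dead code
}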
Another direction that you may wish to explore is to vary the balance of reads and writes in order to determine how different I/O profiles affect the bandwidth of each type of transfer.