Performance Experiment: On GPU vs Off GPU Bandwidth
In this experiment, we are going to use the skills we learned in the previous two articles to determine the effective bandwidth of some of the different kinds of memory transfers that CUDA can perform, including:
- Copying memory between locations inside the GPU using a kernel
- Copying data between two on-GPU locations using cudaMemcpy
- Copying data from the GPU to the CPU using paged host memory, pinned host memory, and unified memory
The timing process we want to implement for measuring each of the different transfer types is as follows:
- Allocate a sizeable chunk of memory of the relevant types for the sender and receiver.
- Record the time that the transfer starts
- Transfer the data
- Record the time that the transfer ends
To keep our testing accurate, we should also repeat each timing test a number of times so that our results are representative of the true time each transfer takes.
Because the testing process is essentially the same for each memory transfer type, only a single transfer test will be discussed in detail, but you can look at the entire memory timing program here.
Let's look at the portions of code that time the GPU-CPU pinned memory transfer. First, here's our driver code:
int main(int argc, char *argv[])
{
int num_mibibytes = 128;
int repetitions = 10;
const int MI_B = 1048576; // One MiB or 2^20 bytes
// Determine number of integer elements for each transfer based on desired memory transfer amount
long numElements = num_mibibytes * MI_B / sizeof(int);
In this first segment, we set the amount of memory to transfer and the number of times each transfer should be timed, and determine the number of integer elements to run each test with based on the desired size of the memory transfer.
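With the defaults above, and assuming a four-byte int, that works out to 128 × 1,048,576 = 134,217,728 bytes, or 33,554,432 integer elements per transfer.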
// Add CSV file headers
printf("type,size,unit,numTransfers");
for (int rep=1; rep <= repetitions; rep++)
printf(",run%d",rep);
printf("\n");
// Other tests...
// Test #4: Pinned GPU-CPU memcpy capability
float *pinnedMemcpyTimes = pinnedDtoH(numElements, repetitions);
printf("pinnedDtoH,%d,MB,1",num_mibibytes);
for (int rep=0; rep < repetitions; rep++)
printf(",%f",pinnedMemcpyTimes[rep]);
printf("\n");
free(pinnedMemcpyTimes);
// Other tests...
}
In this second segment we generate headers for our output, which will eventually become a CSV file we can use for analysis later on. It is important to include metadata such as the amount of data transferred and the number of memory transfers each type performs so that other people have everything they need to understand our data set. Otherwise we would just have a list of decimal numbers with no explanation of what they represent.
One item of note is the numTransfers column in our CSV file. It is needed so that we can accurately calculate bandwidth for the different types of memory transfers. A copy performed inside a kernel counts as two memory transfers, since each value is read in one operation and written in the next, while all of the other memory operations we are timing count as single transfers, because the data moves directly from the source to the destination in one step.
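For reference, the kernelCopy test times a kernel of roughly the following shape (a sketch only; the exact kernel is in the repository linked above). Each thread performs one read and one write, which is why a kernel copy is counted as two transfers:
__global__ void copyKernel(int *dst, const int *src, long numElements)
{
    long i = blockIdx.x * (long)blockDim.x + threadIdx.x;
    if (i < numElements)
        dst[i] = src[i]; // one global read (src) plus one global write (dst)
}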
Finally, we run each of the timing tests, print a row in the CSV file holding all of the values of that test, and free the array that held the timing results. That array was allocated and filled by the timing function we are going to look at next, starting with its header:
float *pinnedDtoH(long numElements, int repetitions)
The first thing you may notice is the function's return type: a float pointer. This function returns an array of floats, each of which represents the time one pinned GPU-to-CPU transfer took, so that the driver code can output this information in the proper format.
The two parameters required are the number of elements to transfer, which is calculated by the driver code based on the desired memory transfer size, and the number of times we wish to repeat the test, which directly controls the size of the output array.
{
cudaEvent_t clockStart, clockStop;
cudaEventCreate(&clockStart);
cudaEventCreate(&clockStop);
Next we encounter a part of the CUDA programming model we haven't yet covered: events. We won't go into detail on events here because they will be covered later on, but essentially events serve as place markers along the timeline of execution for both the GPU and the CPU. We can use two such place markers to determine how long a CUDA kernel or API call takes to complete using the cudaEventElapsedTime function, which appears in a later code segment.
int *d_input, *h_output;
cudaMalloc(&d_input, numElements*sizeof(int));
cudaMallocHost(&h_output, numElements*sizeof(int));
float trialTime;
float *results = (float *) malloc(sizeof(float)*repetitions);
// Initial run through to avoid any cold-start outliers
cudaMemcpy(h_output, d_input, numElements*sizeof(int), cudaMemcpyDeviceToHost);
In this section we declare local variables and allocate the memory we need: the device input array, the pinned host output array, and the array in which we will store each transfer's completion time.
We also perform a seemingly pointless mem-copy operation - but it isn't pointless. Transferring the data once before doing it for real prepares the computer running this program by "warming up" the hardware. Not doing this could result in a significant outlier in our data set on the first run of every series of tests.
for (int rep=0; rep < repetitions; rep++)
{
cudaEventRecord(clockStart, 0);
cudaMemcpy(h_output, d_input, numElements*sizeof(int), cudaMemcpyDeviceToHost);
cudaEventRecord(clockStop, 0);
cudaEventSynchronize(clockStop);
cudaEventElapsedTime(&trialTime, clockStart, clockStop);
results[rep] = trialTime;
}
// Events and pinned host memory need their matching cleanup calls
cudaEventDestroy(clockStart);
cudaEventDestroy(clockStop);
cudaFree(d_input);
cudaFreeHost(h_output);
return results;
}
This section of the code contains the actual timing test itself. We mark the beginning of the timing period by recording our clockStart event, perform the transfer, and then mark the end of the timing period by recording clockStop. We then synchronize to ensure that the clockStop event has been recorded before continuing on to get the amount of time that has passed between clockStart and clockStop, which is stored in our results array.
After the test has been repeated enough times, we free the memory used for the test and return the array containing the transfer completion times. Note that the returned array is heap-allocated and must be freed by the caller in order to avoid a memory leak.
After compiling the program, we can run it to get our results printed to the screen:
type,size,unit,numTransfers,run1,run2,run3,run4,run5,run6,run7,run8,run9,run10
kernelCopy,128,MB,2,0.003200,0.002592,0.002560,0.002560,0.002560,0.002560,0.002528,0.002560,0.002592,0.002560
memcpyDtoD,128,MB,1,1.542560,1.546624,1.543520,1.541920,1.542176,1.540736,1.542944,1.541280,1.541280,1.541760
pagedDtoH,128,MB,1,100.505180,13.557280,13.628864,13.905536,13.623616,13.848544,13.616352,13.728224,13.783104,13.718432
pinnedDtoH,128,MB,1,11.680608,11.608160,11.739264,12.016480,11.996864,12.033216,11.740512,12.086304,12.004992,12.327136
unifiedMemcpy,128,MB,1,18.262560,18.268831,18.123360,17.547104,18.448736,18.231968,19.074432,18.911808,19.994112,17.838465
For convenience, this information has been saved to results.csv, which is available in the folder for this experiment in the repository.
Now that we have our experiment data set, which contains the time (in milliseconds, as reported by cudaEventElapsedTime) for each type of memory transfer to complete, let's determine an approximate bandwidth for each of the transfer types.
For this purpose, a small Python data visualization notebook has been provided, available here, which takes the run times, calculates bandwidth using the formula below, and creates box and bar plots showing the bandwidth of each memory transfer type.
We can calculate bandwidth according to the formula bandwidth = memoryAmt * numTransfers / runTime, where memoryAmt is the amount transferred in bytes and runTime is the measured time converted from milliseconds to seconds, so the result is in bytes per second. We then translate the bandwidth into GiB/s by dividing the initial value by 2^30, the number of bytes in a GiB.
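As a quick sanity check of the formula: the median pinnedDtoH time above is roughly 12.0 ms for one 128 MiB transfer, so bandwidth ≈ 134,217,728 bytes × 1 / 0.012 s ≈ 11.2 × 10^9 bytes per second, or about 10.4 GiB/s after dividing by 2^30.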
After this conversion, we get the following visualization for all of our bandwidths:
The red line in each box plot represents the median value for each transfer type, which is the best approximation for our purposes. The mean can also be used to approximate this value, but it is sensitive to strong outliers which are usually prevalent in this kind of data, so in this case the median is a more representative descriptor of each of these sets of values.
The notebook also produces a bar chart containing four of the five median bandwidth values to make comparisons easier between the different types. The kernel copy method is not included because it would skew the chart and make the other four values unreadable.
We can glean a number of valuable insights regarding best practices for memory usage from this data.
Perhaps the most obvious thing we can learn is that copying data inside the GPU using a kernel is much more efficient than any of the CUDA API methods - by a factor of more than 500! This is largely due to two things:
- The GPU has a very wide memory bus - generally 512 bits or more - and a very high bus frequency, meaning that in-GPU I/O will almost always outpace out-of-GPU I/O (the sketch after this list shows how to query these figures for your own card)
- The kernel transfer method relies on compute threads to perform the work of reading and writing memory, whereas API functions use a form of Direct Memory Access to transfer data. DMA is great for transferring data while a kernel is running, but the concurrent memory accesses enabled by the massive parallelism of a kernel can't be beat when you need a fast, in-GPU memory transfer
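If you want to check the first point on your own hardware, the memory clock and bus width can be queried through cudaGetDeviceProperties. Here is a minimal sketch; the factor of 2 assumes double-data-rate memory, so treat the printed figure as a rough theoretical peak rather than an exact number:
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // memoryClockRate is reported in kHz, memoryBusWidth in bits
    double peakGiBps = 2.0 * prop.memoryClockRate * 1000.0
                     * (prop.memoryBusWidth / 8.0) / (1 << 30);

    printf("Bus width: %d bits, approximate peak bandwidth: %.1f GiB/s\n",
           prop.memoryBusWidth, peakGiBps);
    return 0;
}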
Unsurprisingly, the GPU-GPU cudaMemcpy operation comes second in our bandwidth competition, primarily because the GPU's internal bus is faster and wider than the external bus, but it lacks the punch that the kernel copy provides because it relies on DMA rather than thousands of compute threads.
When it comes to CPU-GPU transfers, the clear winner is pinned host memory, due to its being locked to fixed physical pages in host memory. Because those pages can never be swapped out, the driver can DMA directly to and from them without staging the data through an intermediate buffer, which gives pinned memory a significant performance advantage over paged memory, though it comes at the cost of limiting the total amount of memory that can be allocated. Unlike paged memory, which can essentially be as large as the physical storage on a computer allows, pinned memory has a relatively low capacity.
Coming in last in regards to bandwidth is unified memory. Looking at its performance, you might think it would be best to avoid unified memory altogether, but it is important to realize that these values only represent bandwidth for programs with equal amounts of memory reads and writes, which is not the intended use of unified memory. Unified memory should be used in cases where the GPU reads values from a memory location many times, with limited writes from the CPU (and potentially the GPU). Used this way, unified memory can outperform paged or pinned memory because, after the initial migration, the data physically resides on the GPU, which is much faster to access, as was discussed earlier.
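As a rough illustration of that access pattern (the sizes and the someInitialValue helper here are hypothetical), a program might allocate a lookup table with cudaMallocManaged, fill it once on the CPU, prefetch it to the GPU, and then let many kernels read it repeatedly:
int *table;
size_t bytes = numElements * sizeof(int);

cudaMallocManaged(&table, bytes);              // unified allocation, visible to CPU and GPU
for (long i = 0; i < numElements; i++)         // one-time initialization on the CPU
    table[i] = someInitialValue(i);            // hypothetical helper

int device;
cudaGetDevice(&device);
cudaMemPrefetchAsync(table, bytes, device, 0); // migrate the pages to the GPU up front

// Subsequent kernel launches can read `table` at near on-GPU bandwidth,
// because the pages stay resident on the device between launches.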
You may have noticed that this article only explores the bandwidth of global GPU variables. This is because of the behavior of the CUDA compiler as it relates to kernel optimizations. Variables with a smaller scope, such as local and shared variables, are very difficult to time because the compiler is too efficient for its own good in this case: if we were to simply copy array values from one local variable to another, as we have done in all of these tests, the copies would simply be removed from the optimized kernel because the compiler realizes they have no observable effect.
Another direction that you may wish to explore is to vary the balance of reads and writes in order to determine how different I/O profiles affect the bandwidth of each type of transfer.