Asynchronous Memory Transfers

rshipley160 edited this page Dec 4, 2021 · 5 revisions

Previous: CUDA Streams

In the last tutorial, we learned how to execute GPU kernels in parallel with each other and with CPU code by enqueuing tasks onto different streams, according to whether those tasks are independent of or dependent on one another.

In most modern GPUs, we can also reorganize and parallelize GPU memory transfers using streams in order to allow them to run asynchronously to both CPU code and GPU kernels. This can occur because GPUs have one or more separate copy engines that are specifically dedicated to sending and receiving memory.
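We can check how many copy engines a particular device has with the runtime API's `cudaDeviceGetAttribute` function and the `cudaDevAttrAsyncEngineCount` attribute. A minimal sketch (device 0 is assumed here):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int engines = 0;
        // cudaDevAttrAsyncEngineCount reports the number of asynchronous copy engines
        cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, 0);
        printf("Device 0 has %d copy engine(s)\n", engines);
        return 0;
    }

A value of 1 means the device can overlap copies in one direction with kernel execution; a value of 2 means copies in both directions can be overlapped at once.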

Let's take a second look at a typical host-device memory transfer, and then we'll transform it into an asynchronous transfer using streams.

    int h_array[NUM_ELEMENTS];

    int *d_array;
    cudaMalloc(&d_array, sizeof(int)*NUM_ELEMENTS);

    array_fill_1D<<<grid_size, block_size>>>(d_array, NUM_ELEMENTS, 1);

    cudaMemcpy(h_array, d_array, sizeof(int)*NUM_ELEMENTS, cudaMemcpyDeviceToHost);

Here we declare a host array, allocate a device array, fill the device array with 1s using a simple kernel from earlier tutorials, and finally copy the contents of the device array to the host array using the synchronous mem-copy function we are already familiar with.

The first step we are going to have to take is replacing our normal host array initialization

    int h_array[NUM_ELEMENTS];

with a declaration and initialization of pinned or page-locked host memory.

    int *h_array;
    cudaMallocHost(&h_array, sizeof(int)*NUM_ELEMENTS);

This is because asynchronous memory transfers can only be used with memory whose location the GPU can rely on: pinned host memory, device memory, and unified memory. Pageable host memory does not fall into this category because the operating system is free to move (or swap out) its pages at any time, so the data isn't guaranteed to stay at a fixed physical address the way it is with the other kinds of memory.
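One practical note: pinned memory is a limited resource, so it should be released with `cudaFreeHost` rather than `free` once we are done with it. A minimal sketch, assuming `NUM_ELEMENTS` is defined as in the earlier snippets:

    #define NUM_ELEMENTS 1024

    int *h_array;
    cudaMallocHost(&h_array, sizeof(int)*NUM_ELEMENTS);   // pinned (page-locked) host memory

    // ... use h_array as the target of asynchronous transfers ...

    cudaFreeHost(h_array);   // must pair with cudaMallocHost, not free()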

After that the only thing we have to do is create a stream

    cudaStream_t stream;
    cudaStreamCreate(&stream);

And then modify our call to cudaMemcpy into a call to cudaMemcpyAsync, which takes our newly created stream as its final argument.

    cudaMemcpyAsync(h_array, d_array, sizeof(int)*NUM_ELEMENTS, cudaMemcpyDeviceToHost, stream);

One more important point concerns actually using the memory that has been copied asynchronously. To avoid the data synchronization hazards discussed earlier, we need to synchronize the stream before attempting to read the contents of the array, as in the example below. If we did not synchronize the stream before reading from the host array, we would have no guarantee as to what values h_array would contain when printing its elements, since the copy might still be in flight.

    cudaStreamSynchronize(stream);
    for (int i=0; i<NUM_ELEMENTS; i++)
        printf("%d ",h_array[i]);

That completes this tutorial on asynchronous memory transfers. If you want to see the fully assembled example program, you can check it out here. Otherwise, continue on to the next article where we'll take a look at the potential of using multiple streams for completing multiple asynchronous tasks at once.
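For reference, the steps above can be assembled into a single program. This is a sketch under stated assumptions: the `array_fill_1D` kernel body and the `grid_size`/`block_size` values below are stand-ins for the ones used in earlier tutorials, and the kernel is launched on the same stream as the copy so that the copy waits for it.

    #include <cstdio>
    #include <cuda_runtime.h>

    #define NUM_ELEMENTS 1024

    // Stand-in for the array_fill_1D kernel from earlier tutorials
    __global__ void array_fill_1D(int *array, int n, int value) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            array[i] = value;
    }

    int main() {
        // Pinned host memory - required for truly asynchronous transfers
        int *h_array;
        cudaMallocHost(&h_array, sizeof(int)*NUM_ELEMENTS);

        int *d_array;
        cudaMalloc(&d_array, sizeof(int)*NUM_ELEMENTS);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        int block_size = 256;
        int grid_size = (NUM_ELEMENTS + block_size - 1) / block_size;

        // Enqueue the kernel and the copy on the same stream,
        // so the copy begins only after the kernel finishes
        array_fill_1D<<<grid_size, block_size, 0, stream>>>(d_array, NUM_ELEMENTS, 1);
        cudaMemcpyAsync(h_array, d_array, sizeof(int)*NUM_ELEMENTS,
                        cudaMemcpyDeviceToHost, stream);

        // Wait for the stream's work to finish before reading h_array
        cudaStreamSynchronize(stream);
        for (int i = 0; i < NUM_ELEMENTS; i++)
            printf("%d ", h_array[i]);
        printf("\n");

        cudaStreamDestroy(stream);
        cudaFree(d_array);
        cudaFreeHost(h_array);
        return 0;
    }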

Next: Performance Experiment: Multi-stream Parallelism