Using CUDA Memory


Previous: CUDA Memory Types

In this article, we are going to learn how to interact with memory on both the host and the device through a set of short examples that each highlight an aspect of working with the different types of memory found in CUDA.

Copying data between the GPU and CPU

When using separate host and device memories (whether the host memory is pinned or paged), almost all communication between the host and the device has to be facilitated through the CUDA runtime or CUDA driver API.

To show how memory can be copied from the CPU to the GPU (and vice versa) we are going to create a small program that

  1. Creates an array of numbers from the host side
  2. Copies them to the GPU
  3. Uses the GPU to add 1 to each element in parallel
  4. Copies the updated array back to the CPU and prints it out to verify the addition and copies were successful.

Below is our addOne kernel. As you can see, it is very similar to the kernel from the Basic CUDA Syntax example, and is equally simple. In it, we check that each thread's ID is within the bounds of the array, and then simply increment the element that thread corresponds to.

#include <cstdio>

__global__ void addOne(int *array, int size) {
    // Guard against threads whose IDs fall past the end of the array
    if (threadIdx.x < size)
        // Each thread increments the one element matching its thread ID
        array[threadIdx.x] += 1;
}

In our main function, we need to create and allocate our host and device arrays. Because we are not focused on optimizing performance, we will allocate paged host memory rather than page-locked (pinned) memory, but the code for using pinned memory is commented out for your reference.

int main(int argc, char *argv[]) {
    // Allocate a 32-element integer array in paged host memory
    int numElements = 32;
    int *h_array = (int *) malloc(sizeof(int)*numElements);

    // Alternative: page-locked memory
    // int *h_array;
    // cudaMallocHost(&h_array, sizeof(int)*numElements);

    // Allocate a 32-element integer array in device memory
    int *d_array;
    cudaMalloc(&d_array, sizeof(int)*numElements);

Next we use a simple loop to initialize the elements of the host array before finally copying the array from the host to the device:

    // Initialize the array with elements 0, 1, ..., n-1
    for (int i = 0; i < numElements; i++)
        h_array[i] = i;

    cudaMemcpy(d_array, h_array, sizeof(int)*numElements, cudaMemcpyHostToDevice);
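That covers steps 1 and 2. To finish the program (steps 3 and 4), a minimal sketch of the kernel launch, the copy back, and cleanup might look like the following. Because the kernel only indexes with threadIdx.x, a single block of numElements threads is enough here; that launch configuration is our choice for this sketch, not something mandated by CUDA.

    // Launch a single block of numElements threads so each thread
    // increments exactly one element
    addOne<<<1, numElements>>>(d_array, numElements);

    // Copy the results back to the host; cudaMemcpy waits for prior
    // work in the default stream (our kernel) to finish before copying
    cudaMemcpy(h_array, d_array, sizeof(int)*numElements, cudaMemcpyDeviceToHost);

    // Print the array to verify the addition and copies succeeded
    for (int i = 0; i < numElements; i++)
        printf("%d ", h_array[i]);
    printf("\n");

    // Clean up; if you used the pinned allocation above, free the
    // host array with cudaFreeHost(h_array) instead of free
    cudaFree(d_array);
    free(h_array);

    return 0;
}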

Looking back at the host-to-device copy, you can see that cudaMemcpy takes four parameters: a pointer to the destination, a pointer to the source, the number of bytes to copy, and the direction in which the copy takes place.
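For reference, this is the function's declaration in the CUDA runtime API; it returns a cudaError_t that can be compared against cudaSuccess to check whether the copy succeeded:

cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind);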

Destination and Source Pointers

No matter what kind of copy you are performing, the first two arguments will always be a pointer to the destination data structure followed by a pointer to the source of the data to be copied.

Because these are pointers, you can perform address arithmetic on them to copy subsections of arrays or other data structures. For example:

cudaMemcpy(d_array, h_array + 8, sizeof(int)*16, cudaMemcpyHostToDevice);

In this snippet, we copy only the middle 16 elements of the source array from the host to the device by adding 8 to the source pointer (shifting the starting point forward by 8 elements) and copying only 16 elements. Be careful when doing this, as it can easily lead to out-of-bounds accesses that cause your program to fault at runtime.
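The same arithmetic works on the destination pointer. For example, this sketch (reusing the buffers from our program above) copies 16 elements from the start of the device array back into the middle of the host array:

cudaMemcpy(h_array + 8, d_array, sizeof(int)*16, cudaMemcpyDeviceToHost);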

Copy Direction

Any time you use cudaMemcpy (or one of the related functions we'll discuss in later articles), you'll need to provide a copy direction so that CUDA can translate the copy call into the appropriate hardware instruction sequence, which differs depending on the direction of the copy. These directions are:

  • cudaMemcpyHostToDevice
  • cudaMemcpyDeviceToHost
  • cudaMemcpyDeviceToDevice

The first two are fairly self-explanatory and refer to CPU-GPU memory transfers, with the first device type being the source (Host in HostToDevice) and the second device type being the destination. It should also be noted that cudaMemcpyDeviceToDevice refers to copying memory between two separate locations on the same physical GPU; it cannot copy data between two separate GPUs.
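As a quick illustration, a device-to-device copy uses exactly the same call shape as the other directions. This sketch assumes a second, hypothetical device buffer d_copy allocated on the same GPU as d_array:

// Hypothetical second buffer on the same device
int *d_copy;
cudaMalloc(&d_copy, sizeof(int)*numElements);

// Duplicate d_array into d_copy entirely on the GPU, without
// routing the data through the host
cudaMemcpy(d_copy, d_array, sizeof(int)*numElements, cudaMemcpyDeviceToDevice);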