Using CUDA Memory

rshipley160 edited this page Oct 4, 2021 · 5 revisions

Previous: CUDA Memory Types

In this article, we are going to learn how to interact with memory on both the host and the device through a set of short example programs that each highlight an aspect of working with the different types of memory found in CUDA.

Copying data between the GPU and CPU

To show how memory can be copied from the CPU to the GPU (and vice versa), we are going to create a small program that creates an array of numbers on the host side, copies it to the GPU, uses the GPU to add 1 to each element in parallel, and then copies the updated array back to the CPU.

Below is our addOne kernel. As you can see, it is very similar to the kernel from the Basic CUDA Syntax example, and is equally simple. In it, we check to make sure that each thread's ID is within the bounds of the array, and then simply increment the element of the array each thread corresponds to.

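#include <cstdio>
#include <cstdlib>   // malloc

// Each thread increments the one array element matching its thread index
__global__ void addOne(int *array, int size) {
    // Guard against threads whose index falls outside the array
    if (threadIdx.x < size)
        array[threadIdx.x] += 1;
}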

In our main function, we need to create and allocate our host and device arrays. Because we are not focused on optimizing performance, we will allocate paged (pageable) host memory rather than page-locked memory, but the code for using page-locked memory is provided in comments for your reference.

int main(int argc, char *argv[]) {
    // Allocate 32 integer array of paged memory
    int numElements = 32;
    int *h_array = (int *) malloc(sizeof(int)*numElements);

    // Alternative: page-locked memory
    // int *h_array;
    // cudaMallocHost(&h_array, sizeof(int)*numElements);

    // Allocate 32 integer array of device memory
    int *d_array;
    cudaMalloc(&d_array, sizeof(int)*numElements);
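
As a side note to the allocations above, each allocator pairs with its own release function, and the CUDA calls return a cudaError_t status that can be checked. The following is a minimal standalone sketch, separate from the walkthrough's main function. The helper name allocateAndRelease is hypothetical and is not part of the original program.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original article): allocates a pageable
// host array, a page-locked host array, and a device array of `count` ints,
// then releases each one with its matching free call.
// Returns true only if all three allocations succeeded.
bool allocateAndRelease(int count) {
    size_t numBytes = sizeof(int) * count;

    // Pageable host memory from the C allocator
    int *h_paged = (int *) malloc(numBytes);
    bool pagedOk = (h_paged != nullptr);

    // Page-locked (pinned) host memory from the CUDA runtime
    int *h_pinned = nullptr;
    bool pinnedOk = (cudaMallocHost(&h_pinned, numBytes) == cudaSuccess);

    // Device memory
    int *d_array = nullptr;
    bool deviceOk = (cudaMalloc(&d_array, numBytes) == cudaSuccess);

    // Release only what was successfully allocated
    if (deviceOk) cudaFree(d_array);       // pairs with cudaMalloc
    if (pinnedOk) cudaFreeHost(h_pinned);  // pairs with cudaMallocHost
    free(h_paged);                         // pairs with malloc (free(NULL) is a no-op)

    return pagedOk && pinnedOk && deviceOk;
}

int main() {
    bool ok = allocateAndRelease(32);
    printf("all allocations succeeded: %s\n", ok ? "yes" : "no");
    return ok ? 0 : 1;
}
```

The point of the sketch is only the pairing: memory from malloc goes to free, memory from cudaMallocHost goes to cudaFreeHost, and memory from cudaMalloc goes to cudaFree.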

Next we use a simple loop to initialize the elements of the host array before finally copying the array from the host to the device:

    for (int i = 0; i < numElements; i++)
        h_array[i] = i;

    cudaMemcpy(d_array, h_array, sizeof(int)*numElements, cudaMemcpyHostToDevice);

As you can see in this snippet, cudaMemcpy takes four parameters: a pointer to the destination, a pointer to the source, the number of bytes to copy, and the direction in which the copy takes place.
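
To make the parameter order concrete, here is a minimal sketch that copies an array to the device and straight back, checking the status returned by each call. The helper name roundTrip and the buffer names are hypothetical and chosen only for illustration. No kernel is launched, so the data should come back unchanged on a machine with a working CUDA device.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original article): copies `count` ints
// from h_src to the device and back into h_dst.
// Returns the number of mismatched elements, or -1 if a CUDA call failed.
int roundTrip(const int *h_src, int *h_dst, int count) {
    // The third argument of cudaMemcpy is a byte count, not an element count
    size_t numBytes = sizeof(int) * count;

    int *d_buf = nullptr;
    if (cudaMalloc(&d_buf, numBytes) != cudaSuccess)
        return -1;

    // Argument order: destination, source, number of bytes, direction
    cudaError_t err = cudaMemcpy(d_buf, h_src, numBytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_dst, d_buf, numBytes, cudaMemcpyDeviceToHost);

    cudaFree(d_buf);

    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
        return -1;
    }

    // Count elements that did not survive the round trip
    int mismatches = 0;
    for (int i = 0; i < count; i++)
        if (h_src[i] != h_dst[i])
            mismatches++;
    return mismatches;
}

int main() {
    const int numElements = 32;
    int h_in[numElements], h_out[numElements];
    for (int i = 0; i < numElements; i++) {
        h_in[i] = i;
        h_out[i] = -1;  // sentinel so a skipped copy is detectable
    }

    int result = roundTrip(h_in, h_out, numElements);
    printf("mismatched elements: %d\n", result);
    return result == 0 ? 0 : 1;
}
```

Note that the direction flag must agree with where the two pointers actually live: the first copy has a device destination and a host source, and the second copy reverses them.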

Destination and Source Pointers

No matter what kind of copy you are trying to perform, these two operands will always be a pointer to the destination data structure followed by a pointer to the source of the data to be copied.

Because these are pointers, you can perform address arithmetic on them to copy subsections of arrays or other data structures.
In this quick example, we copy only the middle 16 elements of the source array from the host to the device by adding 8 to the source pointer and copying only 16 elements' worth of bytes. Be careful when doing this, as it can easily lead to invalid array accesses, which can cause your program to fault at runtime.