06 - GPU CUDA programming

CPU to GPU: GPUs build on the same concepts as multi-core CPUs, pushed further: more cores, wider SIMD units, and more hardware threads.

Graphics rendering pipeline: Programmers provide mini-programs (shaders) that define the pipeline logic at certain stages. The pipeline executes the shader function for every element of the input stream.

Hack!: Early GPU-based scientific computation.

Set the graphics pipeline's output image size to the desired output array size (e.g. 512x512). The custom fragment shader function is then mapped over the element collection, producing one output value per pixel.

Brook stream programming language: An early compiler that translates a generic stream program into OpenGL commands and a set of OpenGL shader programs.

Running code on CPU vs GPU

For a CPU: the OS loads the binary into memory -> the OS selects a CPU execution context -> the OS interrupts the processor and prepares the execution context -> the processor executes instructions within that execution context.

For the 2007 NVIDIA Tesla GPU: the application allocates buffers in GPU memory -> the application provides the GPU a single kernel program binary -> the application tells the GPU to run the kernel in SPMD fashion.

CUDA: A C-like language for expressing SPMD programs that run on GPUs. OpenCL is the open-standard counterpart of CUDA.

CUDA thread: A logical abstraction similar to a pthread, but the implementation is very different.

CUDA programs: SPMD programs. One program instance is one "CUDA thread". CUDA threads are organized into "thread blocks". Thread IDs can be 1-, 2-, or 3-dimensional.

Example of matrix addition: 12 threads per block, 6 blocks.

CPU application code.

```cpp
const int Nx = 12;
const int Ny = 6;

dim3 threadsPerBlock(4, 3, 1);
dim3 numBlocks(Nx / threadsPerBlock.x, Ny / threadsPerBlock.y, 1);

/* Assume A, B, C are allocated Nx x Ny float arrays */
float *A, *B, *C;

/* This call will execute 12 x 6 CUDA threads */
matrixAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
```

CUDA kernel definition.

```cpp
__global__ void matrixAdd(float A[Ny][Nx],
                          float B[Ny][Nx],
                          float C[Ny][Nx])
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	int j = blockIdx.y * blockDim.y + threadIdx.y;
	C[j][i] = A[j][i] + B[j][i];   // j indexes rows (Ny), i indexes columns (Nx)
}
```

Kernel definitions must manually guard against out-of-bounds array accesses when the array dimensions are not multiples of the thread-block dimensions.

```cpp
const int Nx = 11;  // Not a multiple of threadsPerBlock.x
const int Ny = 5;   // Not a multiple of threadsPerBlock.y

/* Kernel definition */
__global__ void matrixAdd(float A[Ny][Nx],
                          float B[Ny][Nx],
                          float C[Ny][Nx])
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	int j = blockIdx.y * blockDim.y + threadIdx.y;

	if (i < Nx && j < Ny)
		C[j][i] = A[j][i] + B[j][i];
}
```
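
When the dimensions are not multiples of the block size, the host must round the number of blocks up so that the whole array is covered; a minimal sketch of the corresponding launch (assuming the same threadsPerBlock shape as the earlier example):

```cpp
dim3 threadsPerBlock(4, 3, 1);

/* Round up so the grid covers the full Nx x Ny array;
   the extra threads are masked off by the bounds check in the kernel. */
dim3 numBlocks((Nx + threadsPerBlock.x - 1) / threadsPerBlock.x,
               (Ny + threadsPerBlock.y - 1) / threadsPerBlock.y,
               1);

matrixAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
```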

CUDA address spaces: The CPU (host) and GPU (device) have distinct address spaces. Device memory is allocated and freed with cudaMalloc/cudaFree, and data is copied between the address spaces with cudaMemcpy. Pointers allocated with cudaMalloc cannot be dereferenced on the CPU.

```cpp
/* Buffer in host memory */
float *hostA = new float[N];

/* Initialize host address space buffer */
for (int i=0; i<N; i++)
	hostA[i] = (float)i;

int bytes = sizeof(float) * N;
float *deviceA;
cudaMalloc(&deviceA, bytes);

/* Invalid to access deviceA[i] from host code */
```
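
Because deviceA cannot be dereferenced on the CPU, data has to be moved between the two address spaces explicitly. A minimal sketch of copying the host buffer to the GPU and back with cudaMemcpy, and freeing both buffers (continuing the snippet above):

```cpp
/* Copy host buffer into GPU memory */
cudaMemcpy(deviceA, hostA, bytes, cudaMemcpyHostToDevice);

/* ... run kernels that read/write deviceA ... */

/* Copy results back into the host address space */
cudaMemcpy(hostA, deviceA, bytes, cudaMemcpyDeviceToHost);

/* Free memory in both address spaces */
cudaFree(deviceA);
delete[] hostA;
```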

CUDA device memory model

Three types of memory: "GPU (global) memory" that every thread can read/write, "per-thread-block memory" (shared memory) that all threads within a block can read/write, and "per-thread memory" for a thread's local variables.
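
A minimal sketch (hypothetical kernel and variable names) of how the three kinds of memory appear in CUDA source code:

```cpp
__device__ float deviceCounter;                     // Device global memory: visible to all threads

__global__ void memorySpacesExample(float* input)   // input points into device global memory
{
	__shared__ float perBlockBuffer[128];           // Per-thread-block (shared) memory

	float perThreadValue = input[threadIdx.x];      // Per-thread (local) memory

	perBlockBuffer[threadIdx.x] = perThreadValue;
	if (threadIdx.x == 0)
		deviceCounter += perBlockBuffer[0];         // Write back to global memory
}
```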

1D convolution in CUDA (Version 1)

CUDA kernel definition.

```cpp
#define THREADS_PER_BLK 128

__global__ void convolve(int N, float* input, float* output) {
	int index = blockIdx.x * blockDim.x + threadIdx.x;  // Thread local
	float result = 0.0f;              // Thread local
	for (int i=0; i<3; i++)
		result += input[index + i];
	output[index] = result / 3.f;     // Write to device global memory
}
```

CPU application code.

```cpp
int N = 1024 * 1024;

float *devInput, *devOutput;
cudaMalloc(&devInput, sizeof(float) * (N+2));   // +2 elements for the convolution boundary
cudaMalloc(&devOutput, sizeof(float) * N);

/* Initialize devInput here */

convolve<<<N/THREADS_PER_BLK, THREADS_PER_BLK>>>(N, devInput, devOutput);
```
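
Kernel launches are asynchronous with respect to the host. The notes do not show it, but a typical host program would wait for the kernel and check for launch errors roughly as follows (standard CUDA runtime calls):

```cpp
/* Launch errors (e.g. an invalid configuration) are reported asynchronously */
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
	printf("Launch failed: %s\n", cudaGetErrorString(err));

/* Block the host until the convolve kernel has finished */
cudaDeviceSynchronize();
```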

1D convolution in CUDA (Version 2): First load the array elements into the per-thread-block shared memory region. Loading from this region is much faster than loading from GPU global memory. With this change, global memory is accessed roughly 3x less: each input element is loaded from global memory once instead of three times.

```cpp
#define THREADS_PER_BLK 128

__global__ void convolve(int N, float* input, float* output) {
	__shared__ float support[THREADS_PER_BLK + 2];      // Per-thread-block shared memory
	int index = blockIdx.x * blockDim.x + threadIdx.x;  // Thread local

	/* Cooperatively load this block's slice of the input (plus 2 halo elements) */
	support[threadIdx.x] = input[index];
	if (threadIdx.x < 2)
		support[THREADS_PER_BLK + threadIdx.x] =
			input[index + THREADS_PER_BLK];

	__syncthreads();   // Wait until all threads in the block have finished loading

	float result = 0.0f;              // Thread local
	for (int i=0; i<3; i++)
		result += support[threadIdx.x + i];
	output[index] = result / 3.f;     // Write to device global memory
}
```

Resources required per thread block: 128 threads, "B" bytes of local data per thread, and 130 floats (520 bytes) of per-thread-block shared memory.

CUDA compilation: CUDA threads are logical. The GPU does not run one hardware thread per CUDA thread; the compiled kernel describes work that the hardware maps onto its own execution contexts.

NVIDIA GTX 1080 (2016): Every core (SM) has 96 KB of shared memory and 64 hardware thread (warp) contexts; each hardware thread executes 32-wide SIMD instructions.

Interleaved "multi-threading" where 4 warp contexts can be selected out of 64 (Hyperthreading). Simultaneous execution of 4 warps at a time. For any warp, up to two runnable instructions can be done (Instruction-level parallelism).

1D convolution thread-block assignment

On NVIDIA GPUs, groups of 32 CUDA threads share one instruction stream. These groups are called "warps". A convolve thread-block is executed by 4 warps (32 x 4 = 128 CUDA threads).

The host sends the CUDA device a command. -> The CUDA device schedules the thread-blocks onto its cores. -> Only 4 thread-blocks fit at one time due to the per-thread-block memory limitation.
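
Not part of the original notes, but the number of resident thread-blocks per core can also be queried at runtime with the CUDA occupancy API; a minimal sketch for the convolve kernel above:

```cpp
int blocksPerSM = 0;

/* Ask the runtime how many convolve thread-blocks (128 threads each,
   no dynamically allocated shared memory) fit on one multiprocessor. */
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, convolve,
                                              THREADS_PER_BLK, 0);

printf("Resident convolve blocks per SM: %d\n", blocksPerSM);
```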