Copyright (c) E. Krishnasamy, 2013-2023 UL HPC Team <[email protected]>
- Understanding the OpenACC programming model
- How to use some of the directives from OpenACC to parallelize the code - compute constructs, loop constructs, data clauses
- Implementing OpenACC parallel strategy in C/C++ and FORTRAN programming languages
- Simple mathematical examples to support and understand the OpenACC programming model
- Finally, show you how to run these examples on the Iris cluster (ULHPC) - both interactively and using a batch job script
- C/C++ and/or FORTRAN languages
- OpenMP or some basic parallel programming concepts (an advantage, but not necessary)
NOTE: This lecture is limited to just 45 min; it only covers a basic-to-intermediate introduction to OpenACC. To learn more about CUDA programming and the OpenACC programming model (from basic to advanced), please refer to the PRACE MOOC GPU Programming for Scientific Computing and Beyond - Dr. Ezhilmathi Krishnasamy and Prof. Pascal Bouvry
Ensure you are able to connect to the UL HPC clusters.
In particular, recall that the module
command is not available on the access frontends. For all tests and compilation, you MUST work on a computing node.
Now you'll need to pull the latest changes in your working copy of the ULHPC/tutorials repository, which you should have cloned into ~/git/github.com/ULHPC/tutorials
(see "preliminaries" tutorial)
(access)$ cd ~/git/github.com/ULHPC/tutorials
(access)$ git pull
Now configure a dedicated directory ~/tutorials/openacc
for this session
# return to your home directory
(access)$ cd
(access)$ mkdir -p ~/tutorials/openacc
(access)$ cd ~/tutorials/openacc
# create a symbolic link to the top reference material
(access)$ ln -s ~/git/github.com/ULHPC/tutorials/gpu/openacc/basics ref.d # Symlink to the reference tutorial material
# copy / synchronize a copy of the exercises
(access)$ rsync -avzu ref.d/exercises . # DO NOT forget the trailing .
Advanced users (optional, yet strongly recommended): create a Tmux session (see the Tmux cheat sheet and tutorial) or a GNU Screen session you can recover later. See also the "Getting Started" tutorial.
See also GPU jobs documentation.
$ ssh iris-cluster # The only cluster featuring GPUs
$ si-gpu -G 1 --ntasks-per-node 1 -c 7 -t 00:30:00 # (optionally) --reservation=hpcschool-gpu
$ si-gpu -G 1 --reservation=hpcschool-gpu -N 1 -c 14 -t 01:00:00
$> module spider nvhpc
$> module load compiler/NVHPC/21.2
- A CPU core runs at a higher clock frequency than a GPU core.
- But a GPU can run many more threads in parallel than a CPU.
- On a GPU, the cores are grouped into "Streaming Multiprocessors (SMs)".
- Nvidia GPUs also provide "Tensor Cores" to handle AI/ML computations in an optimized way.
- GPUs are based on the "Single Instruction Multiple Threads (SIMT)" execution model.
- Threads are executed in groups on the GPU, typically of 32 threads; these groups are called "warps" on Nvidia GPUs and "wavefronts" on AMD GPUs (the sketch after this list matches the OpenACC vector length to the warp size).
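A minimal sketch of that mapping (a hypothetical example, not part of the tutorial sources): gangs correspond to thread blocks scheduled on the SMs, and a vector length of 32 makes each vector of threads match one Nvidia warp.

// warp_sketch.c -- hypothetical example: match the OpenACC vector length to the warp size (32)
#include <stdio.h>
#define N 1024

int main(void)
{
    float x[N];

    // Gangs map to thread blocks on the SMs; each vector of 32 threads maps to one warp.
    #pragma acc parallel loop gang vector vector_length(32) copyout(x[0:N])
    for (int i = 0; i < N; i++)
        x[i] = 2.0f * i;

    printf("x[%d] = %f\n", N - 1, x[N - 1]);
    return 0;
}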
- Step 1: prepare the application and allocate/initialize the memory on both the CPU and the GPU
- Step 2: transfer the input data to the GPU
- Step 3: do the computation on the GPU
- Step 4: transfer the results back to the CPU
- Step 5: finalize the application and free the memory on both the CPU and the GPU (a sketch mapping these steps onto OpenACC directives follows below)
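A minimal sketch of that mapping, using a hypothetical saxpy example that is not taken from the exercise files:

// saxpy_steps.c -- hypothetical example mapping the five steps onto OpenACC
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n = 1 << 20;

    // Step 1: prepare the application, allocate and initialize the memory on the CPU
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    // Steps 2-4: copyin transfers x and y to the GPU, the loop runs on the GPU,
    // and copy brings the updated y back to the CPU when the region ends
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);   // expected: 4.0

    // Step 5: finalize the application and free the CPU memory
    // (the GPU copies were released automatically at the end of the parallel region)
    free(x);
    free(y);
    return 0;
}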
- OpenACC is not GPU programming
- OpenACC is a way of expressing the parallelism in your existing code
- OpenACC can be used on both Nvidia and AMD GPUs
- “OpenACC will enable programmers to easily develop portable applications that maximize the performance and power efficiency benefits of the hybrid CPU/GPU architecture of Titan.” - Buddy Bland, Titan Project Director, Oak Ridge National Lab
- “OpenACC is a technically impressive initiative brought together by members of the OpenMP Working Group on Accelerators, as well as many others. We look forward to releasing a version of this proposal in the next release of OpenMP.” - Michael Wong, CEO of OpenMP Directives Board
- Libraries: easy to use with very limited knowledge of GPU programming - cuBLAS, cuFFT, CUDA Math Library, etc.
- Directive-based programming models: accelerate the application by adding directives to the existing code - OpenACC and OpenMP (with target offload)
- Programming languages: low-level programming languages that allow further optimization of the application on the accelerator - CUDA, OpenCL, etc.
- OpenACC is supported by the Nvidia, PGI, GCC, and HPE Cray (FORTRAN only) compilers
- PGI is now part of Nvidia, and its compilers are available through the Nvidia HPC SDK
- Compute constructs:
- parallel and kernels
- Loop constructs:
- loop, collapse, gang, worker, vector, etc.
- Data management clauses:
- copy, create, copyin, copyout, delete and present
- Others:
- reduction, atomic, cache, etc.
- More information about the OpenACC directives can be found here; a small example combining these categories is sketched below.
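This hedged sketch (an assumed example, not one of the exercises) uses a compute construct (kernels), a loop construct with gang and vector clauses, data clauses (copyin, copy) and an atomic update from the "others" category:

// categories_sketch.c -- assumed example touching each category of directives
#include <stdio.h>
#define N 1000

int main(void)
{
    float a[N];
    int count = 0;                              // how many elements are above 0.5
    for (int i = 0; i < N; i++) a[i] = (float)i / N;

    #pragma acc kernels copyin(a[0:N]) copy(count)   // compute construct + data clauses
    {
        #pragma acc loop gang vector                 // loop construct with loop clauses
        for (int i = 0; i < N; i++) {
            if (a[i] > 0.5f) {
                #pragma acc atomic update            // "others": atomic
                count++;
            }
        }
    }

    printf("count = %d\n", count);              // expected: 499
    return 0;
}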
// C/C++
#include "openacc.h"
#pragma acc <directive> [clauses [[,] clause] . . .] new-line
<code>
!! Fortran
use openacc
!$acc <directive> [clauses [[,] clause] . . .]
<code>
// Hello_World.c | // Hello_World_OpenACC.c
void Print_Hello_World() | void Print_Hello_World()
{ | {
| #pragma acc kernels
for(int i = 0; i < 5; i++) | for(int i = 0; i < 5; i++)
{ | {
printf("Hello World!\n"); | printf("Hello World!\n");
} | }
} | }
- compilation:
nvc -fast -Minfo=all -acc=gpu -gpu=cc70 Hello_World.c
- The compiler already gives a lot of information; what do you see?
$> nvc -fast -Minfo=all -acc=gpu -gpu=cc70 Hello_World.c
Print_Hello_World:
6, Loop not parallelized: contains call
main:
14, Print_Hello_World inlined, size=4 (inline) file Hello_World.c (5)
6, Loop not vectorized/parallelized: contains call
- Now add either kernels or parallel directives to vectorize/parallelize the loop
$> nvc -fast -Minfo=all -acc=gpu -gpu=cc70 Hello_World_OpenACC.c
print_hello_world:
6, Loop is parallelizable
Generating Tesla code
6, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
- As we can see above, the loop is parallelized!
!! Hello_World.f90 | !! Hello_World_OpenACC.f90
subroutine Print_Hello_World() | subroutine Print_Hello_World()
integer :: i | integer :: i
| !$acc kernels
do i = 1, 5 | do i = 1, 5
print *, "hello world" | print *, "hello world"
end do | end do
| !$acc end kernels
end subroutine Print_Hello_World | end subroutine Print_Hello_World
- Compile Hello_World.f90; the compiler tells us that the loop is not parallelized.
$> nvfortran -fast -Minfo=all -acc=gpu -gpu=cc70 Hello_World.f90
print_hello_world:
5, Loop not vectorized/parallelized: contains call
- Now compile Hello_World_OpenACC.f90 (using either kernels or parallel), and we can already notice that the loop is vectorized/parallelized.
$> nvfortran -fast -Minfo=all -acc=gpu -gpu=cc70 Hello_World_OpenACC.f90
print_hello_world:
6, Loop is parallelizable
Generating Tesla code
6, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
Note: the above example shows how to create a parallel region using parallel or kernels. The kernels construct is especially useful when multiple regions need to be parallelized; however, the above example has just one parallel region (the sketch below contrasts the two constructs).
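A small sketch (a hypothetical example, not in the exercise files): with kernels the compiler analyses the region and decides what is safe to parallelize, possibly over several loops, while with parallel loop the programmer asserts that the loop is parallel.

// constructs_sketch.c -- hypothetical comparison of kernels and parallel loop
void scale_and_shift(float *restrict a, float *restrict b, int n)
{
    // One kernels region may cover several loops; the compiler generates
    // GPU code for each loop it can prove independent.
    #pragma acc kernels copy(a[0:n], b[0:n])
    {
        for (int i = 0; i < n; i++) a[i] = 2.0f * a[i];
        for (int i = 0; i < n; i++) b[i] = a[i] + 1.0f;
    }

    // With parallel loop the programmer takes the responsibility:
    // the loop is parallelized whether or not the compiler can prove it safe.
    #pragma acc parallel loop copy(b[0:n])
    for (int i = 0; i < n; i++)
        b[i] = b[i] - 1.0f;
}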
- Here we consider a simple vector addition example for the OpenACC loop directive
// Vector_Addition.c | // Vector_Addition_OpenACC.c
float * Vector_Addition | float * Vector_Addition
(float *restrict a, float *restrict b,| (float *restrict a, float *restrict b,
float *restrict c, int n) | float *restrict c, int n)
{ | {
                                       | #pragma acc kernels loop \
                                       |   copyin(a[0:n], b[0:n]) copyout(c[0:n])
for(int i = 0; i < n; i ++) | for(int i = 0; i < n; i ++)
{ | {
c[i] = a[i] + b[i]; | c[i] = a[i] + b[i];
} | }
   return c;                           |    return c;
 }                                     | }
- The loop directive parallelizes the for loop and also accepts other OpenACC clauses, for example copyin and copyout here.
- The above example needs two vectors to be copied to the GPU, and one vector has to be transferred back to the CPU.
- copyin creates the memory on the GPU and transfers the data from the CPU to the GPU.
- copyout creates the memory on the GPU and transfers the data back from the GPU to the CPU at the end of the region (see the data-region sketch below).
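When the same arrays are used by several loops, a structured data region avoids transferring them at every loop. Below is a sketch (an assumed example, not part of the exercises) of such a region: copyin uploads a and b once, copyout allocates c on the GPU and downloads it when the region ends, and present tells the compiler the arrays are already on the device.

// data_region_sketch.c -- assumed example of a structured data region
void iterate(float *restrict a, float *restrict b, float *restrict c, int n, int steps)
{
    #pragma acc data copyin(a[0:n], b[0:n]) copyout(c[0:n])
    {
        for (int s = 0; s < steps; s++) {
            // The arrays stay on the GPU for the whole data region,
            // so each iteration only launches the kernel, without transfers.
            #pragma acc parallel loop present(a[0:n], b[0:n], c[0:n])
            for (int i = 0; i < n; i++)
                c[i] = a[i] + (float)s * b[i];
        }
    }
}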
!! Vector_Addition.f90 | !! Vector_Addition_OpenACC.f90
module Vector_Addition_Mod | module Vector_Addition_Mod
implicit none | implicit none
contains | contains
subroutine Vector_Addition(a, b, c, n) | subroutine Vector_Addition(a, b, c, n)
!! Input vectors | !! Input vectors
real(8), intent(in), dimension(:) :: a | real(8), intent(in), dimension(:) :: a
real(8), intent(in), dimension(:) :: b | real(8), intent(in), dimension(:) :: b
real(8), intent(out), dimension(:) :: c | real(8), intent(out), dimension(:) :: c
integer :: i, n | integer :: i, n
                                        | !$acc kernels loop copyin(a(1:n), b(1:n)) &
                                        | !$acc copyout(c(1:n))
do i = 1, n | do i = 1, n
c(i) = a(i) + b(i) | c(i) = a(i) + b(i)
end do | end do
| !$acc end kernels
end subroutine Vector_Addition | end subroutine Vector_Addition
end module Vector_Addition_Mod | end module Vector_Addition_Mod
- Now compile and run the above code as we did previously.
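For example (assuming the exercise sources include a main program/driver, and using the same compiler flags and GPU architecture as before):
$> nvc       -fast -Minfo=all -acc=gpu -gpu=cc70 Vector_Addition_OpenACC.c   -o vector_addition_c
$> nvfortran -fast -Minfo=all -acc=gpu -gpu=cc70 Vector_Addition_OpenACC.f90 -o vector_addition_f90
$> ./vector_addition_c
$> ./vector_addition_f90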
// Matrix_Multiplication.c             | // Matrix_Multiplication_OpenACC.c
 for(int row = 0; row < width ; ++row) |
 {                                     | #pragma acc kernels loop collapse(2) \
for(int col = 0; col < width ; ++col) | reduction(+:sum)
{ | for(int row = 0; row < width ; ++row)
sum=0; | {
for(int i = 0; i < width ; ++i) | for(int col = 0; col < width ; ++col)
{ | {
sum += a[row*width+i] | sum=0;
* b[i*width+col]; | for(int i = 0; i < width ; ++i)
} | {
c[row*width+col] = sum; | sum += a[row*width+i]
} | * b[i*width+col];
} | }
| c[row*width+col] = sum;
| }
| }
!! Matrix_Multiplication.f90 |!! Matrix_Multiplication_OpenACC.f90
 do row = 0, width-1                   | !$acc kernels loop collapse(2) reduction(+:sum)
do col = 0, width-1 | do row = 0, width-1
sum=0 | do col = 0, width-1
do i = 0, width-1 | sum=0
sum = sum + (a((row*width)+i+1) | do i = 0, width-1
* b((i*width)+col+1)) | sum = sum + (a((row*width)+i+1)
enddo | * b((i*width)+col+1))
c(row*width+col+1) = sum | enddo
enddo | c(row*width+col+1) = sum
enddo | enddo
| enddo
                                       | !$acc end kernels
- The reduction clause is needed when we want to sum an array or do any counting inside the parallel region; it improves performance and avoids errors (races) in the total sum.
- The above examples show how to use it in the C/C++ and FORTRAN languages; a small standalone example is sketched below.
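Here is that sketch (a hypothetical dot-product example, not one of the exercises): each gang/vector keeps a private partial sum and OpenACC combines them safely at the end, instead of letting all threads race on a single variable.

// reduction_sketch.c -- hypothetical dot product using the reduction clause
#include <stdio.h>
#define N 100000

int main(void)
{
    static float a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    float dot = 0.0f;
    #pragma acc parallel loop reduction(+:dot) copyin(a[0:N], b[0:N])
    for (int i = 0; i < N; i++)
        dot += a[i] * b[i];

    printf("dot = %f (expected %f)\n", dot, 2.0f * N);
    return 0;
}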
- Try the simple Hello_World_OpenACC.c and Hello_World_OpenACC.f90 with the OpenACC parallel constructs, and try to understand what the compiler produces.
- Do the simple Vector_Addition_OpenACC.c and Vector_Addition_OpenACC.f90 with the OpenACC parallel constructs and use the data clauses for data management.
- Similarly, try Vector_Addition_OpenACC.c and Vector_Addition_OpenACC.f90 with the OpenACC parallel constructs, use the data clauses for data management, and include the clauses that control the threads (gang, worker and vector).
- Finally, do Matrix_Multiplication_OpenACC.c and Matrix_Multiplication_OpenACC.f90 and use the reduction clause along with the parallel constructs and the data management clauses.
- Similarly, include the thread-block clauses (gang and vector) for Matrix_Multiplication_OpenACC.c and Matrix_Multiplication_OpenACC.f90.