This repository also contains the GPU-accelerated version of Quantum ESPRESSO.
This version is tested against PGI (now nvfortran) compilers v. >= 17.4.
The configure script checks for the presence of a PGI compiler and of a few
CUDA libraries. For this reason the path pointing to the CUDA toolkit must be
present in LD_LIBRARY_PATH.
A template for the configure command is:
./configure --with-cuda=XX --with-cuda-runtime=YY --with-cuda-cc=ZZ --enable-openmp [ --with-scalapack=no ]
where XX is the location of the CUDA Toolkit (in HPC environments it is
generally $CUDA_HOME), YY is the version of the CUDA toolkit and ZZ is the
compute capability of the card.
If you have no idea what these numbers are, you may try the automatic tool
get_device_props.py. An example using Slurm is:
$ module load cuda
$ cd dev-tools
$ salloc -n1 -t1
[...]
salloc: Granted job allocation xxxx
$ srun python get_device_props.py
[...]
Compute capabilities for dev 0: 6.0
Compute capabilities for dev 1: 6.0
Compute capabilities for dev 2: 6.0
Compute capabilities for dev 3: 6.0
If all compute capabilities match, configure QE with:
./configure --with-cuda=$CUDA_HOME --with-cuda-cc=60 --with-cuda-runtime=9.2
It is generally a good idea to disable ScaLAPACK when running small test cases,
since the serial GPU eigensolver can outperform the parallel CPU eigensolver in
many circumstances.
From time to time PGI links to the wrong CUDA libraries and fails, reporting a
problem in cusolver about a missing GOmp (GNU OpenMP). The solution to this
problem is removing the CUDA toolkit from LD_LIBRARY_PATH before compiling.
Serial compilation is also supported.
By default, GPU support is active. The following message will appear at the beginning of the output:
GPU acceleration is ACTIVE.
GPU acceleration can be switched off by setting the following environment variable:
$ export USEGPU=no
The current GPU version passes all 186 tests with both parallel and serial
compilation. The testing suite should only be used to check the correctness of
pw.x. Therefore only make run-tests-pw-parallel and make run-tests-pw-serial
should be used.
The following naming conventions are used:
- Variables allocated on the device must end with _d.
- Subroutines and functions replicating an algorithm on the GPU must end with _gpu.
- Modules must end with _gpum.
- Files with duplicated source code must end with _gpu.f90.
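As a purely illustrative sketch of these conventions, a duplicated module could look as follows (the names wavefun_gpum, psi_d and the file name wavefun_gpu.f90 are invented for this example and are not taken from the QE sources):

! hypothetical file wavefun_gpu.f90
MODULE wavefun_gpum
#if defined(__CUDA)
   USE cudafor
#endif
   IMPLICIT NONE
   ! device copy of an (equally hypothetical) CPU array psi from another module
   REAL(8), ALLOCATABLE :: psi_d(:)
#if defined(__CUDA)
   attributes(DEVICE) :: psi_d
#endif
END MODULE wavefun_gpum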
PW functionalities are ported to the GPU by duplicating the subroutines and functions that operate on CPU variables. The number of arguments does not change, but input and output data may refer to device variables where applicable.
Bifurcations in the code flow happen at runtime, with code similar to:
use control_flags, only : use_gpu
[...]
if (use_gpu) then
call subroutine_gpu(arg_d)
else
call subroutine(arg)
end if
At each bifurcation point it should be possible to remove the call to the accelerated routine without breaking the code. Note however that calling both the CPU and the GPU version of a subroutine in the same place may break the code execution.
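A minimal sketch of the duplication pattern is given below; the routine names double_data and double_data_gpu, and the CUF kernel loop, are invented for illustration only. The argument list is the same in both versions, but the _gpu routine operates on a device array:

! CPU version
SUBROUTINE double_data(n, a)
   IMPLICIT NONE
   INTEGER, INTENT(IN) :: n
   REAL(8), INTENT(INOUT) :: a(n)
   a = 2.d0 * a
END SUBROUTINE double_data

! GPU duplicate: same arguments, but a_d lives on the device
SUBROUTINE double_data_gpu(n, a_d)
#if defined(__CUDA)
   USE cudafor
#endif
   IMPLICIT NONE
   INTEGER, INTENT(IN) :: n
   REAL(8), INTENT(INOUT) :: a_d(n)
#if defined(__CUDA)
   attributes(DEVICE) :: a_d
#endif
   INTEGER :: i
   ! the loop runs on the GPU when compiled with CUDA Fortran support
   !$cuf kernel do(1) <<<*,*>>>
   DO i = 1, n
      a_d(i) = 2.d0 * a_d(i)
   END DO
END SUBROUTINE double_data_gpu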
[ DISCLAIMER STARTS ] What is described below is not the method that will be integrated in the final release. Nonetheless it happens to be a good approach for:
- simplifying the alignment of this fork with the main repository,
- debugging,
- tracing the evolution of memory paths as the CPU version evolves,
- (in the future) reporting on the set of global variables that should be kept to guarantee a certain speedup.
For example, this simplified the integration of the changes that took place to modernize the I/O. [ DISCLAIMER ENDS ]
Global GPU data are tightly linked to global CPU data. One cannot allocate global variables on the GPU manually: the global GPU variables follow the allocation and deallocation of the CPU ones. This is an automatic mechanism enforced by the managed memory system. In what follows, I will refer to a duplicated GPU variable as a "duplicated variable" and to the equivalent CPU variable as the "parent variable".
Global variables in modules are synchronized through calls to subroutines named
using_xxx and using_xxx_d, with xxx being the name of the variable in the module
globally accessed by multiple subroutines.
These subroutines accept one argument that replicates the role of the intent
attribute. Acceptable values are:
- 0: the variable will only be read (equivalent to intent(in)),
- 1: the variable will be read and written (equivalent to intent(inout)),
- 2: the variable will only be (entirely) updated (equivalent to intent(out)).
Function and subroutine calls having global variables in their arguments should
be guarded by calls to using_xxx with the appropriate argument. Obviously, calls
with argument 0 and 1 must always be prepended.
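For a hypothetical global array xxx handled by this mechanism (compute_something, compute_something_gpu and rebuild_xxx are invented placeholders), the guards would be placed like this:

! read-only use on the CPU: make sure the host copy is up to date
CALL using_xxx(0)
CALL compute_something(xxx)
! read-and-write use on the GPU: refresh the device copy if needed and
! flag the host copy as out of date
CALL using_xxx_d(1)
CALL compute_something_gpu(xxx_d)
! the variable is about to be entirely overwritten on the CPU: no copy is
! performed, but the device copy is flagged as out of date
CALL using_xxx(2)
CALL rebuild_xxx(xxx)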
The actual allocation of a duplicated variable happens when using_xxx_d is
called and the parent variable is allocated.
Deallocation happens when using_xxx_d(2) is called and the CPU variable is not
allocated.
Data synchronization (done with synchronous copies, i.e. the overloaded
cudaMemcpy) happens when either the CPU or the GPU memory is found to be flagged
"out of date" by a previous call to using_xxx(1), using_xxx(2), using_xxx_d(1)
or using_xxx_d(2).
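As a rough, self-contained sketch of this bookkeeping (this is not the code generated in QE; all names are hypothetical and only the host-to-device direction is shown), a using_xxx_d routine could behave as follows:

MODULE my_data                        ! hypothetical module owning the parent variable
   IMPLICIT NONE
   REAL(8), ALLOCATABLE :: xxx(:)
END MODULE my_data

MODULE my_data_gpum                   ! hypothetical duplicated module
   USE my_data, ONLY : xxx
#if defined(__CUDA)
   USE cudafor
#endif
   IMPLICIT NONE
   REAL(8), ALLOCATABLE :: xxx_d(:)
#if defined(__CUDA)
   attributes(DEVICE) :: xxx_d
#endif
   LOGICAL :: xxx_d_ood = .TRUE.      ! is the device copy out of date?
CONTAINS
   SUBROUTINE using_xxx_d(intent)
      INTEGER, INTENT(IN) :: intent   ! 0 = read, 1 = read/write, 2 = overwrite
      IF (ALLOCATED(xxx)) THEN
         ! allocation of the duplicated variable follows the parent variable
         IF (.NOT. ALLOCATED(xxx_d)) ALLOCATE(xxx_d(SIZE(xxx)))
         ! synchronous copy (overloaded cudaMemcpy) only if the device copy is
         ! stale and will actually be read
         IF (xxx_d_ood .AND. intent < 2) xxx_d = xxx
         xxx_d_ood = .FALSE.
         ! the real mechanism also flags the host copy as out of date when
         ! intent is 1 or 2, so that a later call to using_xxx copies data back
      ELSE IF (intent == 2) THEN
         ! deallocation: intent 2 with an unallocated parent variable
         IF (ALLOCATED(xxx_d)) DEALLOCATE(xxx_d)
      END IF
   END SUBROUTINE using_xxx_d
END MODULE my_data_gpum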
Calls to using_xxx_d should only happen in GPU functions and subroutines.
This rule can be avoided if the call is protected by ifdefs, which is useful if
you are lazy and a global variable is updated only a few times.
An example is the g vectors, which are set in a few places (at initialization,
after a scaling of the Hamiltonian, etc.) and are used everywhere in the code.
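One possible pattern, sticking to the hypothetical xxx variable used above (build_xxx is an invented CPU routine that rewrites it), is to refresh the device copy eagerly right where the rare CPU update happens, behind the CUDA preprocessor guard:

CALL using_xxx(2)       ! the host copy is about to be entirely rewritten
CALL build_xxx(xxx)     ! hypothetical CPU routine that (re)defines the global
#if defined(__CUDA)
! calling using_xxx_d outside GPU code is tolerated only behind ifdefs: push the
! fresh data to the device once so that the many GPU consumers need no guard
CALL using_xxx_d(0)
#endif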
Finally, there are global variables that are only updated by subroutines residing inside the same module. The allocation and the update of the duplicated counterpart become trivial and are simply done at the same time as for the CPU variable. At the time of writing this constitutes an exception to the general rule, but it is actually the result of the efforts made in the last year to modularize the code, and it is probably the correct way to deal with duplicated data in the code.
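A sketch of this module-owned pattern, again with invented names (the module potential_mod and its array pot are not QE identifiers), where the duplicated array is kept in step with its parent by the module's own setter:

MODULE potential_mod            ! hypothetical module owning both copies
#if defined(__CUDA)
   USE cudafor
#endif
   IMPLICIT NONE
   REAL(8), ALLOCATABLE :: pot(:)
   REAL(8), ALLOCATABLE :: pot_d(:)
#if defined(__CUDA)
   attributes(DEVICE) :: pot_d
#endif
CONTAINS
   SUBROUTINE set_pot(n, v)
      INTEGER, INTENT(IN) :: n
      REAL(8), INTENT(IN) :: v(n)
      IF (.NOT. ALLOCATED(pot)) ALLOCATE(pot(n))
      pot = v
#if defined(__CUDA)
      ! the duplicated variable is allocated and refreshed in lockstep with its
      ! parent, so no external using_xxx bookkeeping is needed for this module
      IF (.NOT. ALLOCATED(pot_d)) ALLOCATE(pot_d(n))
      pot_d = pot               ! host -> device copy (overloaded cudaMemcpy)
#endif
   END SUBROUTINE set_pot
END MODULE potential_mod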