(sec:vectors)=
# Vectors

Assume we want to compute the vector addition $y \leftarrow a x + b y$ and the scalar product $\sum_i x_i y_i$ of two small vectors.
First, we have to include the main FELTOR library header:

```cpp
// include the dg-library
#include "dg/algorithm.h"
#include <array>
#include <iostream>

double a = 0.5, b = 0.25;
std::array<double, 2> x = {2,2}, y = {4,4};
// compute a*x+b*y and store it in y
dg::blas1::axpby( a, x, b, y);
// compute Sum_i x_i y_i
double sum = dg::blas1::dot( x,y);
// output should be 8
std::cout << sum << std::endl;
// 8
```
In this code we encounter our first two dg functions, namely `dg::blas1::axpby` and `dg::blas1::dot`. They perform very basic operations, namely adding vectors and computing scalar products, respectively. (You can look up their formal documentation here.)

The remarkable thing about these two functions is that they are templates. This means you can call them for many different vector classes. We can change the type of `x` and `y`:
```cpp
// works as well
std::vector<double> x(2,2), y(2,4);
dg::blas1::axpby( a, x, b, y);
double sum = dg::blas1::dot( x,y);
std::cout << sum << std::endl;
// 8
```

or

```cpp
thrust::host_vector<double> x(2,2), y(2,4);
dg::blas1::axpby( a, x, b, y);
double sum = dg::blas1::dot( x,y);
std::cout << sum << std::endl;
// 8
```
All three versions give the same result. All of these examples execute on a single CPU thread, that is, the compiler chooses the same **serial** implementation of the vector addition and the scalar product. So let us increase the vector size to, say, a million. Wouldn't it be better to perform these operations in parallel? And a measurement of the execution time would be nice:
```cpp
// use the thrust library to allocate memory on the device
thrust::device_vector<double> x(1e6,2), y(1e6,4);
// create a Timer
dg::Timer t;
// start the clock
t.tic();
dg::blas1::axpby( a, x, b, y);
// stop the clock
t.toc();
std::cout << "Axpby took "<<t.diff()<<"s\n";
t.tic();
double sum = dg::blas1::dot( x,y);
t.toc();
std::cout << "Dot took "<<t.diff()<<"s\n";
// output should be ... large
std::cout << sum << std::endl;
// Axpby took 0.913942s
// Dot took 0.883924s
// 4e+06
```
The first thing to notice is that we now use the `thrust::device_vector` class. This is a vector class of the thrust library, which allocates memory on a GPU. The compiler recognizes that `dg::blas1::axpby` and `dg::blas1::dot` are now called with a GPU vector class and redirects the call to the corresponding CUDA implementation. The `dg::Timer` measures the time between its `tic()` and `toc()` calls; `diff()` returns the elapsed time in seconds.

When using a GPU, a natural question is how synchronization is done between the GPU kernel queue and the host execution. The good news is that you do not need to manually synchronize the GPU. The library will do it for you whenever it is needed.

If you do not have a GPU you can instead define the macro `THRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP`, in which case the calls redirect to an OpenMP parallelized version, or `THRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CPP` (as is done in this notebook, because xeus-cling does not support OpenMP or CUDA very well), in which case the calls redirect to the serial version again. This is a specialty of the `thrust::device_vector` class of the thrust library that is included in `dg/algorithm.h`.
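For example, a minimal sketch of how to select the OpenMP backend; the macro is more commonly set with a compiler flag, and the exact build flags (e.g. `-fopenmp`) depend on your compiler:

```cpp
// Sketch: select the OpenMP backend.
// This must happen before any dg/thrust header is included; usually one
// passes -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP (plus -fopenmp)
// on the compiler command line instead.
#define THRUST_DEVICE_SYSTEM THRUST_DEVICE_SYSTEM_OMP
#include "dg/algorithm.h"

// thrust::device_vector now lives in host memory and the dg::blas1
// functions called on it are parallelized with OpenMP
```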
You could also use your own vector class in the `dg::blas1` functions. This is an advanced feature and requires you to provide a specialization of `dg::TensorTraits` for your class, where you specify the parallelization strategy that the library should choose and how the data is laid out in memory. Please consult the documentation for further details on how we dispatch the blas functions and our template traits system.
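To give a flavor, such a specialization might look like the following sketch. Here `MyVector` is a hypothetical contiguous container of doubles, and the tag names follow the ones described in the documentation; your class may need additional members so that the library can access its data:

```cpp
// Sketch only: a hypothetical contiguous container of doubles
struct MyVector
{
    double* data;
    unsigned size;
};

namespace dg
{
template<>
struct TensorTraits<MyVector>
{
    using value_type       = double;
    using tensor_category  = SharedVectorTag; // data lies in shared memory
    using execution_policy = SerialTag;       // use the serial implementation
};
} // namespace dg
```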
So, what if the vector size is even larger, say 1e8? Then an MPI implementation would be handy, wouldn't it:

The following cell will not compile if you do not have MPI installed. You will need to copy the code below to a file and compile it using the instructions in the quick-start guide.
```cpp
// activate MPI in FELTOR
#include "mpi.h" // needs to be included **before** dg/algorithm.h
#include "dg/algorithm.h"

int main(int argc, char* argv[])
{
    // init MPI
    MPI_Init( &argc, &argv);
    // let's take all processes
    MPI_Comm comm = MPI_COMM_WORLD;
    int np, rank;
    // get the number of MPI processes in the communicator
    MPI_Comm_size( comm, &np);
    // get the rank of the calling process
    MPI_Comm_rank( comm, &rank);
    // allocate and initialize local memory
    thrust::device_vector<double> x_local( 1e8/np, 2), y_local( 1e8/np, 4);
    // combine the local vectors to a global MPI vector
    dg::MPI_Vector<thrust::device_vector<double>> x( x_local, comm);
    dg::MPI_Vector<thrust::device_vector<double>> y( y_local, comm);
    // now repeat the operations from before...
    double a = 0.5, b = 0.25;
    dg::Timer t;
    t.tic();
    dg::blas1::axpby( a, x, b, y);
    t.toc();
    if(rank==0) std::cout << "Axpby took "<<t.diff()<<"s\n";
    t.tic();
    double sum = dg::blas1::dot( x, y);
    t.toc();
    if(rank==0) std::cout << "Dot took "<<t.diff()<<"s\n";
    if(rank==0) std::cout << sum << std::endl;
    // be a good MPI citizen and clean up
    MPI_Finalize();
    return 0;
}
```
We have just written a hybrid MPI + X code, where X can be serial, OpenMP, or CUDA, depending on how the code is compiled!
One remaining annoyance is that we quickly get tired of writing `thrust::device_vector<double>` and especially `dg::MPI_Vector<thrust::device_vector<double>>`. So we invented convenient typedefs:

```cpp
dg::DVec x_local( 1e8/np, 2), y_local( 1e8/np, 4);
dg::x::DVec x( x_local, comm), y( y_local, comm);
```

which is completely equivalent to the corresponding vector declarations in the MPI program above: `dg::x::DVec` resolves to `dg::DVec` in a shared memory compilation and to `dg::MPI_Vector<dg::DVec>` when MPI is activated.
The remarkable thing in the above examples is that two lines of code never changed, no matter which vector class we used:

```cpp
// This is platform independent code
dg::blas1::axpby( a, x, b, y);
double sum = dg::blas1::dot( x,y);
```
This is a first example of platform independent code.
It gives us an abstract way to perform vector operations on many different types and with different parallelizations. If we implement an algorithm only in terms of `dg::blas1` or similar functions, then this algorithm is automatically parallelized for a variety of hardware and works with many different vector types. This is in fact exactly what the dg library does. It provides a set of algorithms, all implemented as templates of (at least) the vector class, using only the `dg::blas1` (and later `dg::blas2`) functions. We will encounter a first example of such an algorithm in the next chapter of the tutorial: timesteppers.
One remaining question in this chapter is: what if we do not want to add vectors but multiply them instead? Or take the exponential of each element? There is a selection of predefined `dg::blas1` operations you can choose from, for example `dg::blas1::pointwiseDot` or `dg::blas1::scal`. Check out the documentation.
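For instance, a quick sketch of elementwise multiplication followed by a scaling (the values are arbitrary):

```cpp
dg::DVec x( 1e6, 2.), y( 1e6, 3.), z( 1e6, 0.);
// z_i = x_i * y_i
dg::blas1::pointwiseDot( x, y, z);
// z_i = 0.5 * z_i
dg::blas1::scal( z, 0.5);
// each element of z is now 3
```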
If this still does not satisfy your needs, the answer probably is `dg::blas1::subroutine`.

`blas` stands for basic linear algebra subroutines; the 1 stands for pure vector operations. In this sense `blas1::pointwiseDot` or `blas1::subroutine` are actually misnomers, since neither function is linear, nor are they part of the original blas library.
As an example, let us assume that we have two vectors `v` and `w` and we want to compute the elementwise expression `v_i = v_i*v_i + a*sin(w_i)`. We can do this with a lambda.
Lambda functions are very useful and will appear continuously throughout this guide, so it is a good idea to refresh your knowledge about them.
A good watch on youtube is for example [Lambdas from Scratch - Arthur O'Dwyer](https://www.youtube.com/watch?v=3jCOwajNch0).
```cpp
double a = 8.;
// capture by value so the lambda also works on a GPU
auto lambda = [=]DG_DEVICE( double& vi, double wi ){
    vi = vi*vi + a*sin(wi);
};
dg::HVec v( 1e6, 2), w( 1e6, 4);
dg::blas1::subroutine( lambda, v, w);
std::cout << v[0] << std::endl;
// -2.05442
```
In the above example we have two host vectors `dg::HVec` `v` and `w`. In a shared memory code these would be declared as `dg::DVec`; in an MPI implementation we would simply write `dg::x::DVec` instead of `dg::DVec`.
`dg::blas1::subroutine` simply calls the given custom functor (in the above case a lambda) for each element in `v` and `w`. There can be an arbitrary number of arguments to the subroutine function (in fact, there have to be exactly as many as the functor/lambda takes). This means that `dg::blas1::subroutine` is able to compute any trivially parallel operation with any number of inputs and outputs. If you do not think this is pure magic, stop here and read again :)
(On a technical note, this neat behaviour is possible through C++11-style template parameter packs.)
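For example, a sketch with two inputs and one in-place output (names and values chosen arbitrarily):

```cpp
dg::HVec u( 1e6, 1.), v( 1e6, 2.), w( 1e6, 3.);
// compute w_i = u_i + v_i*w_i for all i
dg::blas1::subroutine( []DG_DEVICE( double ui, double vi, double& wi){
        wi = ui + vi*wi;
    }, u, v, w);
std::cout << w[0] << std::endl;
// 7
```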
In many practical problems of fluid dynamics we not only want to integrate one variable, `x` say, in time, but several. For example, we may want to integrate both density and velocity. In such a case the `dg::blas1` functions can act on a `std::array` of vectors:
```cpp
double a = 0.5, b = 0.25;
dg::DVec two( 1e3, 2), four( 1e3, 4);
std::array<dg::DVec, 2> x = {two,two}, y = {four,four};
// compute a*x+b*y and store it in y
dg::blas1::axpby( a, x, b, y);
// compute Sum_i x_i y_i
double sum = dg::blas1::dot( x,y);
// output should be large ...
std::cout << sum << std::endl;
// 8000
```
In this example we have an array of two large vectors (not filled with very exciting values, but it shows that the code still works). It is also possible to have a vector or map of vectors:
```cpp
double a = 0.5, b = 0.25;
dg::DVec two( 1e3, 2), four( 1e3, 4);
std::vector<dg::DVec> x( 2, two), y( 2, four);
// compute a*x+b*y and store it in y
dg::blas1::axpby( a, x, b, y);
// compute Sum_i x_i y_i
double sum = dg::blas1::dot( x,y);
// output should be large ...
std::cout << sum << std::endl;
// 8000
```
or with a map:

```cpp
double a = 0.5, b = 0.25;
dg::DVec two( 1e3, 2), four( 1e3, 4);
std::map<std::string, dg::DVec> x {{"density", two}, {"velocity", two}},
                                y {{"density", four}, {"velocity", four}};
// compute a*x+b*y and store it in y
dg::blas1::axpby( a, x, b, y);
// compute Sum_i x_i y_i
double sum = dg::blas1::dot( x,y);
// output should be large ...
std::cout << sum << std::endl;
// 8000
```
Since the implementation works recursively, the "inner" vector can itself be a recursive vector again, that is, something like `std::vector<std::array<dg::DVec,2>>` is also possible.
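As a final sketch, here is such a doubly nested type in action (values arbitrary):

```cpp
dg::DVec two( 1e3, 2), four( 1e3, 4);
std::array<dg::DVec, 2> ax = {two, two}, ay = {four, four};
// a vector of arrays of device vectors
std::vector<std::array<dg::DVec, 2>> x( 3, ax), y( 3, ay);
dg::blas1::axpby( 0.5, x, 0.25, y); // recurses down to the doubles
double sum = dg::blas1::dot( x, y);
std::cout << sum << std::endl;
// 24000
```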