This chapter provides examples of how to implement a multi-GPU linear solver.
The example code enables peer-to-peer access to take advantage of NVLink. You can gauge its effect on performance by turning peer-to-peer access on or off.
Example 1 solves a linear system by Cholesky factorization (potrf and potrs). It allocates the distributed matrix by calling createMat, then generates the matrix in host memory and copies it to distributed device memory via memcpyH2D.
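As a host-side reference for what Example 1 computes, here is a minimal single-threaded C++ sketch. It is not the sample's code and does not use the cusolverMg API. It assumes the 1D Laplacian is the tridiagonal (-1, 2, -1) matrix, which is consistent with |A|_inf = 4 in the sample output; laplacian1d, choleskyLower and choleskySolve are hypothetical helper names that play the roles of the matrix generator, potrf and potrs on a dense column-major matrix.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

namespace potrs_sketch {

using Matrix = std::vector<double>;  // dense n x n, column-major

// 1D Laplacian of order n (assumed form): 2 on the diagonal, -1 next to it.
Matrix laplacian1d(std::size_t n) {
    Matrix A(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        A[i + i * n] = 2.0;
        if (i + 1 < n) {
            A[(i + 1) + i * n] = -1.0;
            A[i + (i + 1) * n] = -1.0;
        }
    }
    return A;
}

// In-place lower Cholesky factorization A = L * L^T (the role of potrf).
// Only the lower triangle is overwritten with L.
// Returns false if a non-positive pivot shows A is not positive definite.
bool choleskyLower(Matrix &A, std::size_t n) {
    assert(A.size() == n * n);
    for (std::size_t j = 0; j < n; ++j) {
        double d = A[j + j * n];
        for (std::size_t k = 0; k < j; ++k) d -= A[j + k * n] * A[j + k * n];
        if (d <= 0.0) return false;
        d = std::sqrt(d);
        A[j + j * n] = d;
        for (std::size_t i = j + 1; i < n; ++i) {
            double s = A[i + j * n];
            for (std::size_t k = 0; k < j; ++k) s -= A[i + k * n] * A[j + k * n];
            A[i + j * n] = s / d;
        }
    }
    return true;
}

// Solve L * L^T * x = b in place for one right-hand side (the role of potrs).
void choleskySolve(const Matrix &L, std::size_t n, std::vector<double> &b) {
    assert(L.size() == n * n && b.size() == n);
    for (std::size_t i = 0; i < n; ++i) {  // forward substitution: L y = b
        double s = b[i];
        for (std::size_t k = 0; k < i; ++k) s -= L[i + k * n] * b[k];
        b[i] = s / L[i + i * n];
    }
    for (std::size_t i = n; i-- > 0;) {    // back substitution: L^T x = y
        double s = b[i];
        for (std::size_t k = i + 1; k < n; ++k) s -= L[k + i * n] * b[k];
        b[i] = s / L[i + i * n];
    }
}

}  // namespace potrs_sketch
```

With n = 8 and b = A * ones(n), calling choleskyLower on a copy of A and then choleskySolve on b should recover a vector of ones up to rounding, mirroring the X = ones(N,NRHS) reference solution the example uses.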
Example 2 solves a linear system using the inverse of a Hermitian positive definite matrix (potrf and potri). It allocates the distributed matrix by calling createMat, then generates the matrix in host memory and copies it to distributed device memory via memcpyH2D.
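The inverse-based approach of Example 2 can be sketched on the host in the same spirit. This is an illustrative assumption, not the sample's code: it handles the real symmetric case only, uses the identity inv(A) = inv(L)^T * inv(L) for A = L * L^T, and the names spdInverse and applyInverse are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

namespace potri_sketch {

using Matrix = std::vector<double>;  // dense n x n, column-major

// Inverse of a real symmetric positive definite matrix through its Cholesky
// factor: A = L * L^T, hence inv(A) = inv(L)^T * inv(L).
// Returns an empty matrix if a non-positive pivot is met.
Matrix spdInverse(const Matrix &A, std::size_t n) {
    assert(A.size() == n * n);
    Matrix L(n * n, 0.0);
    for (std::size_t j = 0; j < n; ++j) {  // factorization (role of potrf)
        double d = A[j + j * n];
        for (std::size_t k = 0; k < j; ++k) d -= L[j + k * n] * L[j + k * n];
        if (d <= 0.0) return Matrix();
        L[j + j * n] = std::sqrt(d);
        for (std::size_t i = j + 1; i < n; ++i) {
            double s = A[i + j * n];
            for (std::size_t k = 0; k < j; ++k) s -= L[i + k * n] * L[j + k * n];
            L[i + j * n] = s / L[j + j * n];
        }
    }
    Matrix M(n * n, 0.0);  // M = inv(L), lower triangular, one column at a time
    for (std::size_t c = 0; c < n; ++c) {
        for (std::size_t i = c; i < n; ++i) {
            double s = (i == c) ? 1.0 : 0.0;
            for (std::size_t k = c; k < i; ++k) s -= L[i + k * n] * M[k + c * n];
            M[i + c * n] = s / L[i + i * n];
        }
    }
    Matrix Ainv(n * n, 0.0);  // inv(A) = M^T * M (role of potri)
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < n; ++i) {
            double s = 0.0;
            for (std::size_t k = std::max(i, j); k < n; ++k)
                s += M[k + i * n] * M[k + j * n];
            Ainv[i + j * n] = s;
        }
    }
    return Ainv;
}

// y = Ainv * b for one right-hand side (the "B := inv(A) * B" step).
std::vector<double> applyInverse(const Matrix &Ainv, std::size_t n,
                                 const std::vector<double> &b) {
    assert(Ainv.size() == n * n && b.size() == n);
    std::vector<double> y(n, 0.0);
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i) y[i] += Ainv[i + j * n] * b[j];
    return y;
}

}  // namespace potri_sketch
```

Forming the explicit inverse costs more than a factor-and-solve and is usually less accurate, which is consistent with the slightly larger errors reported in the Example 2 output compared with Example 1.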
Supported GPUs: all GPUs supported by CUDA Toolkit (https://developer.nvidia.com/cuda-gpus)
Supported operating systems: Linux, Windows
Supported CPU architectures: x86_64, ppc64le, arm64-sbsa
cuSOLVER APIs involved:
- cusolverMgPotrf_bufferSize API
- cusolverMgPotrs_bufferSize API
- cusolverMgPotri_bufferSize API
- cusolverMgPotrf API
- cusolverMgPotrs API
- cusolverMgPotri API
- cusolverMgCreateDeviceGrid API
- cusolverMgDeviceSelect API
Prerequisites:
- A Linux or Windows system with recent NVIDIA drivers.
- CMake 3.18 or later.
- CUDA Toolkit 10.2 or later.
Building on Linux:
$ mkdir build
$ cd build
$ cmake .. # -DSHOW_FORMAT=ON
$ make
Make sure that CMake finds the expected CUDA Toolkit. If it does not, add the argument -DCMAKE_CUDA_COMPILER=/path/to/cuda/bin/nvcc to the cmake command.
Building on Windows:
$ mkdir build
$ cd build
$ cmake -DCMAKE_GENERATOR_PLATFORM=x64 ..
Then open the cusolver_examples.sln project in Visual Studio and build it.
$ ./cusolver_MgPotrf_example1
Sample output with 1 GPU:
Test 1D Laplacian of order 8
Step 1: Create Mg handle and select devices
There are 1 GPUs
Device 0, NVIDIA TITAN RTX, cc 7.5
Step 2: Enable peer access
Step 3: Allocate host memory A
Step 4: Prepare 1D Laplacian for A and X = ones(N,NRHS)
Step 5: Create RHS for reference solution on host B = A*X
Step 6: Create matrix descriptors for A and D
Step 7: Allocate distributed matrices A and B
Step 8: Prepare data on devices
Step 9: Allocate workspace space
Allocate device workspace, lwork = 1064960
Step 10: Solve A*X = B by POTRF and POTRS
Step 11: Solution vector B
Step 12: Measure residual error |b - A*x|
errors for X[:,1]
|b - A*x|_inf = 2.220446E-16
|x|_inf = 1.000000E+00
|b|_inf = 1.000000E+00
|A|_inf = 4.000000E+00
|b - A*x|/(|A|*|x|+|b|) = 4.440892E-17
errors for X[:,2]
|b - A*x|_inf = 2.220446E-16
|x|_inf = 1.000000E+00
|b|_inf = 1.000000E+00
|A|_inf = 4.000000E+00
|b - A*x|/(|A|*|x|+|b|) = 4.440892E-17
Step 12: Free resources
Sample output with 2 GPUs:
Test 1D Laplacian of order 8
Step 1: Create Mg handle and select devices
There are 2 GPUs
Device 0, NVIDIA TITAN RTX, cc 7.5
Device 1, NVIDIA TITAN RTX, cc 7.5
Step 2: Enable peer access
Enable peer access from gpu 0 to gpu 1
Enable peer access from gpu 1 to gpu 0
Step 3: Allocate host memory A
Step 4: Prepare 1D Laplacian for A and X = ones(N,NRHS)
Step 5: Create RHS for reference solution on host B = A*X
Step 6: Create matrix descriptors for A and D
Step 7: Allocate distributed matrices A and B
Step 8: Prepare data on devices
Step 9: Allocate workspace space
Allocate device workspace, lwork = 1064960
Step 10: Solve A*X = B by POTRF and POTRS
Step 11: Solution vector B
Step 12: Measure residual error |b - A*x|
errors for X[:,1]
|b - A*x|_inf = 2.220446E-16
|x|_inf = 1.000000E+00
|b|_inf = 1.000000E+00
|A|_inf = 4.000000E+00
|b - A*x|/(|A|*|x|+|b|) = 4.440892E-17
errors for X[:,2]
|b - A*x|_inf = 2.220446E-16
|x|_inf = 1.000000E+00
|b|_inf = 1.000000E+00
|A|_inf = 4.000000E+00
|b - A*x|/(|A|*|x|+|b|) = 4.440892E-17
Step 12: Free resources
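The last line of each error block above is a scaled residual. Plugging in the printed Example 1 values gives 2.220446E-16 / (4 * 1 + 1) = 4.440892E-17. A minimal host-side sketch of that metric follows. It assumes |A|_inf is the induced infinity norm (maximum absolute row sum, which yields 4 for a tridiagonal (-1, 2, -1) Laplacian), and the helper names are hypothetical rather than taken from the sample.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

namespace residual_sketch {

// Largest absolute entry of a vector.
double vecNormInf(const std::vector<double> &v) {
    double m = 0.0;
    for (double e : v) m = std::max(m, std::fabs(e));
    return m;
}

// Induced infinity norm of a dense column-major n x n matrix:
// the largest sum of absolute values along any row.
double matNormInf(const std::vector<double> &A, std::size_t n) {
    assert(A.size() == n * n);
    double m = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double rowSum = 0.0;
        for (std::size_t j = 0; j < n; ++j) rowSum += std::fabs(A[i + j * n]);
        m = std::max(m, rowSum);
    }
    return m;
}

// |b - A*x|_inf / (|A|_inf * |x|_inf + |b|_inf) for one right-hand side.
double scaledResidual(const std::vector<double> &A, std::size_t n,
                      const std::vector<double> &x,
                      const std::vector<double> &b) {
    assert(A.size() == n * n && x.size() == n && b.size() == n);
    std::vector<double> r(b);  // r = b - A*x, accumulated column by column
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i) r[i] -= A[i + j * n] * x[j];
    return vecNormInf(r) / (matNormInf(A, n) * vecNormInf(x) + vecNormInf(b));
}

}  // namespace residual_sketch
```

Dividing by |A|*|x| + |b| makes the figure independent of how A and b are scaled, so a value near double-precision machine epsilon (about 2.2E-16) or below indicates the solve is as accurate as the arithmetic allows.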
$ ./cusolver_MgPotrf_example2
Sample output with 1 GPU:
Test 1D Laplacian of order 8
Step 1: Create Mg handle and select devices
There are 1 GPUs
Device 0, NVIDIA TITAN RTX, cc 7.5
Step 2: Enable peer access
Step 3: Allocate host memory A
Step 4: Prepare 1D Laplacian for A and Xref = ones(N,NRHS)
Step 5: Create RHS for reference solution on host B = A*X
Step 6: Create matrix descriptors for A and D
Step 7: Allocate distributed matrices A and B
Step 8: Prepare data on devices
Step 9: Allocate workspace space
Allocate device workspace, lwork = 1067008
Step 10: Solve A*X = B by POTRF and POTRI
Step 11: Gather INV(A) from devices to host
Step 12: Solve linear system B := inv(A) * B
Step 13: Measure residual error |Xref - Xans|
errors for X[:,1]
|b - A*x|_inf = 4.440892E-16
|Xref|_inf = 1.000000E+00
|Xans|_inf = 1.000000E+00
|A|_inf = 4.000000E+00
|b - A*x|/(|A|*|x|+|b|) = 8.881784E-17
errors for X[:,2]
|b - A*x|_inf = 4.440892E-16
|Xref|_inf = 1.000000E+00
|Xans|_inf = 1.000000E+00
|A|_inf = 4.000000E+00
|b - A*x|/(|A|*|x|+|b|) = 8.881784E-17
Step 14: Free resources
Sample output with 2 GPUs:
Test 1D Laplacian of order 8
Step 1: Create Mg handle and select devices
There are 2 GPUs
Device 0, NVIDIA TITAN RTX, cc 7.5
Device 1, NVIDIA TITAN RTX, cc 7.5
Step 2: Enable peer access
Enable peer access from gpu 0 to gpu 1
Enable peer access from gpu 1 to gpu 0
Step 3: Allocate host memory A
Step 4: Prepare 1D Laplacian for A and Xref = ones(N,NRHS)
Step 5: Create RHS for reference solution on host B = A*X
Step 6: Create matrix descriptors for A and D
Step 7: Allocate distributed matrices A and B
Step 8: Prepare data on devices
Step 9: Allocate workspace space
Allocate device workspace, lwork = 1067008
Step 10: Solve A*X = B by POTRF and POTRI
Step 11: Gather INV(A) from devices to host
Step 12: Solve linear system B := inv(A) * B
Step 13: Measure residual error |Xref - Xans|
errors for X[:,1]
|b - A*x|_inf = 4.440892E-16
|Xref|_inf = 1.000000E+00
|Xans|_inf = 1.000000E+00
|A|_inf = 4.000000E+00
|b - A*x|/(|A|*|x|+|b|) = 8.881784E-17
errors for X[:,2]
|b - A*x|_inf = 4.440892E-16
|Xref|_inf = 1.000000E+00
|Xans|_inf = 1.000000E+00
|A|_inf = 4.000000E+00
|b - A*x|/(|A|*|x|+|b|) = 8.881784E-17
Step 14: Free resources