
Matrix-matrix solver on GPU #1381

Closed
blegouix opened this issue Aug 4, 2023 · 10 comments
@blegouix

blegouix commented Aug 4, 2023

Hello,

I want to implement a matrix-matrix solver AX = B, with A sparse and X and B dense matrices. To do so, I tried two strategies:

  • On the develop branch, use the classes Csr<>, Dense<> and Bicgstab<> with a GPU executor.
  • On the batch-develop branch, use the classes BatchCsr<>, BatchDense<> and BatchBicgstab<>. In this case, I first duplicate the matrix A using the BatchCsr constructor that supports duplication.

I obtain the following behaviours:

[Image: batch_compare_ginkgo — profiler comparison of the two strategies]

The first solution is not satisfying because the columns of B and X are processed sequentially in n_batch one-block kernels (no parallelization across the individual problems). The second solution is not satisfying either, because it involves a huge device-to-device memory copy and fills memory with duplicated data.

Is there already a way to do this? Are the developments around the new MultiVector class relevant to this problem, and should we expect a solver anytime soon?

Is there a known workaround using a for loop or Kokkos::parallel_for?

Regards,

@MarcelKoch MarcelKoch self-assigned this Aug 4, 2023
@tcojean
Member

tcojean commented Aug 4, 2023

Hi Baptiste,

Thanks for trying Ginkgo and for your report! To get a bit more info, what is the size of your A matrix in this case?

The MultiVector is simply a rewrite of the current batch code (which currently uses Dense). We are working on merging the entire batch functionality into our code.

@MarcelKoch
Member

Hi @blegouix, could you also specify how you are executing strategy 1? Our solvers can handle dense matrices as the right-hand side, so you can just call solver->apply(...) once.
But the batched solve might still be more efficient for a small A.

@blegouix
Author

blegouix commented Aug 4, 2023

A is 10x10; B and X are 10x1000.

Here is the code for strategy 1:

```cpp
auto b_vec_batch = gko::matrix::Dense<>::create(gko_device_exec, gko::dim<2>{n, n_batch});
b_vec_batch->fill(1);
// mat_ptr, cols and rows are custom double* and int* arrays
auto A_mat = gko::matrix::Csr<>::create(
    gko_device_exec, gko::dim<2>{n, n},
    gko::array<double>::view(gko_device_exec, n * n, mat_ptr),
    gko::array<int>::view(gko_device_exec, n * n, cols),
    gko::array<int>::view(gko_device_exec, n + 1, rows));
auto x_vec_batch = gko::matrix::Dense<>::create(gko_device_exec, gko::dim<2>{n, n_batch});

auto solver =
    gko::solver::Bicgstab<>::build()
        .with_preconditioner(gko::preconditioner::Jacobi<>::build().on(gko_device_exec))
        .with_criteria(
            gko::stop::Iteration::build().with_max_iters(20u).on(gko_device_exec),
            gko::stop::ResidualNorm<>::build()
                .with_reduction_factor(1e-15)
                .on(gko_device_exec))
        .on(gko_device_exec);
solver->generate(A_mat)->apply(b_vec_batch, x_vec_batch);
```

1000 individual kernels are launched sequentially:

[Image: profiler trace showing the sequential kernel launches]

@pratikvn
Member

pratikvn commented Aug 4, 2023

I think strategy 1 is the right choice for this problem, as you have only one matrix. I assume n=10 and n_batch=1000 in the code above? And you are calling solver->generate(A_mat)->apply(...) only once, not within a loop?

If you want to solve it with strategy 2, then you will have to duplicate the matrix, because the batched solvers in batch-develop currently do not support multiple right-hand sides. It is not clear whether we will support that in the future either. Duplicating the matrices involves deep copies and allocations, so unfortunately there is not much to be done there efficiency-wise; we therefore do not recommend that approach for performance-sensitive applications.

@blegouix
Author

blegouix commented Aug 4, 2023

Yes, I call it only once, as in the provided code, with n=10 and n_batch=1000. Is that not the expected behavior?

@pratikvn
Member

pratikvn commented Aug 4, 2023

Yes, that is the expected behaviour. It is a bit odd that you are seeing 1000 separate kernel calls. Our monolithic solvers support multiple RHS, so there should be only one solver call. There may be multiple kernel calls inside that single solver call (dot, norm, axpy, spmv, etc.), and these kernels are launched every iteration; so if you have, for example, 50 iterations with CG, you might see something around 7*50 kernel calls.

Essentially, these solver operations are performed for all right-hand sides in parallel.

@blegouix
Author

blegouix commented Aug 4, 2023

Ok great, so I have to figure out what is causing that. Could it be the solver/preconditioner choice? I would be happy to hear your ideas.

@MarcelKoch
Member

You could add a Convergence logger to your solver to see how many iterations you are actually doing. If you need 1000 iterations for a 10x10 matrix, something is wrong. Perhaps your matrix is actually singular and the solver is not converging at all.

@blegouix
Author

blegouix commented Aug 4, 2023

I will add the Convergence logger, but I can already confirm that the solver converges (I tried with A = Identity / B = ones and similar simple matrices and obtained the right result; I also checked the error norm(B - AX)).

And I confirm that n_batch individual kernels are launched sequentially (1000 is not the number of iterations). I tried with n_batch=10 and n_batch=100 and counted them; it is not the iteration count of the solver.

@blegouix
Author

blegouix commented Aug 4, 2023

Ok, this is related to the 10x10 matrix size and the use of a block-Jacobi preconditioner (which is obviously not the right choice here). What I see is not the actual Bicgstab iterations but something related to the preconditioning: since 10 is smaller than 32, block-Jacobi is enough to invert the matrix by itself and solves the systems directly.

Great, thanks!
