
Matrix-matrix solver on GPU #1381

Closed
blegouix opened this issue Aug 4, 2023 · 10 comments
@blegouix

blegouix commented Aug 4, 2023

Hello,

I want to implement a matrix-matrix solver AX = B, with A sparse and X and B dense matrices. To do so, I tried two strategies:

  • On the develop branch, use the classes Csr<>, Dense<> and Bicgstab<> with a GPU executor.
  • On the batch-develop branch, use the classes BatchCsr<>, BatchDense<> and BatchBicgstab<>. In this case, I first duplicate the matrix A using the BatchCsr constructor that supports duplication.

I obtain the following behaviours:

[Image: batch_compare_ginkgo — profiler comparison of the two strategies]

The first solution is not satisfying because the columns of B and X are processed sequentially in n_batch one-block kernels (no parallelization across the individual problems). The second solution is not satisfying either, because it involves a huge device-to-device memory copy and fills memory with duplicated data.

Is there already a way to do this? Are the developments around the new MultiVector class relevant to this problem, and should we expect a solver anytime soon?

Is there a known workaround using a for loop or Kokkos::parallel_for?

Regards,

@MarcelKoch MarcelKoch self-assigned this Aug 4, 2023
@tcojean
Member

tcojean commented Aug 4, 2023

Hi Baptiste,

Thanks for trying Ginkgo and for your report! To get a bit more info, what is the size of your A matrix in this case?

The MultiVector is simply a rewrite of the current batch code (which currently uses Dense). We are working on merging the entire batch functionality into our code.

@MarcelKoch
Member

Hi @blegouix, could you also specify how you are executing strategy 1? Our solvers can handle dense matrices as the right-hand side, so you can just call solver->apply(...) once.
But the batched solve might still be more efficient for a small A.

@blegouix
Author

blegouix commented Aug 4, 2023

A is 10x10; B and X are 10x1000.

Here is the code for strategy 1:

```cpp
auto b_vec_batch = gko::matrix::Dense<>::create(gko_device_exec, gko::dim<2>{n, n_batch});
b_vec_batch->fill(1);
// mat_ptr, cols and rows are custom double* and int* arrays
auto A_mat = gko::matrix::Csr<>::create(
    gko_device_exec, gko::dim<2>{n, n},
    gko::array<double>::view(gko_device_exec, n * n, mat_ptr),
    gko::array<int>::view(gko_device_exec, n * n, cols),
    gko::array<int>::view(gko_device_exec, n + 1, rows));
auto x_vec_batch = gko::matrix::Dense<>::create(gko_device_exec, gko::dim<2>{n, n_batch});

auto solver =
    gko::solver::Bicgstab<>::build()
        .with_preconditioner(gko::preconditioner::Jacobi<>::build().on(gko_device_exec))
        .with_criteria(
            gko::stop::Iteration::build().with_max_iters(20u).on(gko_device_exec),
            gko::stop::ResidualNorm<>::build()
                .with_reduction_factor(1e-15)
                .on(gko_device_exec))
        .on(gko_device_exec);
solver->generate(A_mat)->apply(b_vec_batch, x_vec_batch);
```

1000 individual kernels are launched sequentially:

[Image: profiler trace showing the sequential kernel launches]

@pratikvn
Member

pratikvn commented Aug 4, 2023

I think strategy 1 is the right choice for this problem, as you have only one matrix. I assume n=10 and n_batch=1000 in the code above? And you are calling solver->generate(A_mat)->apply(...) only once, not within a loop?

If you want to solve it with strategy 2, then you will have to duplicate the matrix, because the batched solvers in batch-develop currently do not support multiple right-hand sides. It is not clear whether we will support that in the future either. Duplicating the matrices involves deep copies and allocations, so unfortunately there is not much to be done there efficiency-wise; we therefore do not recommend that approach for performance-sensitive applications.

@blegouix
Author

blegouix commented Aug 4, 2023

Yes, I call it only once, as in the provided code, with n=10 and n_batch=1000. Is that not the expected behavior?

@pratikvn
Member

pratikvn commented Aug 4, 2023

Yes, that is the expected behaviour. It is a bit odd that you are seeing 1000 separate kernel calls. Our monolithic solvers support multiple RHS, so there should be only one solver call. There may be multiple kernel calls inside that single solver call (dot, norm, axpy, spmv, etc.), and these kernels are launched every iteration; so if you have, for example, 50 iterations with CG, you might see something around 7*50 kernel calls.

Essentially, these solver operations are performed for all right-hand sides in parallel.

@blegouix
Author

blegouix commented Aug 4, 2023

Ok great, so I have to figure out what is causing that. Could it be the solver/preconditioner choice? I would be happy to hear your ideas.

@MarcelKoch
Member

You could add a Convergence logger to your solver to see how many iterations you are actually doing. If you need 1000 iterations for a 10x10 matrix, something is wrong. Perhaps your matrix is actually singular and the solver is not converging at all.

@blegouix
Author

blegouix commented Aug 4, 2023

I will add the Convergence logger, but I can already confirm that the solver converges (I tried with A = Identity / B = ones and similar simple matrices and obtained the right result; I also checked the error norm(B - AX)).

And I confirm that n_batch individual kernels are launched sequentially (1000 is not the number of iterations). I tried with n_batch=10 and n_batch=100 and counted them; it is not the iteration count of the solver.

@blegouix
Author

blegouix commented Aug 4, 2023

Ok, this is related to the 10x10 matrix size and the use of a block-Jacobi preconditioner (which is obviously not the right choice here). What I see is not the actual Bicgstab iterations but something related to the preconditioning: since 10 is smaller than 32, block-Jacobi is enough to invert the matrix by itself and solves the systems directly.

Great, thanks!
