Failing Tempus adjoint tests #1008

Open
ikalash opened this issue Nov 8, 2023 · 23 comments

Comments

@ikalash
Collaborator

ikalash commented Nov 8, 2023

The demoPDEs tests that use adjoints from Tempus started failing yesterday 11/7:

demoPDEs_Advection1D_Scalar_Param_Adjoint_Sens_Explicit
demoPDEs_Advection1D_with_Source_Dist_Param_Adjoint_Sens_Explicit_ConsistentM
demoPDEs_Thermal1D_with_Source_Dist_Param_Adjoint_Sens_Explicit

https://sems-cdash-son.sandia.gov/cdash/viewTest.php?onlyfailed&buildid=54052

It looks like an Amesos2 KLU2 error occurs after the time integration is complete, apparently because the solver is given a malformed matrix:

p=0: *** Caught standard std::exception of type 'std::runtime_error' :

 /projects/albany/nightlyAlbanyCDash/repos/Trilinos/packages/amesos2/src/Amesos2_KLU2_def.hpp:222:
 
 Throw number = 1
 
 Throw test that evaluated to true: info > 0
 
 KLU2 numeric factorization failed

p=3: *** Caught standard std::exception of type 'std::runtime_error' :

 /projects/albany/nightlyAlbanyCDash/repos/Trilinos/packages/amesos2/src/Amesos2_KLU2_def.hpp:222:
 
 Throw number = 1
 
 Throw test that evaluated to true: info > 0
 
 KLU2 numeric factorization failed

p=1: *** Caught standard std::exception of type 'std::runtime_error' :

 /projects/albany/nightlyAlbanyCDash/repos/Trilinos/packages/amesos2/src/Amesos2_KLU2_def.hpp:222:
 
 Throw number = 1
 
 Throw test that evaluated to true: info > 0
 
 KLU2 numeric factorization failed

p=2: *** Caught standard std::exception of type 'std::runtime_error' :

 /projects/albany/nightlyAlbanyCDash/repos/Trilinos/packages/amesos2/src/Amesos2_KLU2_def.hpp:222:
 
 Throw number = 1
 
 Throw test that evaluated to true: info > 0

I am wondering if this is related to recent changes to Tempus. Tagging @ccober6 who might have ideas about this theory.

I will investigate further.

@mperego
Collaborator

mperego commented Nov 8, 2023

@ikalash I don't think it's Tempus changes, as they were only touching BDF tests.
@cgcgcg could it be the changes in MueLu you pushed two days ago? In these tests we are using MueLu (Stratimikos) default options.

@cgcgcg

cgcgcg commented Nov 8, 2023

@mperego Yes. I'm trying to track down what exactly happened. EMPIRE is seeing the same issue. I get different results depending on where the factorization is called...

@mperego
Collaborator

mperego commented Nov 8, 2023

OK. Thanks for looking into that.

@ikalash
Collaborator Author

ikalash commented Nov 9, 2023

Yes, thanks @cgcgcg ! If you know what the problem is and are working on it, I will not test this further but will wait for your fix.

@cgcgcg

cgcgcg commented Nov 9, 2023

Could you pull Trilinos develop and check that it works again?

@mperego
Collaborator

mperego commented Nov 9, 2023

Our nightly tests are based on Trilinos develop. So if it's OK to wait, we'll know tomorrow morning whether the problem has been fixed.

@ikalash
Collaborator Author

ikalash commented Nov 9, 2023

@cgcgcg : I can test it today. It's easy enough to do. Please stay tuned.

@ikalash
Collaborator Author

ikalash commented Nov 9, 2023

I have verified that the tests pass now with a new develop Trilinos. Thanks @cgcgcg ! I will close this issue tomorrow once our CDash is clean.

@cgcgcg

cgcgcg commented Nov 9, 2023

Nice! For now I just reverted the offending commit. We will try to get this change in again at a later date once we understand what went wrong.

ikalash closed this as completed Nov 10, 2023

@cgcgcg

cgcgcg commented Nov 11, 2023

With help from @mperego I was able to build Albany and run demoPDEs_Advection1D_Scalar_Param_Adjoint_Sens_Explicit.

Here is what caused the failure:

  • MueLu had switched the construction of the coarse grid solver (Amesos2 KLU2) from the first solve to the preconditioner setup.
  • The test sets up a MueLu preconditioned solver for a singular matrix. It seems no solves are performed using that preconditioner.

I printed a stacktrace from the point where the factorization fails:

  *******************************************************
  ***** Belos Iterative Solver:  Block Gmres 
  ***** Maximum Iterations: 3
  ***** Block Size: 1
  ***** Residual Test: 
  *****   Test 1 : Belos::StatusTestImpResNorm<>: (2-Norm Res Vec) / (2-Norm Prec Res0), tol = 0.01
  *******************************************************
  Iter 0, [ 1] :    1.000000e+00
  Iter 1, [ 1] :    1.181981e-16
5000  1.000e+00  2.000e-04  0.000e+00  0.000e+00  1.0    0      1.477e-03  
STKDiscretization::writeSolution: writing time 1.000e+00 to index 501 in file advection1D_scalar_param_adjoint_sens_explicit_out.exo
Time = 1.000e+00
         Response[0] = -4.66293670e-16 
         Response[1] = 3.57106495e+00  
============================================================================
  Total runtime = 8.00135953e+00 sec = 1.33355992e-01 min
Fri Nov 10 19:22:59 2023
Time integration complete.

 Traceback (most recent call last):
   File unknown, in _start()
   File unknown, in __libc_start_main()
   File unknown, in main()
   File unknown, in void Piro::Detail::PerformSolveImpl<double, Thyra::VectorBase<double> const, Thyra::MultiVectorBase<double> const>(Thyra::ModelEvaluator<double> const&, Teuchos::ParameterList&, Teuchos::Array<Teuchos::RCP<Thyra::VectorBase<double> const> >&, Teuchos::Array<Teuchos::Array<Teuchos::RCP<Thyra::MultiVectorBase<double> const> > >&)
   File unknown, in void Piro::Detail::PerformSolveImpl<double, Thyra::VectorBase<double> const, Thyra::MultiVectorBase<double> const>(Thyra::ModelEvaluator<double> const&, Teuchos::Array<bool> const&, bool, Teuchos::Array<Teuchos::RCP<Thyra::VectorBase<double> const> >&, Teuchos::Array<Teuchos::Array<Teuchos::RCP<Thyra::MultiVectorBase<double> const> > >&, Teuchos::Array<Teuchos::RCP<Thyra::MultiVectorBase<double> const> >&, Teuchos::Array<Teuchos::Array<Teuchos::RCP<Thyra::MultiVectorBase<double> const> > >&, Teuchos::RCP<Piro::SolutionObserverBase<double, Thyra::VectorBase<double> const> >)
   File unknown, in Thyra::ModelEvaluatorDefaultBase<double>::evalModel(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const
   File unknown, in Piro::TempusSolver<double>::evalModelImpl(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const
   File unknown, in Tempus::IntegratorAdjointSensitivity<double>::advanceTime(double)
   File unknown, in Piro::InvertMassMatrixDecorator<double>::create_W() const
   File unknown, in void Thyra::initializeOp<double>(Thyra::LinearOpWithSolveFactoryBase<double> const&, Teuchos::RCP<Thyra::LinearOpBase<double> const> const&, Teuchos::Ptr<Thyra::LinearOpWithSolveBase<double> > const&, Thyra::ESupportSolveUse)
   File unknown, in Thyra::BelosLinearOpWithSolveFactory<double>::initializeOp(Teuchos::RCP<Thyra::LinearOpSourceBase<double> const> const&, Thyra::LinearOpWithSolveBase<double>*, Thyra::ESupportSolveUse) const
   File unknown, in Thyra::BelosLinearOpWithSolveFactory<double>::initializeOpImpl(Teuchos::RCP<Thyra::LinearOpSourceBase<double> const> const&, Teuchos::RCP<Thyra::LinearOpSourceBase<double> const> const&, Teuchos::RCP<Thyra::PreconditionerBase<double> const> const&, bool, Thyra::LinearOpWithSolveBase<double>*, Thyra::ESupportSolveUse) const
   File unknown, in Thyra::MueLuPreconditionerFactory<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::initializePrec(Teuchos::RCP<Thyra::LinearOpSourceBase<double> const> const&, Thyra::PreconditionerBase<double>*, Thyra::ESupportSolveUse) const
   File unknown, in Teuchos::RCP<MueLu::Hierarchy<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > > MueLu::CreateXpetraPreconditioner<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >(Teuchos::RCP<Xpetra::Matrix<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > >, Teuchos::ParameterList const&)
   File unknown, in MueLu::HierarchyManager<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::SetupHierarchy(MueLu::Hierarchy<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >&) const
   File unknown, in MueLu::Hierarchy<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::Setup(int, Teuchos::RCP<MueLu::FactoryManagerBase const>, Teuchos::RCP<MueLu::FactoryManagerBase const>, Teuchos::RCP<MueLu::FactoryManagerBase const>)
   File unknown, in MueLu::TopSmootherFactory<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::Build(MueLu::Level&) const
   File unknown, in Teuchos::RCP<MueLu::SmootherBase<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > >& MueLu::Level::Get<Teuchos::RCP<MueLu::SmootherBase<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > > >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, MueLu::FactoryBase const*)
   File unknown, in MueLu::SingleLevelFactoryBase::CallBuild(MueLu::Level&) const
   File unknown, in MueLu::SmootherFactory<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::BuildSmoother(MueLu::Level&, MueLu::PreOrPost) const
   File unknown, in MueLu::DirectSolver<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::Setup(MueLu::Level&)
   File unknown, in MueLu::Amesos2Smoother<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::Setup(MueLu::Level&)

The matrix is 25x25 and has 75 entries, all of which are identically zero.
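
A quick way to confirm this kind of finding is to compare the number of stored entries with the number of numerically nonzero ones before handing the matrix to a direct solver. The sketch below is only illustrative: `CsrMatrix`, `describe`, and `zero_rows` are hypothetical helpers, not Albany or Trilinos APIs, and the sparsity pattern (three stored entries per row, which happens to give 75 entries for 25 rows) is an assumption, since the thread does not state the actual pattern.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CsrMatrix:
    """Minimal CSR container (hypothetical, for illustration only)."""
    n_rows: int
    n_cols: int
    row_ptr: List[int]
    col_idx: List[int]
    values: List[float]


def describe(m: CsrMatrix) -> str:
    """Summarize dimensions, stored entries, and numerically nonzero entries."""
    stored = len(m.values)
    nonzero = sum(1 for v in m.values if v != 0.0)
    return f"{m.n_rows}x{m.n_cols}, {stored} stored entries, {nonzero} nonzero"


def zero_rows(m: CsrMatrix) -> List[int]:
    """Return indices of rows whose stored values are all zero (or absent)."""
    return [
        i for i in range(m.n_rows)
        if all(v == 0.0 for v in m.values[m.row_ptr[i]:m.row_ptr[i + 1]])
    ]


def make_zero_matrix(n: int = 25) -> CsrMatrix:
    """Build an n x n matrix with 3 stored zeros per row (assumed pattern)."""
    col_idx: List[int] = []
    for i in range(n):
        col_idx.extend(sorted({(i - 1) % n, i, (i + 1) % n}))
    return CsrMatrix(n, n, [3 * i for i in range(n + 1)], col_idx,
                     [0.0] * (3 * n))


if __name__ == "__main__":
    m = make_zero_matrix()
    print(describe(m))
    print("rows with only zero entries:", len(zero_rows(m)))
```

Any row reported by `zero_rows` guarantees a singular matrix, so a numeric factorization of such a matrix is expected to fail regardless of the solver.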

@ikalash
Collaborator Author

ikalash commented Nov 11, 2023

Thanks for digging into this @cgcgcg . I think it makes sense to reopen the issue - do you agree? Unless we want to open a separate Trilinos one.

@cgcgcg

cgcgcg commented Nov 11, 2023

Sure, let's reopen.

ikalash reopened this Nov 11, 2023

@mperego
Collaborator

mperego commented Nov 16, 2023

We need to understand why these tests set up a MueLu preconditioner for a singular matrix and then do not use the preconditioner. @ikalash do you have time to look into it?

@ikalash
Collaborator Author

ikalash commented Nov 17, 2023

> We need to understand why these tests set up a MueLu preconditioner for a singular matrix and then do not use the preconditioner. @ikalash do you have time to look into it?

Perhaps I misunderstood what @cgcgcg wrote, but it seems that it is the matrix at the coarsest grid level that is singular. Is that right? If so, would that suggest that there is something wrong with the matrix problem being solved using the AMG?

@cgcgcg

cgcgcg commented Nov 17, 2023

Sorry, I should have explained better. The problem is so small that this is a one-level method. The matrix is supplied by Albany.

@ikalash
Collaborator Author

ikalash commented Nov 17, 2023

> Sorry, I should have explained better. The problem is so small that this is a one-level method. The matrix is supplied by Albany.

That's interesting. How was it working before? Was it because an iterative solve rather than a direct solve was done? Is there a branch/fork of Trilinos I can use to see the singularity / failure?

@cgcgcg

cgcgcg commented Nov 17, 2023

The failure was triggered by MueLu switching the factorization of the coarse grid from first solve to setup. So it seems like Albany is constructing the preconditioner, but then doesn't use it to solve a system. I can provide a patch against Trilinos tomorrow morning that triggers the behavior.
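
The difference between factoring at first solve and factoring at setup can be sketched in a few lines. This is a toy model under stated assumptions, not MueLu's or Amesos2's implementation: `DirectSolver`, `lu_factor`, `lu_solve`, and `SingularMatrixError` are hypothetical names, and a dense LU with partial pivoting stands in for KLU2. It shows why a singular matrix went unnoticed as long as no solve was ever requested.

```python
class SingularMatrixError(RuntimeError):
    """Raised when the factorization meets a column with no usable pivot."""


def lu_factor(a):
    """Dense LU with partial pivoting; returns (lu, perm) or raises."""
    n = len(a)
    lu = [row[:] for row in a]
    perm = list(range(n))
    for k in range(n):
        # Choose the largest-magnitude pivot in column k.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise SingularMatrixError(f"zero pivot in column {k}")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm


def lu_solve(lu, perm, b):
    """Solve A x = b using the factors from lu_factor."""
    n = len(lu)
    y = [b[perm[i]] for i in range(n)]
    for i in range(n):                 # forward substitution (unit lower)
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    x = y[:]
    for i in reversed(range(n)):       # back substitution (upper)
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


class DirectSolver:
    """Toy direct solver that factors either at setup or at first solve."""

    def __init__(self, factor_at_setup):
        self.factor_at_setup = factor_at_setup
        self._matrix = None
        self._factors = None

    def setup(self, matrix):
        self._matrix = matrix
        # Eager mode: a singular matrix fails here, even if never solved with.
        self._factors = lu_factor(matrix) if self.factor_at_setup else None

    def solve(self, b):
        # Lazy mode: the factorization (and any failure) is deferred to here.
        if self._factors is None:
            self._factors = lu_factor(self._matrix)
        return lu_solve(*self._factors, b)


if __name__ == "__main__":
    zero = [[0.0] * 3 for _ in range(3)]
    DirectSolver(factor_at_setup=False).setup(zero)   # passes silently
    try:
        DirectSolver(factor_at_setup=True).setup(zero)
    except SingularMatrixError as err:
        print("setup failed:", err)
```

In the lazy variant, a preconditioner built for a singular matrix causes no error as long as `solve` is never called, which matches the behavior described above before the MueLu change.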

@ikalash
Collaborator Author

ikalash commented Nov 17, 2023

> The failure was triggered by MueLu switching the factorization of the coarse grid from first solve to setup. So it seems like Albany is constructing the preconditioner, but then doesn't use it to solve a system. I can provide a patch against Trilinos tomorrow morning that triggers the behavior.

That would be great. I won't get to this until next week so it is no rush.

@ikalash
Collaborator Author

ikalash commented Dec 19, 2023

I am very sorry but I still haven't had a chance to work on this. Unfortunately I am really swamped right now getting ready for 2 all-hands meetings after the shutdown and working on a few other time-critical things. Does someone else have the time to look at this issue? I can pass along instructions on how to reproduce it from @cgcgcg . Maybe we can discuss this at the Albany meeting tomorrow.

@ikalash
Collaborator Author

ikalash commented Dec 19, 2023

I forgot to say, I am not sure when I would have a chance to look at this.

@ikalash
Collaborator Author

ikalash commented Dec 20, 2023

Ok, per the discussion at today's Albany meeting, I switched the problematic tests so that they use Ifpack2 to avoid this issue, allowing @cgcgcg to merge his PR. We can look more at the cause of the issue in the new year when I or others have more time.

@cgcgcg : very sorry for the delay! You should now be able to merge the code that you reverted earlier due to these test failures.
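
For readers unfamiliar with how such a switch is expressed, the sketch below models a Stratimikos-style parameter list as a nested Python dict and swaps the preconditioner selection. It is an assumption-laden illustration: `use_preconditioner` is a hypothetical helper, the actual change was made in the Albany test input files (not shown in this thread), and the surrounding nesting and sublist contents in those files may differ. Only the `Preconditioner Type` and `Preconditioner Types` keys follow Stratimikos naming.

```python
import copy


def use_preconditioner(stratimikos, name):
    """Return a copy of the parameter dict that selects preconditioner `name`.

    The input dict is left untouched; an empty sublist is created for the
    new preconditioner if none exists, so its defaults would apply.
    """
    updated = copy.deepcopy(stratimikos)
    updated["Preconditioner Type"] = name
    updated.setdefault("Preconditioner Types", {}).setdefault(name, {})
    return updated


if __name__ == "__main__":
    # Assumed shape of the solver settings before the change.
    before = {
        "Linear Solver Type": "Belos",
        "Preconditioner Type": "MueLu",
        "Preconditioner Types": {"MueLu": {}},
    }
    after = use_preconditioner(before, "Ifpack2")
    print(after["Preconditioner Type"])
```

This only sidesteps the failing factorization; it does not explain why the tests build a preconditioner for a singular matrix in the first place.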

@cgcgcg

cgcgcg commented Dec 20, 2023

No problem! Thanks for letting me know!

@ikalash
Collaborator Author

ikalash commented Dec 20, 2023

Sure! Again, my apologies that it took so long!
