Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run singularity-eos through clang address sanitizer... and fix the HIP segfault! #437

Merged
merged 13 commits into from
Dec 3, 2024

Conversation

Yurlungur
Copy link
Collaborator

@Yurlungur Yurlungur commented Nov 29, 2024

PR Summary

I ran singularity-eos through the clang address and undefined behavior sanitizer on my laptop. It found a few minor things like undefined variables, but it also found the source of the HIP segfault. The issue was that the constructors for the pte solvers use forwarding references, e.g.,

  template <typename EOS_t, typename Real_t, typename CReal_t, typename Lambda_t>
  PORTABLE_INLINE_FUNCTION
  PTESolverFixedT(const std::size_t nmat, EOS_t &&eos, const Real vfrac_tot,
                  const Real T_true, Real_t &&rho, Real_t &&vfrac, Real_t &&sie,
                  CReal_t &&temp, Real_t &&press, Lambda_t &&lambda, Real *scratch,
                  const MixParams &params = MixParams())

but the accessors in get_sg_eos all call the constructor by passing in accessors to 2d views, e.g., &rho_pte(tid, 0) which are located from the unique token logic used to prevent over-allocated scratch memory (for example).

The problem is that the compiler is capturing these in the function call as temprorary variables, but the perfect forwarding references are then capturing them by reference, not value, meaning they go out of scope before the solver is actually used.

The issue is resolved by explicitly declaring variables for each of the accessors passed in. e.g.,

          Real *prho_pte = &rho_pte(tid, 0);
          Real *pvfrac_pte = &vfrac_pte(tid, 0);
          Real *psie_pte = &sie_pte(tid, 0);
          Real *ptemp_pte = &temp_pte(tid, 0);
          Real *ppress_pte = &press_pte(tid, 0);
          Real *pscratch = &solver_scratch(tid, 0);
          PTESolverFixedT<singularity::EOSAccessor_, Real *, Real **> method(
              npte, eos_inx, vfrac_sum, temp_pte(tid, 0), prho_pte,
              pvfrac_pte, psie_pte, ptemp_pte, ppress_pte,
              cache, pscratch);

We may wish to remove perfect forwarding down the line and instead insist these constructors capture by value. See #438 .

Here's the build command I used.

cmake -DSINGULARITY_FORCE_SUBMODULE_MODE=ON -DSINGULARITY_USE_KOKKOS=ON -DSINGULARITY_BUILD_TESTS=ON -DCMAKE_CXX_COMPILER=clang++-15 -DCMAKE_CXX_FLAGS="-fsanitize=address -fsanitize=undefined -Wall -fno-omit-frame-pointer" -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DCMAKE_BUILD_TYPE=Debug ..

I added this to the github-workflows as a test. In the same test, I also add a strict check for warnings. If there are any warnings in the subset of code built, the test fails. Naturally, I also went through the code and eliminated said warnings. :)

PR Checklist

  • Adds a test for any bugs fixed. Adds tests for new features.
  • Format your changes by using the make format command after configuring with cmake.
  • Document any new features, update documentation for changes made.
  • Make sure the copyright notice on any files you modified is up to date.
  • After creating a pull request, note it in the CHANGELOG.md file.
  • LANL employees: make sure tests pass both on the github CI and on the Darwin CI

If preparing for a new release, in addition please check the following:

  • Update the version in cmake.
  • Move the changes in the CHANGELOG.md file under a new header for the new release, and reset the categories.
  • Ensure that any when='@main' dependencies are updated to the release version in the package.py

@rbberger
Copy link
Collaborator

rbberger commented Nov 29, 2024

@Yurlungur I think the issue here is that &rho_pte(tid, 0) is a pointer, but the memory location storing that pointer is a temporary if you pass it within the function call. That temporary is then passed as reference and stored as reference. However, when the call statement completes, that temporary goes out of scope and you therefore end up with a reference to an invalid memory location that was storing the pointer value.

@Yurlungur
Copy link
Collaborator Author

@Yurlungur I think the issue here is that &rho_pte(tid, 0) is a pointer, but the memory location storing that pointer is a temporary if you pass it within the function call. That temporary is then passed as reference and stored as reference. However, when the call statement completes, that temporary goes out of scope and you therefore end up with a reference to an invalid memory location that was storing the pointer value.

Thanks @rbberger . Right this makes sense. If we were passing by value, this wouldn't be a problem right? Its because we use forwarding references in our accessors, which allows the compiler to bind them as references.

I wonder if this suggests that we should insist our accessors be passed by value. And we just assume reference semantics under the hood, which is how C-style arrays and views work anyway.

Copy link
Collaborator Author

@Yurlungur Yurlungur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I went through and cleaned up more warnings. I also added the clang-sanitizer test as a github workflow. This required adding some new cmake build options to support that. I made this test also fail if it encounters any warnings, which I think will help us catch problematic code going forward. The clang sanitizer checks address/memory errors as well as looks for undefined behavior.

I also snuck in some submodule updates (which shouldn't impact the spack builds at all).

@@ -359,14 +359,14 @@ PORTABLE_INLINE_FUNCTION Real Gruneisen::PressureFromDensityInternalEnergy(
template <typename Indexer_t>
PORTABLE_INLINE_FUNCTION Real
Gruneisen::MinInternalEnergyFromDensity(const Real rho_in, Indexer_t &&lambda) const {
const Real rho = std::min(rho_in, _rho_max);
// const Real rho = std::min(rho_in, _rho_max);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are unused var warnings but I want to keep the vars here as a comment, for if/when we fill this in

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call!

@@ -370,7 +370,7 @@ MGUsup::FillEos(Real &rho, Real &temp, Real &sie, Real &press, Real &cv, Real &b
}
if (output & thermalqs::temperature)
temp = TemperatureFromDensityInternalEnergy(rho, sie);
if (output & thermalqs::specific_internal_energy) sie = sie;
// if (output & thermalqs::specific_internal_energy) sie = sie;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silences warning from self-assignment. But I want to leave it in as a comment.

Comment on lines +43 to +44
: _a0(a0), _a1(a1), _a2c(a2c), _a2e(a2e), _a3(a3), _b0(b0), _b1(b1), _b2c(b2c),
_b2e(b2e), _b3(b3), _rho0(rho0) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silences warning about initializer list ordering.

Comment on lines 57 to 68
// JMM: Address sanitizer likes these named. My guess is
// that the forwarding references are not being resolved
// properly.
Real *prho_pte = &rho_pte(tid, 0);
Real *pvfrac_pte = &vfrac_pte(tid, 0);
Real *psie_pte = &sie_pte(tid, 0);
Real *ptemp_pte = &temp_pte(tid, 0);
Real *ppress_pte = &press_pte(tid, 0);
Real *pscratch = &solver_scratch(tid, 0);
PTESolverRhoT<singularity::EOSAccessor_, Real *, Real **> method(
npte, eos_inx, vfrac_sum, sie_v(i), &rho_pte(tid, 0), &vfrac_pte(tid, 0),
&sie_pte(tid, 0), &temp_pte(tid, 0), &press_pte(tid, 0), cache,
&solver_scratch(tid, 0));
npte, eos_inx, vfrac_sum, sie_v(i), prho_pte, pvfrac_pte, psie_pte,
ptemp_pte, ppress_pte, cache, pscratch);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (and similar changes) silence the segfault. :) We may want to do more, though. See issue #438

Comment on lines +49 to +55
# Optionally build these if you want to build/test the fortran
# infrastructure without invoking a fortran compiler If fortran is
# off, you can set this. If fortran is on, it's forced to ON and is
# not set-able.
cmake_dependent_option(SINGULARITY_BUILD_FORTRAN_BACKEND
"Build the C++ code to which the fortran bindings bind"
OFF "NOT SINGULARITY_USE_FORTRAN" ON)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this mainly to support building with clang, because flang is still not super well supported, and I wanted to be able to run the C++ code through clang sanitizer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great idea! Thanks for adding it

@@ -590,6 +603,7 @@ endif()

if(SINGULARITY_BUILD_TESTS)
include(CTest)
enable_testing()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how tests ever ran without this. Did cmake get stricter at some point?

Comment on lines -33 to 39
constexpr Real GPa = 1.0e10;
constexpr Real MJ_per_kg = 1.0e10;

#ifdef SINGULARITY_TEST_SESAME
#ifdef SINGULARITY_USE_SPINER_WITH_HDF5

constexpr Real GPa = 1.0e10;
constexpr Real MJ_per_kg = 1.0e10;

using singularity::DavisProducts;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silences unused variable warning when this test is disabled.

Comment on lines -106 to +109
int N_;
Real *P_;
Real *rho_;
Real *sie_;
int N_;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silences initializer list warning

utils/kokkos Outdated
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I snuck in submodule updates to more recent versions of kokkos, ports-of-call, and kokkos-kernels. Only impacts submodule mode, but the versions we were pinned to were pretty old. This also silences some warnings from ports of call. @pdmullen this should fix the warnings you guys were seeing in artemis. :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test will fail if the sanitizer is unhappy OR if there are any warnings in the set of code the code tests. I want to be stricter about this going forward, as I think it will help us catch problems like this in the future.

@Yurlungur
Copy link
Collaborator Author

@dholladay00 @jhp-lanl @rbberger @mauneyc-LANL this is ready for review.

@jdolence this resolves the HIP segfault. I don't think it was going to be an issue for riot, but now we can be confident.

@pdmullen this should silence the warnings you were seeing in artemis.

@Yurlungur Yurlungur added bug Something isn't working build Something to do with the build system clean-up Changes that don't affect code execution but make the code cleaner Testing Additions/changes to the testing infrastruture labels Nov 29, 2024
@Yurlungur
Copy link
Collaborator Author

All tests now passing on re-git including the HIP tests that were previously failing.

@Yurlungur
Copy link
Collaborator Author

OK also added checks for compiler warnings from GCC, which caught a few more things, now cleaned up.

-DSINGULARITY_FORCE_SUBMODULE_MODE=ON \
-DSINGULARITY_USE_KOKKOS=ON \
..
make -j4
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to run the code, this is just to check for warnings.

Comment on lines -38 to 41
static struct ThermalUnitsInit {
[[maybe_unused]] static struct ThermalUnitsInit {
} thermal_units_init_tag;
static struct LengthTimeUnitsInit {
[[maybe_unused]] static struct LengthTimeUnitsInit {
} length_time_units_init_tag;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang didn't care that these tags might be unused, maybe because it could tell they were for tag dispatch. But gcc did care. [maybe_unused] is a C++17 feature, so we can safely use it and assume it's compiler agnostic.

Copy link
Collaborator

@jhp-lanl jhp-lanl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving with some minor comments/requests.

singularity-eos/eos/get_sg_eos_rho_e.cpp Outdated Show resolved Hide resolved
Comment on lines +49 to +55
# Optionally build these if you want to build/test the fortran
# infrastructure without invoking a fortran compiler If fortran is
# off, you can set this. If fortran is on, it's forced to ON and is
# not set-able.
cmake_dependent_option(SINGULARITY_BUILD_FORTRAN_BACKEND
"Build the C++ code to which the fortran bindings bind"
OFF "NOT SINGULARITY_USE_FORTRAN" ON)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a great idea! Thanks for adding it

CMakeLists.txt Outdated Show resolved Hide resolved
@Yurlungur
Copy link
Collaborator Author

Merging this now, but still open to feedback if anyone else reviews. I can submit an MR later.

@Yurlungur
Copy link
Collaborator Author

Also thanks for the review @jhp-lanl !

@Yurlungur Yurlungur merged commit 668d71a into main Dec 3, 2024
8 checks passed
@Yurlungur Yurlungur deleted the jmm/sanitize branch December 3, 2024 00:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working build Something to do with the build system clean-up Changes that don't affect code execution but make the code cleaner Testing Additions/changes to the testing infrastruture
Projects
None yet
Development

Successfully merging this pull request may close these issues.

upgrade kokkos and kokkos kernels submodules
5 participants