periodic_test fails #81

Open
cwsmith opened this issue Jan 24, 2024 · 4 comments
Labels
bug Something isn't working

Comments


cwsmith commented Jan 24, 2024

periodic_test segfaults when built from master with the Kokkos Serial backend. Below is the valgrind output from one of the two processes; the other process produced a similar trace.

Omega_h cmake args:

$ cat Omega_h_cmake_args.txt
-DBUILD_TESTING:BOOL="on" -DBUILD_SHARED_LIBS:BOOL="on" -DCMAKE_INSTALL_PREFIX:PATH="/space/cwsmith/omegahKkVersions/buildOmegahSimKokkosSerialMpion_master/install" -DOmega_h_USE_Kokkos:BOOL="on" -DKokkos_PREFIX:PATH="/space/cwsmith/omegahKkVersions/buildKokkos/install" -DOmega_h_USE_SimModSuite:BOOL="on" -DOmega_h_USE_MPI:BOOL="on" -DOmega_h_USE_MPI:BOOL="on" -DOmega_h_USE_Kokkos:BOOL="on" -DKokkos_PREFIX:PATH="/space/cwsmith/omegahKkVersions/buildKokkos/install" -DOmega_h_USE_MPI:BOOL="on" -DOmega_h_USE_OpenMP:BOOL="OFF" -DOmega_h_USE_CUDA:BOOL="OFF"

Versions:

Omega_h - master @ c5f1dc9d
Kokkos - develop @ ed08974c7 (newer than the last tagged version, 4.2.00)
Simmetrix SimModSuite - 2023.1-230907dev

Valgrind output:

==3612296== Memcheck, a memory error detector
==3612296== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==3612296== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==3612296== Command: ./src/periodic_test /space/cwsmith/omegahKkVersions/omega_h_master/meshes/wedge_matchZ_12elem.sms /space/cwsmith/omegahKkVersions/omega_h_master/meshes/wedge_match.smd /space/cwsmith/omegahKkVersions/omega_h_master/meshes/wedge_matchZ_12elem_sync_2.osh 2
==3612296== Parent PID: 3612294
==3612296==
==3612296== Invalid read of size 4
==3612296==    at 0x6654270: host_atomic_fetch_oper<desul::Impl::sub_operator<int, int const>, int, desul::MemoryOrderRelaxed> (Fetch_Op_ScopeCaller.hpp:44)
==3612296==    by 0x6654270: host_atomic_fetch_sub<int, desul::MemoryOrderRelaxed, desul::MemoryScopeCaller> (Fetch_Op_Generic.hpp:40)
==3612296==    by 0x6654270: atomic_fetch_sub<int, desul::MemoryOrderRelaxed, desul::MemoryScopeCaller> (Generic.hpp:60)
==3612296==    by 0x6654270: atomic_fetch_sub<int> (Kokkos_Atomics_Desul_Wrapper.hpp:83)
==3612296==    by 0x6654270: Kokkos::Impl::SharedAllocationRecord<void, void>::decrement(Kokkos::Impl::SharedAllocationRecord<void, void>*) (Kokkos_SharedAlloc.cpp:212)
==3612296==    by 0x5213382: assign_direct (Kokkos_SharedAlloc.hpp:477)
==3612296==    by 0x5213382: Kokkos::Impl::ViewTracker<Kokkos::View<int*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> > >::operator=(Kokkos::Impl::ViewTracker<Kokkos::View<int*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> > > const&) (Kokkos_ViewTracker.hpp:79)
==3612296==    by 0x521076E: Kokkos::View<int*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> >::operator=(Kokkos::View<int*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> > const&) (Kokkos_View.hpp:1288)
==3612296==    by 0x520BA08: Omega_h::Write<int>::operator=(Omega_h::Write<int> const&) (Omega_h_array.hpp:49)
==3612296==    by 0x5221F08: Omega_h::Read<int>::operator=(Omega_h::Read<int> const&) (Omega_h_array.hpp:88)
==3612296==    by 0x5451023: Omega_h::Mesh::copy_meta() const (Omega_h_mesh.cpp:1235)
==3612296==    by 0x54BE3C9: Omega_h::migrate_mesh(Omega_h::Mesh*, Omega_h::Dist, Omega_h_Parting, bool) (Omega_h_migrate.cpp:383)
==3612296==    by 0x544D863: Omega_h::Mesh::balance(bool) (Omega_h_mesh.cpp:956)
==3612296==    by 0x41CFCF: main (periodic_test.cpp:61)
==3612296==  Address 0x38 is not stack'd, malloc'd or (recently) free'd
==3612296==
==3612296==
==3612296== Process terminating with default action of signal 11 (SIGSEGV)
==3612296==  Access not within mapped region at address 0x38
==3612296==    at 0x6654270: host_atomic_fetch_oper<desul::Impl::sub_operator<int, int const>, int, desul::MemoryOrderRelaxed> (Fetch_Op_ScopeCaller.hpp:44)
==3612296==    by 0x6654270: host_atomic_fetch_sub<int, desul::MemoryOrderRelaxed, desul::MemoryScopeCaller> (Fetch_Op_Generic.hpp:40)
==3612296==    by 0x6654270: atomic_fetch_sub<int, desul::MemoryOrderRelaxed, desul::MemoryScopeCaller> (Generic.hpp:60)
==3612296==    by 0x6654270: atomic_fetch_sub<int> (Kokkos_Atomics_Desul_Wrapper.hpp:83)
==3612296==    by 0x6654270: Kokkos::Impl::SharedAllocationRecord<void, void>::decrement(Kokkos::Impl::SharedAllocationRecord<void, void>*) (Kokkos_SharedAlloc.cpp:212)
==3612296==    by 0x5213382: assign_direct (Kokkos_SharedAlloc.hpp:477)
==3612296==    by 0x5213382: Kokkos::Impl::ViewTracker<Kokkos::View<int*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> > >::operator=(Kokkos::Impl::ViewTracker<Kokkos::View<int*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> > > const&) (Kokkos_ViewTracker.hpp:79)
==3612296==    by 0x521076E: Kokkos::View<int*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> >::operator=(Kokkos::View<int*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> > const&) (Kokkos_View.hpp:1288)
==3612296==    by 0x520BA08: Omega_h::Write<int>::operator=(Omega_h::Write<int> const&) (Omega_h_array.hpp:49)
==3612296==    by 0x5221F08: Omega_h::Read<int>::operator=(Omega_h::Read<int> const&) (Omega_h_array.hpp:88)
==3612296==    by 0x5451023: Omega_h::Mesh::copy_meta() const (Omega_h_mesh.cpp:1235)
==3612296==    by 0x54BE3C9: Omega_h::migrate_mesh(Omega_h::Mesh*, Omega_h::Dist, Omega_h_Parting, bool) (Omega_h_migrate.cpp:383)
==3612296==    by 0x544D863: Omega_h::Mesh::balance(bool) (Omega_h_mesh.cpp:956)
==3612296==    by 0x41CFCF: main (periodic_test.cpp:61)
==3612296==  If you believe this happened as a result of a stack
==3612296==  overflow in your program's main thread (unlikely but
==3612296==  possible), you can try to increase the size of the
==3612296==  main thread stack using the --main-stacksize= flag.
==3612296==  The main thread stack size used in this run was 8388608.
==3612296==
==3612296== HEAP SUMMARY:
==3612296==     in use at exit: 13,116,178 bytes in 4,205 blocks
==3612296==   total heap usage: 15,374 allocs, 11,169 frees, 14,496,121 bytes allocated
==3612296==
==3612296== LEAK SUMMARY:
==3612296==    definitely lost: 0 bytes in 0 blocks
==3612296==    indirectly lost: 0 bytes in 0 blocks
==3612296==      possibly lost: 10,525 bytes in 206 blocks
==3612296==    still reachable: 13,105,653 bytes in 3,999 blocks
==3612296==         suppressed: 0 bytes in 0 blocks
==3612296== Rerun with --leak-check=full to see details of leaked memory
==3612296==
==3612296== For lists of detected and suppressed errors, rerun with: -s
==3612296== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
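
The faulting address 0x38 is very small, which is consistent with (though not proof of) a member access through a null record pointer: the reported address would then simply be the member's offset inside the record. A minimal C++ sketch of that arithmetic follows. The `HypotheticalRecord` type, its `use_count` field, and the 0x38 offset are assumptions made for illustration; this is not Kokkos's actual `SharedAllocationRecord` layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a reference-counted allocation record.
// The layout is invented for illustration only; it is NOT the real
// Kokkos::Impl::SharedAllocationRecord.
struct HypotheticalRecord {
  unsigned char header[0x38];  // assumed leading fields, 0x38 bytes in total
  int use_count;               // assumed reference counter
};

// With the assumed layout, the counter sits at offset 0x38.
static_assert(offsetof(HypotheticalRecord, use_count) == 0x38,
              "use_count is expected at offset 0x38 in this sketch");

// Address that reading record->use_count would touch, computed with
// integer arithmetic so that nothing is actually dereferenced.
inline std::uintptr_t use_count_address(std::uintptr_t record_address) {
  return record_address + offsetof(HypotheticalRecord, use_count);
}

// A null record pointer (address 0) would fault at exactly 0x38.
inline void self_check() {
  assert(use_count_address(0) == 0x38);
}
```

If this reading is right, the question to chase in a debugger is why the View being assigned to in Omega_h::Mesh::copy_meta() holds a null record that is still treated as tracked.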
@cwsmith cwsmith added the bug Something isn't working label Jan 24, 2024
cwsmith added a commit that referenced this issue Jan 24, 2024
Collaborator

joshia5 commented Feb 1, 2024

At the time of development, the test passed with the CUDA backend and showed no errors under valgrind.

Collaborator

joshia5 commented Feb 1, 2024

A starting point for debugging would be to step into the "migrate_matches" routine. I am not sure when I'll be able to reproduce this issue and work on fixing it.

Collaborator

joshia5 commented Feb 1, 2024

@cwsmith is it possible that this is a new Kokkos/GPU-backend issue?

Author

cwsmith commented Feb 1, 2024

Good question. I can check that.
