Lassen: No MPI 4+ Support #5759

Merged: 3 commits from fix-lassen-no-mpi4 into BLAST-WarpX:development on Mar 15, 2025

Conversation

@ax3l (Member) commented Mar 11, 2025

  • mpi4py>=4.0 was released and supports (and requires) MPI 4 features, but Lassen does not support MPI 4 in the IBM Spectrum MPI rolling releases.
    Thus, limit the upper version of mpi4py for now (see the sketch after this list).

  • Make the compiler settings for h5py a bit more robust by using the Lassen-specific wrapper name: https://hpc.llnl.gov/documentation/tutorials/using-lc-s-sierra-systems#Compilers

  • The Lassen TOSS4 upgrade was never shipped, so we can simplify our paths and names again. Some leftover TOSS4 paths here prevented a smooth install.
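
For illustration, the version cap amounts to something like the following on the Python side; this is a sketch, not the committed install script, and the exact pin used in the Lassen profile may differ:

```bash
# Sketch: cap mpi4py below 4.0 so pip does not pull a release that
# requires MPI 4 features missing from IBM Spectrum MPI on Lassen.
python3 -m pip install "mpi4py<4.0"
```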

cc @bzdjordje

Fix #5728

To Do

  • compiles
  • exe: runs without errors
  • python: runs without errors

@ax3l added labels on Mar 11, 2025: bug (Something isn't working), install, component: third party (Changes in WarpX that reflect a change in a third-party library), bug: affects latest release (Bug also exists in latest release version), machine / system (Machine or system-specific issue)
@ax3l ax3l changed the title Lassen: No MPI4 Support Lassen: No MPI 4+ Support Mar 11, 2025
@ax3l ax3l force-pushed the fix-lassen-no-mpi4 branch from 517e7d3 to a8181b7 Compare March 11, 2025 23:59
@ax3l ax3l force-pushed the fix-lassen-no-mpi4 branch from a8181b7 to d646054 Compare March 12, 2025 01:22
`mpi4py>=4.0` was released and supports MPI 4 features.
But Lassen does not support MPI 4 in the IBM Spectrum
rolling releases.

Thus, limit the upper versions of `mpi4py` for now.

Also making the compilers for `h5py` a bit more robust, using
the Lassen-specific wrapper name (hey, thanks for being special).
https://hpc.llnl.gov/documentation/tutorials/using-lc-s-sierra-systems#Compilers
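
For context, the usual pattern for building `h5py` from source against an MPI compiler wrapper looks roughly like this; the wrapper name and HDF5 path below are placeholders, not the Lassen-specific values this commit uses (see the LLNL docs linked above for the actual wrapper name):

```bash
# Sketch only; CC and HDF5_DIR are placeholders for the values set in the
# Lassen profile. HDF5_MPI=ON asks h5py for the MPI-enabled build.
export CC=mpicc
export HDF5_MPI=ON
export HDF5_DIR=/path/to/parallel-hdf5   # placeholder
python3 -m pip install --no-binary=h5py h5py
```
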
@ax3l ax3l force-pushed the fix-lassen-no-mpi4 branch 3 times, most recently from ed63d5f to 51c7ad7 Compare March 14, 2025 21:45
TOSS4 never arrived.
@ax3l ax3l force-pushed the fix-lassen-no-mpi4 branch from 51c7ad7 to ffe04a9 Compare March 14, 2025 21:59
@ax3l (Member Author) commented Mar 15, 2025

Hm, I see segfaults of the form

 3: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/pami_port/libpami.so.3(_ZN4PAMI6Device5Shmem6PacketINS_4Fifo10FifoPacketILj64ELj4096EEEE12writePayloadERS5_Pvm+0xe8
) [0x200055f46fb8]
    ?? ??:0

 4: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/pami_port/libpami.so.3(_ZN4PAMI6Device9Interface11PacketModelINS0_5Shmem11PacketModelINS0_11ShmemDeviceINS_4Fifo8Wr
apFifoINS6_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAtomicEEELj256EEENSA_8IndirectINSA_6NativeEEENS3_9CMAShaddrELj256ELj512EEEEEE15postMultiPacketILj512EEEbRAT__hPFvPv
SQ_13pami_result_tESQ_mmSQ_mSQ_m+0x304) [0x200055f65ae4]
    ?? ??:0

 5: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/pami_port/libpami.so.3(_ZN4PAMI8Protocol4Send11EagerSimpleINS_6Device5Shmem11PacketModelINS3_11ShmemDeviceINS_4Fifo
8WrapFifoINS7_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAtomicEEELj256EEENSB_8IndirectINSB_6NativeEEENS4_9CMAShaddrELj256ELj512EEEEELNS1_15configuration_tE1EE11simple_i
mplEP11pami_send_t+0x434) [0x200055f78e74]
    ?? ??:0

 6: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/pami_port/libpami.so.3(_ZN4PAMI8Protocol4Send5EagerINS_6Device5Shmem11PacketModelINS3_11ShmemDeviceINS_4Fifo8WrapFi
foINS7_10FifoPacketILj64ELj4096EEENS_7Counter15IndirectBoundedINS_6Atomic12NativeAtomicEEELj256EEENSB_8IndirectINSB_6NativeEEENS4_9CMAShaddrELj256ELj512EEEEENS3_3IBV14GpuPacketModelINSN_6DeviceELb0EEE
E9EagerImplILNS1_15configuration_tE1ELb1EE6simpleEP11pami_send_t+0x2c) [0x200055f7916c]
    ?? ??:0

 7: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/pami_port/libpami.so.3(PAMI_Send+0x58) [0x200055ea5758]
    ?? ??:0

 8: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/spectrum_mpi/mca_pml_pami.so(pml_pami_send+0x6d8) [0x200055cdf6c8]
    ?? ??:0

 9: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/container/../lib/spectrum_mpi/mca_pml_pami.so(mca_pml_pami_isend+0x568) [0x200055ce0658]
    ?? ??:0

10: /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/libmpi_ibm.so.3(MPI_Isend+0x160) [0x200051fb4750]
    ?? ??:0

11: ./warpx.rz() [0x1079b6b8]
    amrex::ParallelDescriptor::Message amrex::ParallelDescriptor::Asend<char>(char const*, unsigned long, int, int, ompi_communicator_t*) at ??:?

12: ./warpx.rz() [0x10383bf8]
    void amrex::communicateParticlesStart<amrex::ParticleContainer_impl<amrex::SoAParticle<7, 0>, 7, 0, amrex::ArenaAllocator, amrex::DefaultAssignor>, amrex::PODVector<char, amrex::PolymorphicArenaAllocator<char> >, amrex::PODVector<char, amrex::PolymorphicArenaAllocator<char> >, 0>(amrex::ParticleContainer_impl<amrex::SoAParticle<7, 0>, 7, 0, amrex::ArenaAllocator, amrex::DefaultAssignor> const&, amrex::ParticleCopyPlan&, amrex::PODVector<char, amrex::PolymorphicArenaAllocator<char> > const&, amrex::PODVector<char, amrex::PolymorphicArenaAllocator<char> >&) [clone .isra.0] at tmpxft_00018e9c_00000000-6_MultiParticleContainer.cudafe1.cpp:?

13: ./warpx.rz() [0x10385dcc]
    amrex::ParticleContainer_impl<amrex::SoAParticle<7, 0>, 7, 0, amrex::ArenaAllocator, amrex::DefaultAssignor>::RedistributeGPU(int, int, int, int, bool) at ??:?

@ax3l (Member Author) commented Mar 15, 2025

and

3: 1: warpx.rz: /__SMPI_build_dir_______________________________________/ibmsrc/pami/ibm-pami/buildtools/pami_build_port/../pami/components/devices/shmem/shaddr/CMAShaddr.h:164: size_t PAMI::Device::Shmem::CMAShaddr::read_impl(PAMI::Memregion*, size_t, PAMI::Memregion*, size_t, size_t, bool*): Assertion `cbytes > 0' failed.

@ax3l (Member Author) commented Mar 15, 2025

They go away if I remove the `-M "-gpu"` from the jsrun line (see the sketch below).

  -M, --smpiargs=<SMPI args> Quoted argument list meaningful for Spectrum MPI
                             applications.

Passing `-M "-gpu"` causes the segfaults.
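
For illustration, the fix amounts to dropping that argument from the launch line; the resource flags and input file below are placeholders, not the exact committed script:

```bash
# Placeholder resource counts; only the -M "-gpu" difference matters here.
# Segfaults in PAMI (see the traces above):
jsrun -n 4 -a 1 -g 1 -c 7 -M "-gpu" ./warpx.rz inputs_rz
# Runs cleanly:
jsrun -n 4 -a 1 -g 1 -c 7 ./warpx.rz inputs_rz
```
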
@ax3l (Member Author) commented Mar 15, 2025

@bzdjordje this fixes it for me. Maybe all you need to do is update your jsrun line.

@ax3l (Member Author) commented Mar 15, 2025

Merging to show a fully working example to the user in the live docs & mainline scripts.
https://warpx.readthedocs.io/en/latest/install/hpc/lassen.html

@ax3l ax3l merged commit bdcb685 into BLAST-WarpX:development Mar 15, 2025
30 of 36 checks passed
@ax3l ax3l deleted the fix-lassen-no-mpi4 branch March 15, 2025 01:20
Successfully merging this pull request may close these issues: Rebuilding on Lassen.