Work-Around: Segfault in MPI_Init with HIP #4237
Conversation
Force-pushed from abedf6c to 9f00595.
All that matters is that HIP is initialized before GPU-aware MPI.
Thanks really a lot for this PR! I've left a small comment. I'll test these changes as soon as possible.
```cpp
#if defined(AMREX_USE_HIP) && defined(AMREX_USE_MPI)
    hipError_t hip_ok = hipInit(0);
    if (hip_ok != hipSuccess) {
        std::cerr << "hipInit failed with error code " << hip_ok << "! Aborting now.\n";
```
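For context, here is a self-contained sketch of the ordering this PR establishes (HIP device context first, then MPI). The stub branch and the helper name `init_hip_then_mpi` are illustrative assumptions so the sketch compiles without ROCm/MPI; they are not part of the PR:

```cpp
// Sketch (not the PR itself) of the work-around's ordering: create the
// HIP device context *before* MPI_Init, so that GPU-aware MPI
// (e.g. HPE/Cray MPICH) finds an already-initialized runtime.
#include <cstdio>

#if defined(AMREX_USE_HIP) && defined(AMREX_USE_MPI)
  #include <hip/hip_runtime.h>
  #include <mpi.h>
#else
  // Illustrative stubs for GPU-less builds; NOT the real APIs.
  using hipError_t = int;
  constexpr hipError_t hipSuccess = 0;
  inline hipError_t hipInit (int) { return hipSuccess; }
  inline int MPI_Init (int* /*argc*/, char*** /*argv*/) { return 0; }
#endif

// Returns true if HIP (and then MPI) initialized in the required order.
inline bool init_hip_then_mpi (int* argc, char*** argv)
{
    hipError_t hip_ok = hipInit(0);   // device context first ...
    if (hip_ok != hipSuccess) {
        std::fprintf(stderr, "hipInit failed with error code %d! Aborting now.\n",
                     static_cast<int>(hip_ok));
        return false;
    }
    MPI_Init(argc, argv);             // ... then (GPU-aware) MPI
    return true;
}
```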
Is there a reason for not using ABLASTR_ALWAYS_ASSERT_WITH_MESSAGE here?
Yes: anything that calls into AMReX functions cannot be used (safely) before AMReX is initialized. amrex::Assert implements multiple things, and not all of them work with an uninitialized AMReX context. For instance, not even amrex::Print() works before init; I had to develop a work-around in pyAMReX that falls back to printing on all ranks: AMReX-Codes/pyamrex#174
Raising a standard exception is the clean thing to do here: we are initializing MPI, and technically no AMReX is involved at this point in time yet.
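To illustrate why a plain exception is safe here, a minimal sketch (the helper name `require_pre_init` is hypothetical, not an ablastr API): std::runtime_error relies only on the C++ standard library, so it works before MPI or AMReX is up, whereas an assert macro that calls into the framework would itself misbehave:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical pre-init check: safe this early in main() because it
// uses only the C++ standard library, never a not-yet-initialized
// framework (MPI, AMReX, ...).
inline void require_pre_init (bool ok, const std::string& msg)
{
    if (!ok) { throw std::runtime_error(msg); }
}
```

The caller can catch the exception (or let it terminate the program with a readable message), which is exactly the failure mode wanted before any parallel context exists.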
I tested this on Frontier and it does not cause issues.
@tmsclark @lucafedeli88 let me know if it fixes your LUMI issue :)
See:
https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html#olcfdev-1655-occasional-seg-fault-during-mpi-init
Proposed work-around for @tmsclark in #4236.
I think this might be a general defect of the HPE/Cray GPU-aware MPI implementations at this point: explicitly initializing the device context before MPI init can help satisfy the assumptions GPU-aware MPI makes during its own initialization.
Clarifying with AMD whether we can check for an already-initialized HIP/ROCm runtime, in case we want to move this safety net into AMReX at some point, too. (Also, it is not clear whether hipInit is idempotent.)
ablastr::parallelization::mpi_init?