Work-Around: Segfault in MPI_Init with HIP #4237
Conversation
Force-pushed from abedf6c to 9f00595.
All that matters is that HIP is initialized before GPU-aware MPI.
Thanks really a lot for this PR! I've left a small comment. I'll test these changes as soon as possible.
```cpp
#if defined(AMREX_USE_HIP) && defined(AMREX_USE_MPI)
    hipError_t hip_ok = hipInit(0);
    if (hip_ok != hipSuccess) {
        std::cerr << "hipInit failed with error code " << hip_ok << "! Aborting now.\n";
```
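For context, here is a self-contained sketch of the ordering this PR establishes (HIP device context first, then MPI). The stub branch and the helper name `init_hip_then_mpi` are illustrative assumptions so the sketch compiles without ROCm/MPI; they are not part of the PR:

```cpp
// Sketch (not the PR itself) of the work-around's ordering: create the
// HIP device context *before* MPI_Init, so that GPU-aware MPI
// (e.g. HPE/Cray MPICH) finds an already-initialized runtime.
#include <cstdio>

#if defined(AMREX_USE_HIP) && defined(AMREX_USE_MPI)
  #include <hip/hip_runtime.h>
  #include <mpi.h>
#else
  // Illustrative stubs for GPU-less builds; NOT the real APIs.
  using hipError_t = int;
  constexpr hipError_t hipSuccess = 0;
  inline hipError_t hipInit (int) { return hipSuccess; }
  inline int MPI_Init (int* /*argc*/, char*** /*argv*/) { return 0; }
#endif

// Returns true if HIP (and then MPI) initialized in the required order.
inline bool init_hip_then_mpi (int* argc, char*** argv)
{
    hipError_t hip_ok = hipInit(0);   // device context first ...
    if (hip_ok != hipSuccess) {
        std::fprintf(stderr, "hipInit failed with error code %d! Aborting now.\n",
                     static_cast<int>(hip_ok));
        return false;
    }
    MPI_Init(argc, argv);             // ... then (GPU-aware) MPI
    return true;
}
```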
Is there a reason for not using ABLASTR_ALWAYS_ASSERT_WITH_MESSAGE here?
Yes: anything that calls into AMReX functions cannot be used (safely) before AMReX is initialized. amrex::Assert implements multiple things, and not all of them work with an uninitialized AMReX context. For instance, not even amrex::Print() works before init; I had to develop a work-around in pyAMReX that falls back to printing on all ranks: AMReX-Codes/pyamrex#174
Raising a standard exception is the clean thing to do here: we are initializing MPI, and technically no AMReX is involved at this point in time yet.
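To illustrate why a plain exception is safe here, a minimal sketch (the helper name `require_pre_init` is hypothetical, not an ablastr API): std::runtime_error relies only on the C++ standard library, so it works before MPI or AMReX is up, whereas an assert macro that calls into the framework would itself misbehave:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical pre-init check: safe this early in main() because it
// uses only the C++ standard library, never a not-yet-initialized
// framework (MPI, AMReX, ...).
inline void require_pre_init (bool ok, const std::string& msg)
{
    if (!ok) { throw std::runtime_error(msg); }
}
```

The caller can catch the exception (or let it terminate the program with a readable message), which is exactly the failure mode wanted before any parallel context exists.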
I tested this on Frontier and it does not cause issues.
@tmsclark @lucafedeli88 let me know if it fixes your LUMI issue :)
See:
https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html#olcfdev-1655-occasional-seg-fault-during-mpi-init
Proposed work-around for @tmsclark in #4236.
I think this might be a general defect of the HPE/Cray GPU-aware MPI implementations at this point: explicitly initializing the device context before MPI init can help satisfy the assumptions GPU-aware MPI makes during its own initialization.
Clarifying with AMD whether we can check for an already-initialized HIP/ROCm runtime, in case we want to move this safety net into AMReX at some point, too. (Also, it is not clear whether hipInit is idempotent.)
ablastr::parallelization::mpi_init?