Hang in one-sided communications across multiple nodes #7118
Comments
@victor-anisimov I built a debug version of mpich on Aurora at I am not sure the easiest way to switch
and verify with
|
Thank you for the suggestion, @hzhou ! I linked the reproducer code against your version of MPICH by doing The test job crashed at initialization with the error message: I guess the MPICH build configuration causes this crash. Could you take the reproducer, which is very simple, and try to run the test? |
Make sure to set
Make it
Running 100 loops now (or until my 2 hour walltime kicks me out), will update. |
I think I got some timeouts; need to confirm with larger
It is stuck in an internal Barrier. |
Timed out in job 5, 23, and 24. |
I can run the test and am able to reproduce the backtrace after setting module unload mpich Any idea what might have caused the hang in internal_Win_fence? |
Previous investigation was that one of the target processes somehow gets overwhelmed by |
Thanks @raffenet for the potential clue. I can see how that (a single process getting overwhelmed due to lower-layer failure or congestion) may cause the issue. |
I tried the latest commit b3480dd, which includes #7117, built by @colleeneb on Aurora. The behavior of the test is unchanged. I still get 8 hangs per 50 runs of the test. All 8 jobs got killed by the scheduler for exceeding the walltime limit. None of the 8 hanging jobs includes any error or warning messages from MPICH. My conclusion is that either I do not know how to properly run MPICH to get those forward progress errors printed out, or the fixes do not work for the hang in one-sided communications. Could @hzhou or @raffenet try using the latest MPICH with the one-sided bug reproducer to see if the forward progress fixes work for the present case? |
The one built by Colleen doesn't have |
Ah, thanks, it does not! I just used |
@colleeneb Only do that for debug build (vs. performance build). For debug build, you may also want to use |
I ran the ANL-116 reproducer on Aurora and with an instrumented libfabric. I verified that the remote key MPICH is passing into libfabric RMA operations is not a key generated by the libfabric CXI provider. There appears to be an MPICH bug resulting in MPICH providing an invalid remote key which can result in hangs and failed RMA operations. |
@iziemba Invalid in what way? I see that user-provided keys are limited to 4-bytes in the |
I tried adding a dummy registration to my MPICH build just to see what happens. Unless I am missing something, it looks like libfabric will not complain about the invalid key, so we need to implement some verification inside MPICH itself. Here is the code for the registration I tried:
int ret;
uint64_t key = UINT64_MAX;
void *buf = MPL_malloc(16, MPL_MEM_OTHER);
struct fid_mr *mr;
ret = fi_mr_reg(MPIDI_OFI_global.ctx[0].domain,
                buf, 16,
                FI_REMOTE_READ, 0,
                key, 0, &mr, NULL);
printf("ret = %d, %s\n", ret, fi_strerror(ret));
and here is the output 😕
|
More info after building my own libfabric and adding debug printfs: the provider thinks the requested key is a provider key for some reason and therefore thinks it is valid. If I enable |
So when the user passes in an invalid requested key, it may accidentally cause cxi to think it is a provider key and take the
bool cxip_generic_is_valid_mr_key(uint64_t key)
{
    struct cxip_mr_key cxip_key = {
        .raw = key,
    };
    if (cxip_key.is_prov)
        return cxip_is_valid_prov_mr_key(key);
    return cxip_is_valid_mr_key(key);
}
|
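To illustrate the misclassification described above, here is a minimal sketch of how an all-ones requested key can look like a provider key. The flag position (`DEMO_IS_PROV_BIT`) and the helper name are assumptions made for illustration; the real `struct cxip_mr_key` layout is not shown in this thread. The point holds for any flag position, because `UINT64_MAX` has every bit set.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: hypothetical flag position, for illustration only.
 * The actual bit layout of struct cxip_mr_key is not shown here. */
#define DEMO_IS_PROV_BIT 63u

/* Returns true if the hypothetical "provider key" flag bit is set
 * in the raw 64-bit key value. */
static bool demo_key_is_prov(uint64_t key)
{
    return ((key >> DEMO_IS_PROV_BIT) & 1u) != 0;
}
```

With this layout, `demo_key_is_prov(UINT64_MAX)` is true while a small user key such as `0x1234` is not flagged, which is consistent with the dummy registration above being routed to the provider-key validation path.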
After looking closer, I no longer think MPICH will generate a key larger than 4 bytes in this scenario. Must be something else. |
For user keys, I can update the CXI provider to check the RKEY size and ensure it is <= 4 bytes. The FI_MR_PROV_KEY key size is 8 bytes. When running with MPIR_CVAR_CH4_OFI_ENABLE_MR_PROV_KEY=1, I verified that the libfabric CXI provider was returning the expected generated RKEYs. Then, I see MPICH use an invalid key. Example:
|
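The size check proposed above for user keys could look roughly like the sketch below. The function name is hypothetical and this is not the CXI provider's code; it only shows the "fits in 4 bytes" test on a 64-bit requested key.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if a user-requested key fits in
 * 4 bytes, i.e. none of the upper 32 bits of the 64-bit value is set. */
static bool demo_user_key_fits_4_bytes(uint64_t key)
{
    return key <= (uint64_t)UINT32_MAX;
}
```

A check like this would reject the `UINT64_MAX` key from the dummy registration earlier in the thread, so an oversized user key would fail at registration time.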
Hm, OK. The warning is from the RMA origin side. So we need to trace keys from creation at the target -> usage at the origin and see if there's corruption or something. |
Potential work around in #7202 |
Update: 7202 didn’t help but turning HMEM on did, although it led to the device running out of memory |
Need to investigate the out of memory issue and see if it can be resolved in MPICH or libfabric. |
reproducer.tgz
One-sided communications between GPU pointers intermittently hang on Aurora. The attached reproducer runs on 96 nodes using six 16-node subcommunicators. One needs to run about 10-50 jobs in order to catch the hang. A successful job contains "All Done" in the output file. If that string is absent after the walltime runs out, the job hung. A job requires 2-3 minutes of walltime to complete. A job that does not finish in 5 minutes has most likely hung and won't progress.
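When running dozens of jobs, the pass/fail criterion above (presence of "All Done" in the output file) can be checked mechanically. The following is a minimal sketch and is not part of the attached reproducer; the function name is hypothetical, and the output file path is whatever the job script writes to.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns true if any line of the file at `path`
 * contains the marker "All Done". A missing or unreadable file counts
 * as "not completed". Lines longer than the buffer are read in pieces,
 * so a marker split across a piece boundary would be missed. */
static bool demo_job_completed(const char *path)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return false;

    char line[4096];
    bool found = false;
    while (fgets(line, sizeof line, fp) != NULL) {
        if (strstr(line, "All Done") != NULL) {
            found = true;
            break;
        }
    }
    fclose(fp);
    return found;
}
```

Calling this on each job's output file after the walltime limit has passed separates completed runs from suspected hangs.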