Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hang in one-sided communications across multiple nodes #7118

Open
victor-anisimov opened this issue Aug 27, 2024 · 23 comments
Open

Hang in one-sided communications across multiple nodes #7118

victor-anisimov opened this issue Aug 27, 2024 · 23 comments
Labels

Comments

@victor-anisimov
Copy link

victor-anisimov commented Aug 27, 2024

reproducer.tgz

Once-sided communications between GPU pointers intermittently hang on Aurora. The attached reproducer runs on 96 nodes using six 16-node subcommunicators. One needs to run about 10-50 jobs in order to catch the hang. Successful job contains "All Done" in the output file. If that string is absent after the time runs out, it indicates the hang. The job requires 2-3 minutes of walltime to complete. The job that does not finish in 5 min most likely hangs and won't progress.

@hzhou
Copy link
Contributor

hzhou commented Aug 28, 2024

@victor-anisimov I built a debug version of mpich on Aurora at /home/hzhou/pull_requests/mpich-debug/_inst. Could you try link with that and run the reproducer while setting MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=1? The goal is to generate a backtrace as shown in #7084

I am not sure the easiest way to switch libmpi.so. Try

export LD_LIBRARY_PATH=/home/hzhou/pull_requests/mpich-debug/_inst/lib:$LD_LIBRARY_PATH

and verify with ldd -- [EDIT: that likely won't work due to so version mismatch]. If that doesn't work, try

LD_PRELOAD=/home/hzhou/pull_requests/mpich-debug/_inst/lib/libmpi.so.0

@victor-anisimov
Copy link
Author

victor-anisimov commented Aug 28, 2024

Thank you for the suggestion, @hzhou ! I linked the reproducer code against your version of MPICH by doing
module unload mpich
export PATH=${PATH}:/home/hzhou/pull_requests/mpich-debug/_inst/bin
mpif90 -fc=ifx -O0 -g -i4 -r8 -what -fiopenmp -fopenmp-targets=spir64 test.F90
ldd ./a.out
libmpifort.so.0 => /home/hzhou/pull_requests/mpich-debug/_inst/lib/libmpifort.so.0 (0x00007f017ddf7000)
libmpi.so.0 => /home/hzhou/pull_requests/mpich-debug/_inst/lib/libmpi.so.0 (0x00007f017b486000)
./bulk.sh

The test job crashed at initialization with the error message:
forrtl: severe (71): integer divide by zero
Image PC Routine Line Source
libpthread-2.31.s 00001460507998C0 Unknown Unknown Unknown
a.out 000000000040C7E2 Unknown Unknown Unknown
a.out 000000000040C47D Unknown Unknown Unknown
libc-2.31.so 00001460505C324D __libc_start_main Unknown Unknown
a.out 000000000040C3AA Unknown Unknown Unknown

I guess the MPICH build configuration causes this crash. Could you take the reproducer, which is very simple, and try to run the test?

reproducer6.tgz

@hzhou
Copy link
Contributor

hzhou commented Aug 29, 2024

module unload mpich

Make sure to set PALS_PMI=pmix if you unload mpich module.

while setting MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=1

Make it MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=10. 1 second timeout was too short. 10 second seems to be sufficient for normal progress without hung.

Could you take the reproducer,

Running 100 loops now (or until my 2 hour walltime kicks me out), will update.

@hzhou
Copy link
Contributor

hzhou commented Aug 29, 2024

I think I got some timeouts, need confirm with larger MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT -

...
Rank 0 conducted      1148200 one-sided messages
Rank 0 conducted      1148300 one-sided messages
3 pending requests in pool 0
    ac000004: MPIDI_OFI_do_irecv: source=4, tag=14, data_sz=0
/home/hzhou/MPI/lib/libmpi.so.0(+0xc1bad4) [0x14ab09776ad4]
/home/hzhou/MPI/lib/libmpi.so.0(+0x954a0b) [0x14ab094afa0b]
/home/hzhou/MPI/lib/libmpi.so.0(+0x9557cd) [0x14ab094b07cd]
/home/hzhou/MPI/lib/libmpi.so.0(+0x955167) [0x14ab094b0167]
/home/hzhou/MPI/lib/libmpi.so.0(+0x802a58) [0x14ab0935da58]
/home/hzhou/MPI/lib/libmpi.so.0(+0x6bf95b) [0x14ab0921a95b]
/home/hzhou/MPI/lib/libmpi.so.0(+0x6d1866) [0x14ab0922c866]
/home/hzhou/MPI/lib/libmpi.so.0(+0x7977d8) [0x14ab092f27d8]
/home/hzhou/MPI/lib/libmpi.so.0(+0x797aa3) [0x14ab092f2aa3]
/home/hzhou/MPI/lib/libmpi.so.0(+0x7caee0) [0x14ab09325ee0]
/home/hzhou/MPI/lib/libmpi.so.0(+0x7ca90e) [0x14ab0932590e]
/home/hzhou/MPI/lib/libmpi.so.0(+0x7cac96) [0x14ab09325c96]
/home/hzhou/MPI/lib/libmpi.so.0(+0x797f97) [0x14ab092f2f97]
/home/hzhou/MPI/lib/libmpi.so.0(+0x797c72) [0x14ab092f2c72]
/home/hzhou/MPI/lib/libmpi.so.0(+0x4db4dc) [0x14ab090364dc]
/home/hzhou/MPI/lib/libmpi.so.0(+0x4da580) [0x14ab09035580]
/home/hzhou/MPI/lib/libmpi.so.0(+0x4da3f9) [0x14ab090353f9]
/home/hzhou/MPI/lib/libmpi.so.0(+0x4da1cc) [0x14ab090351cc]
/home/hzhou/MPI/lib/libmpi.so.0(MPI_Win_fence+0x19) [0x14ab09034e69]
/home/hzhou/MPI/lib/libmpifort.so.0(mpi_win_fence_f08_+0x2d) [0x14ab0b5898ad]
./a.out() [0x412013]
./a.out() [0x40c4fd]
/lib64/libc.so.6(__libc_start_main+0xef) [0x14ab0782d24d]
./a.out() [0x40c42a]
...
Abort(546498063) on node 864: Fatal error in internal_Win_fence: Other MPI error, error stack:
internal_Win_fence(83).....................: MPI_Win_fence(assert=0, win=0xa0000000) failed
MPID_Win_fence(124)........................:
MPID_Barrier(167)..........................:
MPIDI_Barrier_allcomm_composition_json(132):
MPIDI_Barrier_intra_composition_alpha(184).:
MPIDI_NM_mpi_barrier(41)...................:
MPIR_Barrier_impl(105).....................:
MPIR_Barrier_allcomm_auto(53)..............:
MPIR_Barrier_intra_recexch(19).............:
MPIR_Allreduce_intra_recexch(238)..........:
MPIC_Waitall(570)..........................:
MPIR_Waitall(907)..........................:
MPIR_Waitall_state(809)....................: Operation timed out
Abort(546498063) on node 876: Fatal error in internal_Win_fence: Other MPI error, error stack:
internal_Win_fence(83).....................: MPI_Win_fence(assert=0, win=0xa0000000) failed
MPID_Win_fence(124)........................:
MPID_Barrier(167)..........................:
MPIDI_Barrier_allcomm_composition_json(132):
MPIDI_Barrier_intra_composition_alpha(184).:
MPIDI_NM_mpi_barrier(41)...................:
MPIR_Barrier_impl(105).....................:
MPIR_Barrier_allcomm_auto(53)..............:
MPIR_Barrier_intra_recexch(19).............:
MPIR_Allreduce_intra_recexch(238)..........:
MPIC_Waitall(570)..........................:
MPIR_Waitall(907)..........................:
MPIR_Waitall_state(809)....................: Operation timed out
...

It is stuck in a internal Barrier.

@hzhou
Copy link
Contributor

hzhou commented Aug 29, 2024

 grep 'All Done' job*.out
job10.out:All Done
job11.out:All Done
job12.out:All Done
job13.out:All Done
job14.out:All Done
job16.out:All Done
job17.out:All Done
job18.out:All Done
job19.out:All Done
job1.out:All Done
job21.out:All Done
job22.out:All Done
job25.out:All Done
job26.out:All Done
job27.out:All Done
job28.out:All Done
job29.out:All Done
job2.out:All Done
job30.out:All Done
job31.out:All Done
job32.out:All Done
job33.out:All Done
job34.out:All Done
job35.out:All Done
job36.out:All Done
job37.out:All Done
job38.out:All Done
job39.out:All Done
job3.out:All Done
job40.out:All Done
job41.out:All Done
job42.out:All Done
job4.out:All Done
job6.out:All Done
job7.out:All Done
job8.out:All Done
job9.out:All Done
job.out:All Done

Timed out in job 5, 23, and 24.

@victor-anisimov
Copy link
Author

I can run the test and am able to reproduce the backtrace after setting

module unload mpich
export PALS_PMI=pmix
export MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=120

Any idea what might have caused the hang in internal_Win_fence?

@raffenet
Copy link
Contributor

raffenet commented Aug 29, 2024

I can run the test and am able to reproduce the backtrace after setting

module unload mpich export PALS_PMI=pmix export MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=120

Any idea what might have caused the hang in internal_Win_fence?

Previous investigation was that one of the target processes somehow gets overwhelmed by MPI_GET requests and basically stops being able to send responses to the origin. We suspected a libfabric/cxi issue, but it was not clear why it would hang rather than report an error and crash. It's possible that #7117 may improve the error reporting in this case.

@hzhou
Copy link
Contributor

hzhou commented Aug 29, 2024

Thanks @raffenet for the potential clue. I can see how that (single process get overwhelmed due to lower layer failure or congestion) may cause the issue.

@victor-anisimov
Copy link
Author

victor-anisimov commented Aug 30, 2024

I tried the latest commit b3480dd, which includes #7117, built by @colleeneb on Aurora. The behavior of the test is unchanged. I still get 8 hangs per 50 runs of the test. All 8 jobs got killed by the scheduler due to exceeding the walltime limit. None of the 8 hanging jobs include any error or warning messages from MPICH. My conclusion is that either I do not know how to properly run the MPICH to get those forward progress errors printed out or the fixes do not work for the hang in one-sided communications. Could @hzhou or @raffenet try using the latest MPICH with the one-sided bug reproducer to see if the forward progress fixes work for the present case?

@hzhou
Copy link
Contributor

hzhou commented Aug 30, 2024

The one built by Colleen doesn't have --enable-g=progress on I believe. We suspect your case is due to the underlying network cannot handle certain congestion situations yet not raising an error. At this point, it is just guess. I need think about how to further narrow down the reproducer. Currently, a 1/20 chance thus hours of time for a single reproducer is a bit difficult to work with.

@colleeneb
Copy link
Collaborator

Ah, thanks, it does not! I just used ./configure --prefix=$PWD/../install_main --with-device=ch4:ofi --with-libfabric=/opt/cray/libfabric/1.15.2.0 --with-ze --with-pm=no --with-pmi=pmix --with-pmix=/usr CC=icx CXX=icpx FC=ifx. I will add --enable-g=progress to the configure line to try! Is there anything else to change in the build to test?

@hzhou
Copy link
Contributor

hzhou commented Aug 30, 2024

Ah, thanks, it does not! I just used ./configure --prefix=$PWD/../install_main --with-device=ch4:ofi --with-libfabric=/opt/cray/libfabric/1.15.2.0 --with-ze --with-pm=no --with-pmi=pmix --with-pmix=/usr CC=icx CXX=icpx FC=ifx. I will add --enable-g=progress to the configure line to try! Is there anything else to change in the build to test?

@colleeneb Only do that for debug build (vs. performance build). For debug build, you may also want to use --enable-g=dbg,log,progress

@iziemba
Copy link

iziemba commented Nov 6, 2024

I ran the ANL-116 reproducer on Aurora and with an instrumented libfabric. I verified that the remote key MPICH is passing into libfabric RMA operations is not a key generated by the libfabric CXI provider. There appears to be an MPICH bug resulting in MPICH providing an invalid remote key which can result in hangs and failed RMA operations.

@raffenet
Copy link
Contributor

raffenet commented Nov 6, 2024

I ran the ANL-116 reproducer on Aurora and with an instrumented libfabric. I verified that the remote key MPICH is passing into libfabric RMA operations is not a key generated by the libfabric CXI provider. There appears to be an MPICH bug resulting in MPICH providing an invalid remote key which can result in hangs and failed RMA operations.

@iziemba Invalid in what way? I see that user-provided keys are limited to 4-bytes in the fi_cxi manpage. It does look like MPICH could generate a key larger than that the way to code is setup today. But looking at the error checks, I would expect MPICH to get -FI_EKEYREJECTED from fi_mr_reg and return it to the user rather than proceed and hang.

@raffenet
Copy link
Contributor

raffenet commented Nov 6, 2024

I tried adding a dummy registration to my MPICH build just to see what happens. Unless I am missing something, it looks like libfabric will not complain about the invalid key, so we need to implement some verification inside MPICH itself. Here is the code for the registration I tried:

    int ret;
    uint64_t key = UINT64_MAX;
    void *buf = MPL_malloc(16, MPL_MEM_OTHER);
    struct fid_mr *mr;
    ret = fi_mr_reg(MPIDI_OFI_global.ctx[0].domain,
                   buf, 16,
                   FI_REMOTE_READ, 0,
                   key, 0, &mr, NULL);
    printf("ret = %d, %s\n", ret, fi_strerror(ret));

and here is the output 😕

ret = 0, Success

@raffenet
Copy link
Contributor

raffenet commented Nov 6, 2024

More info after building my own libfabric and adding debug printfs, the provider thinks the requested key is a provider key for some reason and therefore thinks it is valid. If I enable MPIR_CVAR_CH4_OFI_ENABLE_MR_PROV_KEY=1 then the provider actually skips the validation step altogether because of the domain being setup to use prov keys exclusively, I believe.

@raffenet
Copy link
Contributor

raffenet commented Nov 6, 2024

So when the user passes in an invalid requested key, it may accidentally cause cxi to think it is a provider key and take the cxip_key.is_prov branch and somehow pass the validation there...

bool cxip_generic_is_valid_mr_key(uint64_t key)
{
        struct cxip_mr_key cxip_key = {
                .raw = key,
        };

        if (cxip_key.is_prov)
                return cxip_is_valid_prov_mr_key(key);

        return cxip_is_valid_mr_key(key);
}

@raffenet
Copy link
Contributor

raffenet commented Nov 6, 2024

After looking closer, I no longer thing MPICH will generate a key larger than 4 bytes in this scenario. Must be something else.

@iziemba
Copy link

iziemba commented Nov 6, 2024

For user keys, I can update CXI provider to have check for RKEY size and ensure it is <= 4 bytes.

FI_MR_PROV_KEY key size is 8 bytes. When running with MPIR_CVAR_CH4_OFI_ENABLE_MR_PROV_KEY=1, I verified that libfabric CXI provider was returning expected generated RKEYs. Then, I see MPICH use an invalid key. Example:

libfabric:130635:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c0s5b0n0: MR enabled: 0x4f40f70 (key: 0xA000DD743707002F)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x4c37c10 (key: 0xA000EF2705D8004D)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x5894a10 (key: 0xA000A22111480016)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x58947b0 (key: 0xA000A2211149001B)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x367ec40 (key: 0xA000EF2705D9003C)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x367e870 (key: 0xA000EF2705DA0028)
libfabric:125040:1730402622::cxi:core:cxip_rma_common():609<warn> x4414c7s5b0n0: TXC (0x35f41:0):: Invalid remote key: 0x455346464f5f4441
libfabric:130635:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c0s5b0n0: MR enabled: 0x4e62670 (key: 0xA000DD7437080029)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x5894c70 (key: 0xA000A221114A0024)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x5894ed0 (key: 0xA000A221114B0037)
libfabric:130635:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c0s5b0n0: MR enabled: 0x50c8b30 (key: 0xA000DD7437090027)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x4c8df10 (key: 0xA000EF2705DB0040)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x436bdf0 (key: 0xA000EF2705DC0036)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x5895390 (key: 0xA000A221114C0009)

@raffenet
Copy link
Contributor

raffenet commented Nov 6, 2024

libfabric:125040:1730402622::cxi:core:cxip_rma_common():609<warn> x4414c7s5b0n0: TXC (0x35f41:0):: Invalid remote key: 0x455346464f5f4441

Hm, OK. The warning is from the RMA origin side. So we need to trace keys from creation at the target -> usage at the origin and see if there's corruption or something.

@hzhou
Copy link
Contributor

hzhou commented Nov 10, 2024

Potential work around in #7202

@colleeneb
Copy link
Collaborator

Update: 7202 didn’t help but turning HMEM on did, although it led to the device running out of memory

@raffenet
Copy link
Contributor

Need to investigate the out of memory issue and see if it can be resolved in MPICH or libfabric.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants