UCX error in OpenMPI-4.1.1 foss-2021a build #756

Open
connorourke opened this issue Nov 10, 2021 · 5 comments
@connorourke commented Nov 10, 2021

During the build of OpenMPI-4.1.1 with the foss-2021a toolchain I get the following error:

[1636483897.276373] [ip-AC125812:109544:0]          mm_ep.c:154  UCX  ERROR mm ep failed to connect to remote FIFO id 0xc00000084001abe5: Shared memory error
[ip-AC125812:109544] pml_ucx.c:419  Error: ucp_ep_create(proc=0) failed: Shared memory error
[1636483897.280964] [ip-AC125812:109542:0]       mm_posix.c:194  UCX  ERROR open(file_name=/proc/109541/fd/33 flags=0x0) failed: No such file or directory
[1636483897.281006] [ip-AC125812:109542:0]          mm_ep.c:154  UCX  ERROR mm ep failed to connect to remote FIFO id 0xc00000084001abe5: Shared memory error
[ip-AC125812:109542] pml_ucx.c:419  Error: ucp_ep_create(proc=0) failed: Shared memory error
[1636483897.281576] [ip-AC125812:109543:0]       mm_posix.c:194  UCX  ERROR open(file_name=/proc/109541/fd/33 flags=0x0) failed: No such file or directory
[1636483897.281602] [ip-AC125812:109543:0]          mm_ep.c:154  UCX  ERROR mm ep failed to connect to remote FIFO id 0xc00000084001abe5: Shared memory error
[ip-AC125812:109543] pml_ucx.c:419  Error: ucp_ep_create(proc=0) failed: Shared memory error
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[ip-AC125812:109544] *** An error occurred in MPI_Init
[ip-AC125812:109544] *** reported by process [2187460609,3]
[ip-AC125812:109544] *** on a NULL communicator
[ip-AC125812:109544] *** Unknown error
[ip-AC125812:109544] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[ip-AC125812:109544] ***    and potentially your MPI job)
[ip-AC125812:109519] 2 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[ip-AC125812:109519] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[ip-AC125812:109519] 2 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
) (at easybuild/framework/easyblock.py:3311 in _sanity_check_step)
== 2021-11-09 18:51:42,522 build_log.py:265 INFO ... (took 17 secs)
== 2021-11-09 18:51:42,522 filetools.py:1971 INFO Removing lock /scratch/cor22/bin/BUILD/EB/janus_easybuild/instances/hbv2/2021a/software/.locks/_scratch_cor22_bin_BUILD_EB_janus_easybuild_instances_hbv2_2021a_software_OpenMPI_4.1.1-GCC-10.3.0.lock...
== 2021-11-09 18:51:42,530 filetools.py:380 INFO Path /scratch/cor22/bin/BUILD/EB/janus_easybuild/instances/hbv2/2021a/software/.locks/_scratch_cor22_bin_BUILD_EB_janus_easybuild_instances_hbv2_2021a_software_OpenMPI_4.1.1-GCC-10.3.0.lock successfully removed.
== 2021-11-09 18:51:42,530 filetools.py:1975 INFO Lock removed: /scratch/cor22/bin/BUILD/EB/janus_easybuild/instances/hbv2/2021a/software/.locks/_scratch_cor22_bin_BUILD_EB_janus_easybuild_instances_hbv2_2021a_software_OpenMPI_4.1.1-GCC-10.3.0.lock
== 2021-11-09 18:51:42,530 easyblock.py:3915 WARNING build failed (first 300 chars): Sanity check failed: sanity check command mpirun -n 4 /scratch/cor22/bin/BUILD/EB/janus_easybuild/instances/hbv2/2021a/build/OpenMPI/4.1.1/GCC-10.3.0/mpi_test_hello_usempi exited with code 1 (output: [1636483897.276312] [ip-AC125812:109544:0]       mm_posix.c:194  UCX  ERROR open(file_name=/proc/109
== 2021-11-09 18:51:42,531 easyblock.py:307 INFO Closing log for application name OpenMPI version 4.1.1

It looks like the problem is UCX trying to open a /proc/<pid>/fd path that no longer exists.
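A sketch of how the failing sanity check could be rerun by hand with more verbose UCX output (the binary path is the one from the EasyBuild log above; UCX_LOG_LEVEL and UCX_TLS are standard UCX tuning variables, but treating posix,self as the minimal transport set to isolate the mm/posix failure is an assumption):

  # Rerun the failing sanity-check binary directly, with verbose UCX logging
  UCX_LOG_LEVEL=debug mpirun -n 4 \
    /scratch/cor22/bin/BUILD/EB/janus_easybuild/instances/hbv2/2021a/build/OpenMPI/4.1.1/GCC-10.3.0/mpi_test_hello_usempi

  # Restrict UCX to the POSIX shared-memory transport (plus self) to see whether only that path fails
  UCX_TLS=posix,self mpirun -n 4 \
    /scratch/cor22/bin/BUILD/EB/janus_easybuild/instances/hbv2/2021a/build/OpenMPI/4.1.1/GCC-10.3.0/mpi_test_hello_usempi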

Has anyone seen this error, and does anyone know of a fix?

boegel added this to the 4.x milestone Nov 10, 2021
@boegel (Member) commented Nov 24, 2021

@connorourke This looks a lot like the problem reported upstream at openucx/ucx#4224.

Are you running in a user namespace?
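(A quick, generic way to check, in case it is useful to anyone else hitting this: compare your shell's user namespace with the initial one and look at the uid map. These are plain procfs checks, nothing EasyBuild-specific.)

  # In the initial user namespace this prints "0 0 4294967295"; anything else means a user namespace
  cat /proc/self/uid_map

  # Namespace inode of the current shell; compare with that of PID 1 (reading PID 1's usually needs root)
  readlink /proc/self/ns/user
  sudo readlink /proc/1/ns/user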

@connorourke (Author)

Nope - not running in a user namespace @boegel.

@vanzod (Member) commented Feb 16, 2022

@boegel I just hit the exact same issue with the foss-2021b toolchain. The strange thing is that it happens on certain machines while on others it builds smoothly.

@connorourke On which hardware were you trying to build it?

@connorourke (Author)

It was on an AMD EPYC 7V13 (Milan).

@hezhiqiang8909

== FAILED: Installation ended unsuccessfully (build directory: /public/software/.local/easybuild/build/OpenMPI/4.1.1/GCC-10.3.0): build failed (first 300 chars): Sanity check failed: sanity check command OMPI_MCA_rmaps_base_oversubscribe=1 mpirun -n 4 /public/software/.local/easybuild/build/OpenMPI/4.1.1/GCC-10.3.0/mpi_test_hello_c exited with code 1
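(Not a verified fix, but given the upstream openucx/ucx#4224 report linked above, two commonly suggested workarounds are sketched here. UCX_POSIX_USE_PROC_LINK is a UCX option that stops the posix transport from opening the segment via /proc/<pid>/fd; the second command sidesteps UCX entirely by forcing Open MPI's ob1 PML with the vader shared-memory BTL. ./mpi_test_hello_c stands in for the sanity-check binary path from the logs above; whether either workaround applies to this particular build is an assumption.)

  # Workaround sketch 1: keep UCX but avoid the /proc/<pid>/fd open path
  UCX_POSIX_USE_PROC_LINK=n mpirun -n 4 ./mpi_test_hello_c

  # Workaround sketch 2: bypass the UCX PML for the sanity check
  mpirun -n 4 --mca pml ob1 --mca btl self,vader ./mpi_test_hello_c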
