Bus error while running ilamb with mpiexec on access-med-0.6 #98

Open
rhaegar325 opened this issue Aug 21, 2024 · 2 comments

Comments


rhaegar325 commented Aug 21, 2024

Hi @dsroberts, @rbeucher,
Today an error occurred while I was running ilamb with mpiexec on access-med-0.6. The details are below. (This run had 24 processes, so much of the output is repeated.)

Loading conda/access-med-0.6
  Loading requirement: singularity
Currently Loaded Modulefiles:
 1) singularity   2) conda/access-med-0.6  
[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_MCAST] No MCAST components selected

[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_ML] Failure in hcoll_mcast_base_select
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[gadi-cpu-clx-2405:1119691:0:1119691] Caught signal 7 (Bus error: nonexistent physical address)
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
==== backtrace (tid:1119691) ====
 0 0x0000000000012d20 __funlockfile()  :0
 1 0x00000000000a8186 NC4_def_var()  ???:0
 2 0x000000000003f2ad nc_def_var()  ???:0
 3 0x00000000000b75b3 __pyx_pw_7netCDF4_8_netCDF4_8Variable_1__init__()  _netCDF4.c:0
 4 0x000000000013ddbb type_call()  :0
 5 0x000000000002288f __Pyx_PyObject_Call()  _netCDF4.c:0
 6 0x0000000000040928 __pyx_pw_7netCDF4_8_netCDF4_7Dataset_47createVariable()  _netCDF4.c:0
 7 0x00000000001445a6 cfunction_call()  :0
 8 0x000000000013da6b _PyObject_MakeTpCall.localalias()  :0
 9 0x0000000000139c53 _PyEval_EvalFrameDefault()  ???:0
10 0x0000000000150582 method_vectorcall()  :0
11 0x00000000001358fa _PyEval_EvalFrameDefault()  ???:0
12 0x0000000000144a2c _PyFunction_Vectorcall()  ???:0
13 0x0000000000134c5c _PyEval_EvalFrameDefault()  ???:0
14 0x0000000000144a2c _PyFunction_Vectorcall()  ???:0
15 0x0000000000134850 _PyEval_EvalFrameDefault()  ???:0
16 0x00000000001d7c60 _PyEval_Vector()  :0
17 0x00000000001d7ba7 PyEval_EvalCode()  ???:0
18 0x000000000020812a run_eval_code_obj()  :0
19 0x0000000000203523 run_mod()  :0
20 0x000000000009a6f5 pyrun_file.cold()  :0
21 0x00000000001fd9fe _PyRun_SimpleFileObject.localalias()  :0
22 0x00000000001fd594 _PyRun_AnyFileObject.localalias()  :0
23 0x00000000001fa78b Py_RunMain.localalias()  :0
24 0x00000000001cb1f7 Py_BytesMain()  ???:0
25 0x000000000003a7e5 __libc_start_main()  ???:0
26 0x00000000001cb0f1 _start()  ???:0
=================================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[gadi-cpu-clx-2405:1119698:0:1119698] Caught signal 7 (Bus error: nonexistent physical address)
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
BFD: Dwarf Error: Can't find .debug_ranges section.
--------------------------------------------------------------------------
mpiexec noticed that process rank 9 with PID 0 on node gadi-cpu-clx-2405 exited on signal 7 (Bus error).
--------------------------------------------------------------------------

The same script works fine on /g/data/hh5/public/modules/conda_concept/analysis3, so I think this might be an issue with the module.


dsroberts commented Sep 19, 2024

Hi @rhaegar325, apologies for the delayed response. Since I've left CLEX, the containerised conda environments that this is based on are no longer supported. That being said, I'm applying this approach to another application and have encountered this error (not the bus error, but the "ml_discover_hierarchy exited with error" message). The error can be made to go away by disabling HCOLL (export OMPI_MCA_coll=^hcoll), but the circumstances under which it appears seem to be quite specific. As far as I can tell, this error is specific to using mpi4py in a container. I'm yet to come up with a small reproducer, but I'll keep you up to date on progress.
Edit: it turns out this happens outside of containerised environments too; import mpi4py alone is enough to reproduce it when run in parallel.
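
For anyone who wants to try the workaround from inside a script rather than the job script, here is a minimal sketch. It assumes Open MPI reads OMPI_MCA_* variables from the process environment when MPI is initialised, and that mpi4py initialises MPI on import of mpi4py.MPI (its default behaviour). The helper name disable_hcoll and the file name check_hcoll.py are made up for illustration. Note that it overwrites any existing OMPI_MCA_coll value, and that it targets the ml_discover_hierarchy messages; whether it has any effect on the bus error is not established in this thread.

```python
import os


def disable_hcoll(env=os.environ):
    """Exclude the hcoll collective component (hypothetical helper).

    Equivalent to `export OMPI_MCA_coll=^hcoll` in the job script.
    Must be called before MPI is initialised to have any effect.
    """
    env["OMPI_MCA_coll"] = "^hcoll"
    return env


def main():
    disable_hcoll()
    try:
        # Importing mpi4py.MPI initialises MPI by default, so the
        # environment variable has to be set before this line runs.
        from mpi4py import MPI
    except ImportError:
        print("mpi4py is not installed; nothing to check")
        return
    comm = MPI.COMM_WORLD
    print(f"rank {comm.Get_rank()} of {comm.Get_size()} initialised")


if __name__ == "__main__":
    main()
```

Saved as check_hcoll.py, this could be launched with something like "mpiexec -n 24 python check_hcoll.py" and the output compared against a run without the disable_hcoll() call to see whether the [LOG_CAT_ML] messages disappear.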

@rbeucher
Member

Thanks @dsroberts. The containerised environment has proven quite helpful and we hope we can carry on with it.
