Numerous run time issues on Betzy login3 #573

Open
mvertens opened this issue Oct 9, 2024 · 16 comments

Comments

@mvertens

mvertens commented Oct 9, 2024

This is a beginning placeholder for the numerous issues that have occurred on Betzy as part of the OS upgrade. Currently only login3 is available, so these have all occurred there.

From @mvertens:
Currently this is all using the noresm2_5_alpha06 code base that was just created last week.
There are two separate errors I encountered, both of which I reported to Sigma2.

  1. the UCX error that led to a timeout.

==== backtrace (tid: 40171) ====
 0 0x000000000005e810 uct_ud_ep_deferred_timeout_handler()  /build-result/src/hpcx-v2.14-gcc-MLNX_OFED_LINUX-5-redhat9-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/uct/ib/ud/base/ud_ep.c:278
 1 0x000000000004fd37 ucs_callbackq_slow_proxy()  /build-result/src/hpcx-v2.14-gcc-MLNX_OFED_LINUX-5-redhat9-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/ucs/datastruct/callbackq.c:404
 2 0x000000000004881a ucs_callbackq_dispatch()  /build-result/src/hpcx-v2.14-gcc-MLNX_OFED_LINUX-5-redhat9-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc

There seems to be an outstanding issue on this here: openucx/ucx#5159
Sigma2 suggests moving to the intel/2023a toolchain (currently @mvdebolskiy is working on this), but it might be that Sigma2 needs to upgrade their OpenUCX.

  2. In a totally different experimental configuration, I obtained the following error:
    [LOG_CAT_MLB] Registration of 0 network context failed. Don't use HCOLL
    [LOG_CAT_MLB] Failed to grow mlb dynamic manager
    [LOG_CAT_MLB] Payload allocation failed
    [LOG_CAT_BASESMUMA] Failed to shmget with IPC_PRIVATE, size 20971520, IPC_CREAT; errno 28:No space left on device
    [LOG_CAT_MLB] Registration of 0 network context failed. Don't use HCOLL
    [LOG_CAT_MLB] Failed to grow mlb dynamic manager

In this case the solution was to set the environment variable OMPI_MCA_coll_hcoll_enable to 0 (see the snippet below).
Sigma2 has a fix for (2), which requires the 2023a toolchain that Matvey is working on.
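For a quick one-off test outside the XML machine config, the variable can also be exported in the shell or job script before the model launch (a minimal sketch; whether it propagates into the batch job depends on the Slurm/CIME setup):

    # disable HCOLL collectives in Open MPI for this run
    export OMPI_MCA_coll_hcoll_enable=0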

I am not sure that updating to the 2023a toolchain will fix (1).
I think we should try the new toolchain once @mvdebolskiy is ready with the update and see if (1) occurs again.

@blcc
Contributor

blcc commented Oct 10, 2024

The NorESM job does not stop when an error occurs, since the tasks exit without sending an error to the MPI library.
There is an srun argument, '-K1' (or --kill-on-bad-exit=1), that can stop the job when any task exits with an error.
It does not solve the underlying problem, but it is perhaps useful here.
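For reference, a minimal sketch of what that looks like on the srun command line (the executable name is illustrative):

    # abort the whole job step as soon as any task exits with a non-zero code;
    # --label prefixes each output line with the task rank
    srun --kill-on-bad-exit=1 --label ./cesm.exe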

@mvertens
Author

What you want to do is the following:

  1. In your noresm2_5_alpha06 sandbox, edit the following file:
    $SRCROOT/ccs_config/machines/betzy/config_machines.xml
    and make the following changes:
-    <executable>srun</executable> (old)
+    <executable>srun --kill-on-bad-exit --label</executable> (new)
-    <env name="OMPI_MCA_coll_hcoll_enable">1</env> (old)
+    <env name="OMPI_MCA_coll_hcoll_enable">0</env> (new)

You can see the documentation for srun on betzy just using srun --help

  2. In your $CASEROOT (if you have ALREADY created one), run:
    ./case.setup --reset
    ./case.build
    ./case.submit

I think this can also be applied to noresm2.1 and noresm2.3 versions.
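If you want to confirm the changes took effect before submitting, CIME's preview_run script in the case directory should print the batch submit command, the run command and the MPI environment settings (this is a standard CIME case tool; the exact output format may differ between versions):

    cd $CASEROOT
    ./preview_run    # after case.setup: check that srun --kill-on-bad-exit --label and OMPI_MCA_coll_hcoll_enable=0 show up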

@IngoBethke
Contributor

IngoBethke commented Oct 12, 2024

I have started running NorESM2-LM (release-noresm2.0.8) tests using the 2023 compiler/library versions (which load hpcx/2.20 instead of the hpcx/2.14 loaded by the 2022 versions) to see if that can help avoid the UCX errors.

First, I tried 2023a using
<command name="load">netCDF-Fortran/4.6.1-iompi-2023a</command>
but that consistently resulted in PIO errors during model startup of the type:
box_rearrange::compute_dest:: ERROR: no destination found for compdof= 6949

Then, I tried 2023b (env_mach_specific.xml.gz) using
<command name="load">netCDF-Fortran/4.6.1-iompi-2023b</command>
and now the model starts without PIO errors.

I am currently running a couple of 25-node jobs that each run 5 instances of NorESM2-LM, and one 100-node job (with 25 instances). No UCX errors so far.

@mvdebolskiy

@IngoBethke
I have just run ERS_Ld5 tests, with and without debug, using the default externals for 2.0.9, and they passed.
I am afraid netCDF-Fortran/4.6.1-iompi-2023b won't work with DEBUG=TRUE, since it has HDF5-1.14.3 without a patch for floating point exceptions. So it's either netCDF/4.6.0-iompi-2022a or we need to wait for Sigma2 to update the whole netCDF stack.
I am building a netCDF stack with HDF5-1.14.4 (which does not require a patch for -fpe).

In your env_mach_specific.xml, the lines

     <command name="load">CMake/3.26.3-GCCcore-12.3.0</command>
     <command name="load">Python/3.11.3-GCCcore-12.3.0</command>
     <command name="--ignore_cache load">XML-LibXML/2.0209-GCCcore-12.3.0</command>

will override some of netCDF's dependencies (e.g. bzip2, zlib, Szip), which is not a big deal but might cause some issues later.
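One way to see which dependency versions a given module would pull in is to inspect it with Lmod (assuming Betzy's standard Lmod/EasyBuild setup; the module name is the one discussed above):

    # show the modulefile, including the dependency modules it loads
    module show netCDF-Fortran/4.6.1-iompi-2023b
    # after loading, verify what actually ended up in the environment
    module list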

@jmaerz
Contributor

jmaerz commented Oct 14, 2024

I have a maybe stupid question: could the UCX errors also be node-dependent? I am asking because I have now run a few (short) runs successfully with the very same settings as before, but on different nodes (previously, they crashed reporting errors for nodes b4394 and b4396).

@mvdebolskiy

@jmaerz Sigma2 fixed the UCX error, so old configs should work on all nodes as before.
I am wondering if people are still getting errors with HCOLL, because they do not appear in my tests with iompi-2022a-based modules.

@IngoBethke
Contributor

IngoBethke commented Oct 14, 2024

UPDATE: see the post above by Matvey. @mvdebolskiy, can you post some more info on what Sigma2 did to fix it?

@mvdebolskiy

I have now run over 1000 simulation years using netCDF-Fortran/4.6.1-iompi-2023b and have not encountered a single crash.

If indeed switching from hpcx/2.14 to hpcx/2.20 made the difference, then it could be worth trying to load hpcx/2.20 together with the 2022a library versions:

module --quiet restore system
module load StdEnv
module load netCDF-Fortran/4.6.0-iompi-2022a hpcx/2.20
module load NCO/5.1.9-iomkl-2022a
module load CMake/3.23.1-GCCcore-11.3.0
module load Python/3.10.4-GCCcore-11.3.0

What do you think?
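To check which hpcx/UCX actually ends up active after such a swap, the standard introspection tools could be used (a sketch; ucx_info ships with UCX and ompi_info with Open MPI):

    module list                 # confirm hpcx/2.20 is loaded alongside the 2022a modules
    ucx_info -v                 # print the UCX version in use
    ompi_info | grep -i hcoll   # see whether the hcoll collective component is available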

In my case, the UCX errors unmistakably occurred in the hpcx/2.14 library:

b2311:236954:0:236954]       ud_ep.c:278  Fatal: UD endpoint 0x1d8bb260 to <no debug data>: unhandled timeout error
 0 0x000000000005e810 uct_ud_ep_deferred_timeout_handler()  /build-result/src/hpcx-v2.14-gcc-MLNX_OFED_LINUX-5-redhat9-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/uct/ib/ud/base/ud_ep.c:278
 1 0x000000000004fd37 ucs_callbackq_slow_proxy()  /build-result/src/hpcx-v2.14-gcc-MLNX_OFED_LINUX-5-redhat9-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/ucs/datastruct/callbackq.c:404
 2 0x000000000004881a ucs_callbackq_dispatch()  /build-result/src/hpcx-v2.14-gcc-MLNX_OFED_LINUX-5-redhat9-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/ucs/datastruct/callbackq.h:211
 3 0x000000000004881a uct_worker_progress()  /build-result/src/hpcx-v2.14-gcc-MLNX_OFED_LINUX-5-redhat9-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/uct/api/uct.h:2768
 4 0x000000000004881a ucp_worker_progress()  /build-result/src/hpcx-v2.14-gcc-MLNX_OFED_LINUX-5-redhat9-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/ucp/core/ucp_worker.c:2799
 5 0x000000000002f994 opal_progress()  /cluster/work/users/vegarde/b5102/OpenMPI/4.1.4/intel-compilers-2022.1.0/openmpi-4.1.4/opal/runtime/opal_progress.c:231
 6 0x000000000005f64b hcoll_ml_progress_impl()  ???:0
 7 0x000000000004d8c1 hmca_coll_ml_parallel_bcast()  ???:0
 8 0x0000000000008252 mca_coll_hcoll_bcast()  /cluster/work/users/vegarde/b5102/OpenMPI/4.1.4/intel-compilers-2022.1.0/openmpi-4.1.4/ompi/mca/coll/hcoll/coll_hcoll_ops.c:59
 9 0x000000000006cd14 PMPI_Bcast()  /cluster/work/users/vegarde/b5102/OpenMPI/4.1.4/intel-compilers-2022.1.0/openmpi-4.1.4/ompi/mpi/c/profile/pbcast.c:114
10 0x0000000000051814 ompi_bcast_f()  /cluster/work/users/vegarde/b5102/OpenMPI/4.1.4/intel-compilers-2022.1.0/openmpi-4.1.4/ompi/mpi/fortran/mpif-h/profile/pbcast_f.c:80
11 0x0000000002764577 shr_mpi_mod_mp_shr_mpi_bcastc0_()  /cluster/projects/nn9039k/people/ingo/noresm2-lesfmip/cime/src/share/util/shr_mpi_mod.F90:604
12 0x00000000004a8558 seq_flds_mod_mp_seq_flds_set_()  /cluster/projects/nn9039k/people/ingo/noresm2-lesfmip/cases/N1850frc2_f19_tn14_LESFMIPhist-nat_temp/SourceMods/src.drv/seq_flds_mod.F90:444 

@mvertens

I tried setting OMPI_MCA_coll_hcoll_enable=0 because the UCX error message mentions hcoll, but it did not avoid the above error in my case. I am therefore unsure about setting OMPI_MCA_coll_hcoll_enable=0 and potentially sacrificing performance if it does not have a positive effect.

@jmaerz

Good question.

In my and Alok's experience, the UCX error was not observed before mid-August, suspiciously close to the installation date of the current hpcx libraries.

That doesn't necessarily mean that a badly configured node or bad interconnect cannot trigger the error. To be on the safe side, my current runs exclude about 20 nodes (somewhat arbitrarily), but b4394 and b4396 are not among them: b2113,b2114,b2115,b2116,b5226,b1216,b1226,b3355,b3356,b3357,b3359,b3379,b3382,b3383,b3338,b3340,b3343,b3344,b3345,b2171,b2172,b2173

But I think I may try running without the node-exclude list in the future.
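For anyone who wants to try the same, node exclusion can be requested directly from Slurm (a sketch; the node list is the one quoted above and would normally go into the job script or batch settings):

    # in the job script header: never allocate the listed nodes
    #SBATCH --exclude=b2113,b2114,b2115,b2116,b5226,b1216,b1226,b3355,b3356,b3357,b3359,b3379,b3382,b3383,b3338,b3340,b3343,b3344,b3345,b2171,b2172,b2173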

@mvdebolskiy

@IngoBethke
Marcin from Sigma2 closed the GitLab ticket with:

Betzy has been updated, you should now manage to run with hcoll enabled.

What I wrote about earlier (ml swap hpcx/2.20) doesn't really do anything

if you are referring to the original uct_ud_ep_deferred_timeout_handler problem, then to my knowledge you should generally not see it with either toolchain. If you do, please report this and I will have to investigate more. So far I assume this was a transient InfiniBand problem, maybe a broken compute node. But if this is reproducible on different compute nodes, it might be more serious than that.

I had problems running on 2022a last Thursday, but the tests with just the release-noresm2.0.9 checkout worked.
Also, Ingo, if you have timing files from old cases, can you see if there is any significant slowdown?

@IngoBethke
Contributor

@mvdebolskiy
Thanks for sharing. It will be interesting to see if the ucx errors are gone for good or still occur (with hpcx/2.14 and/or hpcx/2.20).

My latest simulations ran 5-10% slower on 11 October but otherwise at a steady pace during 5-14 October. In comparison, my simulations performed in August and September using the 2020 libraries were about 10-15% slower than my latest ones. So my experience is that my simulations are running a bit faster now.

You can check my timing files in /cluster/projects/nn9039k/people/ingo/noresm2-lesfmip/cases/NHISTfrc2_f19_tn14_LESFMIPhist-all/NHISTfrc2_f19_tn14_LESFMIPhist-all_001/timing and /cluster/projects/nn9039k/people/ingo/noresm2-lesfmip/cases/N1850frc2_f19_tn14_LESFMIPhist-nat/N1850frc2_f19_tn14_LESFMIPhist-nat_001/timing. I can give you access to the nn9039k project space.

@mvdebolskiy

I have access. I will check.

@adagj
Contributor

adagj commented Oct 16, 2024

The NorESM job does not stop when an error occurs, since the tasks exit without sending an error to the MPI library. There is an srun argument, '-K1' (or --kill-on-bad-exit=1), that can stop the job when any task exits with an error. It does not solve the underlying problem, but it is perhaps useful here.

My noresm_2_5_alpha06 simulation crashed yesterday evening, but I didn't realize it until this afternoon when I checked the run directory. The job appeared to be running correctly, as it was still listed in squeue without any issues. However, upon reviewing the cesm.log, I found the following error:

libesmf.so   000014EEAA8059BA  _ZN5ESMCI3VMK5ent  Unknown  Unknown
libesmf.so   000014EEAA81E33E  _ZN5ESMCI2VM5ente  Unknown  Unknown
libesmf.so   000014EEAA235CB1  c_esmc_ftablecall  Unknown  Unknown
libesmf.so   000014EEAAAAA2AB  esmf_compmod_mp_e  Unknown  Unknown
libesmf.so   000014EEAACE71C1  esmf_gridcompmod_  Unknown  Unknown
cesm.exe     000000000043B0B8  MAIN__             141      esmApp.F90
cesm.exe     0000000000429922  Unknown            Unknown  Unknown
libc.so.6    000014EEA8A3FEB0  Unknown            Unknown  Unknown
libc.so.6    000014EEA8A3FF60  __libc_start_main  Unknown  Unknown
cesm.exe     0000000000429825  Unknown            Unknown  Unknown
srun: error: b2224: tasks 1792-1916,1918-1919: Exited with exit code 1
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
<ter/work/users/adagj/noresm/n1850.ne30_tn14.hybrid_fatessp.202401007/run/cesm.log.1012089.241014-204922"
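One way to notice this situation earlier (tasks have already exited while the job is still listed as running) is to query Slurm's accounting for the individual job steps; a sketch, using the job id taken from the log file name above:

    # show the state and exit code of each step of the job
    sacct -j 1012089 --format=JobID,JobName,State,ExitCode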

@jmaerz
Contributor

jmaerz commented Oct 17, 2024

I am getting strange messages printed to the terminal from which I submitted the job:

 kernel:watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [fuse:1953495]

Message from syslogd@login-3 at Oct 17 21:09:14 ...
 kernel:watchdog: BUG: soft lockup - CPU#6 stuck for 26s! [fuse:1953430]

This happens for various CPU#X numbers, while the model keeps running and producing output. I am not sure what the displayed message actually relates to (see e.g. https://www.suse.com/support/kb/doc/?id=000018705).

@mvdebolskiy

@jmaerz that's from the login nodes too.
They've broken something again :D

@jmaerz
Contributor

jmaerz commented Oct 18, 2024

Thanks for the explanation - I was guessing so, but wasn't entirely sure.

@IngoBethke
Contributor

Betzy's queuing system seems to be broken or put on pause. My queued jobs all got "requeued", i.e. cancelled, and newly submitted jobs are not even being processed properly by the queuing system. That other users are subject to the same issue is apparent from Betzy's load chart.

@rosiealice

I had the same problem. My jobs just started disappearing without a trace...
