Using branch feature/noresm2_5_alpha04_v3 of https://github.com/mvertens/NorESM.git, compset 2000_DATM%JRA_SLND_CICE_BLOM_DROF%JRA_SGLC_SWAV_SESP and grid combination TL319_tn14, the simulation crashed at what seemed to be the first attempt to write CICE diagnostics.
The error message in cesm.log.* was:
[b4167:12257] *** An error occurred in MPI_Gather
[b4167:12257] *** reported by process [47501952548864,123]
[b4167:12257] *** on communicator MPI COMMUNICATOR 49 SPLIT FROM 44
[b4167:12257] *** MPI_ERR_TRUNCATE: message truncated
[b4167:12257] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[b4167:12257] *** and potentially your MPI job)
I had a feeling it could have something to do with Lustre (LFS) striping, and I saw that in env_run.xml, PIO_STRIDE was set to $MAX_MPITASKS_PER_NODE. This is 128 on Betzy, but I was running CICE on fewer processors than that (96). When I manually set PIO_STRIDE to 8 for all components (a somewhat arbitrary choice), the simulation ran fine. I am not sure this is the reason for the crash, but if it is, maybe PIO_STRIDE should be set to the minimum of $MAX_MPITASKS_PER_NODE and the number of processors per component? A rough sketch of that rule follows below.
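For illustration only, here is a minimal Python sketch of the proposed cap (this is not CIME code; the component names and task counts other than CICE's 96 are made up for the example, and MAX_MPITASKS_PER_NODE is the Betzy value of 128):

```python
# Sketch of the proposed rule: cap PIO_STRIDE at the number of MPI tasks
# a component actually uses, rather than always using the per-node maximum.

MAX_MPITASKS_PER_NODE = 128  # value on Betzy

# Hypothetical per-component task counts for a layout like the one above;
# only the CICE count (96) comes from the failing run.
ntasks = {
    "ATM": 128,
    "ICE": 96,
    "OCN": 128,
}

def pio_stride(component_ntasks, max_tasks_per_node=MAX_MPITASKS_PER_NODE):
    """Return the smaller of the per-node task limit and the component's task count."""
    return min(max_tasks_per_node, component_ntasks)

for comp, n in ntasks.items():
    print(f"{comp}: PIO_STRIDE = {pio_stride(n)}")
```

With this rule, ICE would get PIO_STRIDE = 96 instead of 128, so the stride never exceeds the number of tasks available to the component.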