Using branch feature/noresm2_5_alpha04_v3 of https://github.com/mvertens/NorESM.git, compset 2000_DATM%JRA_SLND_CICE_BLOM_DROF%JRA_SGLC_SWAV_SESP and grid combination TL319_tn14, the simulation crashed at what seemed to be the first attempt to write CICE diagnostics.
The error message in cesm.log.* was:
[b4167:12257] *** An error occurred in MPI_Gather
[b4167:12257] *** reported by process [47501952548864,123]
[b4167:12257] *** on communicator MPI COMMUNICATOR 49 SPLIT FROM 44
[b4167:12257] *** MPI_ERR_TRUNCATE: message truncated
[b4167:12257] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[b4167:12257] *** and potentially your MPI job)
I had a feeling it could have something to do with Lustre (LFS) striping, and I saw that in env_run.xml, PIO_STRIDE was set to $MAX_MPITASKS_PER_NODE. This is 128 on Betzy, but I was running CICE on fewer processors than that (96). When I manually set PIO_STRIDE to 8 for all components (a somewhat arbitrary choice), the simulation ran fine. I am not sure this is the reason for the crash, but if it is, maybe PIO_STRIDE should be set to the minimum of $MAX_MPITASKS_PER_NODE and the number of processors per component? A rough sketch of that rule follows below.
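For illustration only, here is a minimal Python sketch of the proposed cap (this is not CIME code; the component names and task counts other than CICE's 96 are made up for the example, and MAX_MPITASKS_PER_NODE is the Betzy value of 128):

```python
# Sketch of the proposed rule: cap PIO_STRIDE at the number of MPI tasks
# a component actually uses, rather than always using the per-node maximum.

MAX_MPITASKS_PER_NODE = 128  # value on Betzy

# Hypothetical per-component task counts for a layout like the one above;
# only the CICE count (96) comes from the failing run.
ntasks = {
    "ATM": 128,
    "ICE": 96,
    "OCN": 128,
}

def pio_stride(component_ntasks, max_tasks_per_node=MAX_MPITASKS_PER_NODE):
    """Return the smaller of the per-node task limit and the component's task count."""
    return min(max_tasks_per_node, component_ntasks)

for comp, n in ntasks.items():
    print(f"{comp}: PIO_STRIDE = {pio_stride(n)}")
```

With this rule, ICE would get PIO_STRIDE = 96 instead of 128, so the stride never exceeds the number of tasks available to the component.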