PIO error when using gnu (> v10.1.0) and MPT #359

Open
nmizukami opened this issue Mar 31, 2023 · 5 comments
Labels
bug, cesm-coupling (For cesm coupling), help wanted, standalone (For stand-alone run)

Comments

@nmizukami
Collaborator

nmizukami commented Mar 31, 2023

When mizuRoute is built with the GNU compiler and MPT, the PIO sync call fails, seemingly at random, with a segmentation fault (invalid memory reference).

The Intel compiler with MPT works fine.
GNU with OpenMPI also seems to work fine.
The error occurs when mizuRoute is run with a large, high-resolution river network (MERIT-Hydro).

I have been running into this problem for a long time (several years now).

The specific configuration is:
gnu v12.1.0
netcdf v4.8.1
pnetcdf v1.12.3
mpt v2.25

The traceback looks like this (built in debug mode with the flags -g -Wall -fmax-errors=0 -fbacktrace -fcheck=all). Frames 14 through 24 are not resolved; they would be in C code.

piolib_mod.F90 Line 1372 is just PIOc_sync(file%fh)

#13  0x2b9d2f8c8f66 in PMPI_File_write_at_all
	at /usr/src/packages/BUILD/mpt/lib/libmpi/src/romio/mpi-io/write_atall.c:61
#14  0xc53728 in ???
#15  0xc3ae8f in ???
#16  0xc38984 in ???
#17  0xc3a4f2 in ???
#18  0xc369ce in ???
#19  0xc37203 in ???
#20  0xb99763 in ???
#21  0x7b8fc1 in ???
#22  0x7b365e in ???
#23  0x7b917b in ???
#24  0x78559b in ???
#25  0x7077a9 in __piolib_mod_MOD_syncfile
	at /glade/u/home/mizukami/sandbox_mizuRoute/libraries/parallelio/src/flib/piolib_mod.F90:1372
#26  0x4193f2 in __pio_utils_MOD_sync_file
	at /glade/u/home/mizukami/sandbox_mizuRoute/route/build/../build/src/pio_utils.f90:391
#27  0x46dcc8 in __historyfile_MOD_write_flux
	at /glade/u/home/mizukami/sandbox_mizuRoute/route/build/../build/src/historyFile.f90:483
#28  0x58a35e in __write_simoutput_pio_MOD_output
	at /glade/u/home/mizukami/sandbox_mizuRoute/route/build/../build/src/write_simoutput_pio.f90:224
#29  0x7042d8 in route_runoff
	at /glade/u/home/mizukami/sandbox_mizuRoute/route/build/../build/src/standalone/route_runoff.f90:81
#30  0x7043f7 in main
	at /glade/u/home/mizukami/sandbox_mizuRoute/route/build/../build/src/standalone/route_runoff.f90:11
MPT ERROR: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
	aborting job
MPT: Received signal 11
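
A note for anyone reading a trace like the one above: the following is a minimal Python sketch, not part of mizuRoute or PIO, that lists the frames gfortran printed without a symbol name. The function name `unresolved_frames` and the shortened `sample` are illustrative assumptions; the sample lines themselves are copied from the trace.

```python
import re

# Matches gfortran-style backtrace frame lines such as
#   "#14  0xc53728 in ???"
FRAME_RE = re.compile(r"^#(\d+)\s+(0x[0-9a-fA-F]+) in (\S+)")


def unresolved_frames(backtrace):
    """Return the frame numbers whose symbol is printed as '???'."""
    frames = []
    for line in backtrace.splitlines():
        m = FRAME_RE.match(line.strip())
        if m and m.group(3) == "???":
            frames.append(int(m.group(1)))
    return frames


# Shortened excerpt of the trace above (illustrative subset only).
sample = """\
#13  0x2b9d2f8c8f66 in PMPI_File_write_at_all
#14  0xc53728 in ???
#24  0x78559b in ???
#25  0x7077a9 in __piolib_mod_MOD_syncfile
"""

print(unresolved_frames(sample))  # prints [14, 24]
```

Run against the full trace, this kind of check makes it easy to see which frames lack debug symbols, which usually points at libraries that were built without -g.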
nmizukami added the bug, help wanted, cesm-coupling, and standalone labels on Apr 4, 2023
@ekluzek
Collaborator

ekluzek commented Aug 9, 2023

I do have some GNU tests that work in the latest...

ERI_Mmpi-serial.5x5_amazon_r05.I2000Clm50SpMizGs.izumi_gnu.mizuroute-default
SMS_D_Mmpi-serial.5x5_amazon_r05.I2000Clm50SpMizGs.izumi_gnu.mizuroute-default
ERI_PS.f19_f19_rHDMAlk_mg17.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
RS_PS.f19_f19_mg17.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
ERS_PS.f19_f19_rHDMAlk_mg17.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
ERS_PS.nldas2_nldas2_rHDMA_mnldas2.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
ERS_PS.nldas2_nldas2_rUSGS_mnldas2.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
PET_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
PET_P215x8.nldas2_nldas2_rHDMA_mnldas2.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
PFS.f19_f19_rHDMA_mg17.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
SMS.f09_f09_rHDMAlk_mg17.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
SMS_D.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
SMS_D_Mmpi-serial.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
SMS_Mmpi-serial_D_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
SMS_P720x4.nldas2_nldas2_rMERIT_mnldas2.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default

However, it also seems that the problem only shows up after running for at least 10 years.

These tests use:

gnu/10.1.0
mpt/2.25
netcdf-mpi/4.9.0
pnetcdf/1.12.3

@nmizukami
Collaborator Author

nmizukami commented Jan 26, 2024

More updates. @ekluzek, do you think this is enough information for someone to identify the root cause of the error?

This test was run on derecho with gcc and cray-mpich. The modules loaded for compilation and for the runs are:

 1) ncarenv/23.09 (S)   2) cmake/3.26.3   3) nccmp/1.9.1.0   4) ncview/2.1.9   5) conda/latest   6) cdo/2.2.2   7) nco/5.1.6   8) gcc/12.2.0   9) hdf5/1.12.2  10) netcdf/4.9.2  11) ncarcompilers/1.0.0  12) craype/2.7.23  13) cray-mpich/8.1.27  14) parallel-netcdf/1.12.3

Note that intel/cray-mpich and gcc/openmpi5.0.0 work fine.

The run died after several time iterations in the PIO sync call. Using DDT, I was able to trace it back to the function where it stopped.

#29 route_runoff () at /glade/u/home/mizukami/model/mizuRoute/route/build/../build/src/standalone/route_runoff.f90:81 (at 0x6e187a)
#28 write_simoutput_pio::output (ierr=0, message=<256-character buffer contents omitted>, _message=256) at /glade/u/home/mizukami/model/mizuRoute/route/build/../build/src/write_simoutput_pio.f90:218 (at 0x5881bc)
#27 historyfile::sync (this=(...), ierr=0, message=<256-character buffer contents omitted>, _message=256) at /glade/u/home/mizukami/model/mizuRoute/route/build/../build/src/historyFile.f90:354 (at 0x47c572)
#26 pio_utils::sync_file (piofiledesc=(...), ierr=0, message=<256-character buffer contents omitted>, _message=256) at /glade/u/home/mizukami/model/mizuRoute/route/build/../build/src/pio_utils.f90:409 (at 0x43578e)
#25 piolib_mod::syncfile (file=(...)) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/flib/piolib_mod.F90:1470 (at 0x6e5e5a)
#24 PIOc_sync (ncid=129) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/clib/pio_file.c:422 (at 0x76f51a)
#23 flush_buffer (ncid=129, wmb=0x1871f970, flushtodisk=true) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/clib/pio_darray_int.c:1820 (at 0x7a9af0)
#22 PIOc_write_darray_multi (ncid=129, varids=0x1b5a8020, ioid=512, nvars=5, arraylen=42191, array=0x125066f0, frame=0x19175c40, fillvalue=0x0, flushtodisk=true) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/clib/pio_darray.c:420 (at 0x7a3b94)
#21 flush_output_buffer (file=0x190c47d0, force=true, addsize=0) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/clib/pio_darray_int.c:1765 (at 0x7a995a)
#20 ncmpi_wait_all () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x15425120f3cc)
#19 ncmpio_wait () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512c1f9b)
#18 req_commit () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512c1751)
#17 wait_getput () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512c534c)
#16 req_aggregation () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512c3781)
#15 mgetput () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512c5d1a)
#14 ncmpio_read_write () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512cb319)
#13 PMPI_File_write_at_all () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500c9791)
#12 MPIOI_File_write_all () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500c7e59)
#11 ADIOI_GPFS_WriteStridedColl () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500d6216)
#10 ADIOI_GPFS_Calc_others_req () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500cede3)
#9 PMPI_Alltoallv () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424db2e1ea)
#8 MPIR_Alltoallv_impl () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424db2d1f8)
#7 MPIR_Alltoallv_intra_auto () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424db2d096)
#6 MPIR_Alltoallv_intra_scattered () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424f5b8b82)
#5 MPIC_Waitall () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424f6db226)
#4 MPIR_Waitall () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424e97a22f)
#3 MPIR_Waitall_impl () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424e911dc1)
#2 MPIDI_SHMI_progress () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424ff0092f)
#1 MPIR_Cray_Memcpy_wrapper () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424ff3aea4)
#0 _cray_mpi_memcpy_rome () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500a5f50)
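
To make the call chain in the DDT trace easier to read, here is a minimal Python sketch that groups consecutive frames by the software layer they belong to. It is only a reading aid under stated assumptions: the layer names, the path substrings used to classify frames, and the helper names `layer_of` and `summarize` are all made up for illustration, and the sample is a shortened excerpt of the trace above.

```python
import re
from itertools import groupby

# DDT-style frames look like either
#   "#20 ncmpi_wait_all () from /path/libpnetcdf.so.4 (at 0x...)"
#   "#24 PIOc_sync (ncid=129) at /path/pio_file.c:422 (at 0x...)"
FRAME_RE = re.compile(r"^#(\d+) (\S+) \(.*?\) (?:from|at) (\S+)")


def layer_of(location):
    """Classify a frame by its library or source path (assumed substrings)."""
    if "libpnetcdf" in location:
        return "PnetCDF"
    if "libmpi" in location:
        return "MPI library"
    if "parallelio" in location:
        return "PIO"
    return "mizuRoute"


def summarize(trace):
    """Return [(layer, [frame numbers])], collapsing consecutive frames per layer."""
    frames = []
    for line in trace.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append((int(m.group(1)), layer_of(m.group(3))))
    return [(layer, [num for num, _ in grp])
            for layer, grp in groupby(frames, key=lambda f: f[1])]


# Shortened excerpt of the DDT trace above (illustrative subset only).
sample = """\
#25 piolib_mod::syncfile (file=(...)) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/flib/piolib_mod.F90:1470 (at 0x6e5e5a)
#24 PIOc_sync (ncid=129) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/clib/pio_file.c:422 (at 0x76f51a)
#20 ncmpi_wait_all () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x15425120f3cc)
#13 PMPI_File_write_at_all () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500c9791)
#0 _cray_mpi_memcpy_rome () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500a5f50)
"""

for layer, numbers in summarize(sample):
    print(layer, numbers)
```

On the full trace this shows the crash path going from mizuRoute into PIO, then PnetCDF, then the MPI library, where the memcpy in frame #0 fails.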

[Image: Screen Shot 2024-01-25 at 10 33 25 AM]

@nmizukami
Collaborator Author

nmizukami commented Jun 12, 2024

Hi @ekluzek, I heard about some pnetcdf issues in CESM I/O during the CESM workshop (I believe in the CSEG working group and in the ultra-high-resolution modeling session). Coincidentally, I noticed that the output error in mizuRoute happens when PIO is built with pnetcdf support. When PIO is built without pnetcdf (using netcdf only), mizuRoute PIO output is stable. Note that this happens only for PIO built with gnu and cray-mpich.
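
One quick way to confirm which kind of build an executable came from is to look for libpnetcdf in the output of `ldd`. Below is a minimal Python sketch of that check. The function name `links_pnetcdf` is an assumption, and the sample `ldd` lines are hypothetical (the libnetcdf path and the load addresses are placeholders; only the libpnetcdf path is taken from the DDT trace). It only detects a dynamically linked pnetcdf; a statically linked one would not appear in `ldd` output.

```python
def links_pnetcdf(ldd_output):
    """Return True if any line of `ldd` output names a libpnetcdf shared library."""
    for line in ldd_output.splitlines():
        # On each ldd line, the library name is the part before "=>".
        name = line.split("=>")[0].strip()
        if name.startswith("libpnetcdf"):
            return True
    return False


# Hypothetical ldd output for illustration; addresses and the netcdf path are placeholders.
with_pnetcdf = (
    "\tlibpnetcdf.so.4 => /glade/u/apps/derecho/23.09/spack/opt/spack/"
    "parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4"
    " (0x0000000000000000)\n"
    "\tlibnetcdf.so.19 => /path/to/libnetcdf.so.19 (0x0000000000000000)\n"
)
without_pnetcdf = "\tlibnetcdf.so.19 => /path/to/libnetcdf.so.19 (0x0000000000000000)\n"

print(links_pnetcdf(with_pnetcdf), links_pnetcdf(without_pnetcdf))  # prints True False
```

In practice the input would be the text captured from running `ldd` on the mizuRoute executable.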

@ekluzek
Collaborator

ekluzek commented Jun 13, 2024

@nmizukami, looking at both the ParallelIO and pnetcdf GitHub pages, I don't see an issue that might explain this.

Can you figure out which talks mentioned this? Then we could watch the videos and find where they discuss it. That might give us more context for finding where this is being discussed.

@nmizukami
Collaborator Author

nmizukami commented Jun 13, 2024

Hi Erik, the few talks that briefly mentioned the pnetcdf issue are in the day 2 ultra-high-resolution session:

SIMA talk: slide 14, or around 08:48:00 on YouTube

Earthwork: slide 8
