This issue is confirmed to be caused by PR #6920, which allows SPA to use the DataInterpolation class.
When I tested the ne1024 SCREAM decadal 1-day simulation on Frontier with the latest E3SM code, the error occurred in less than 3 minutes and no output files were produced.
Run directory: /lustre/orion/cli115/proj-shared/wuda/e3sm_scratch/F20TR-SCREAMv1_ne1024pg2_ne1024pg2_all_pnetcdf_1day_init_20250225/run
[Core dump stack trace]
#0 scream::scorpio::get_all_times (filename="/lustre/orion/cli115/world-shared/e3sm/inputdata/atm/scream/init/spa_v3.LR.F2010.2011-2025.c_20240405.nc")
at E3SM/components/eamxx/src/share/io/eamxx_scorpio_interface.cpp:1172
#1 0x0000000001e3ffc5 in scream::DataInterpolation::setup_time_database (this=0x4d035d30, input_files=std::vector of length 1, capacity 1 = {...}, timeline=scream::util::TimeLine::YearlyPeriodic,
ref_ts=...) at E3SM/components/eamxx/src/share/util/eamxx_data_interpolation.cpp:223
#2 0x000000000186bec1 in scream::SPA::initialize_impl (this=0x199ada40) at E3SM/components/eamxx/src/physics/spa/eamxx_spa_process_interface.cpp:89
#3 0x00000000019c212e in scream::AtmosphereProcess::initialize (this=0x199ada40, t0=..., run_type=scream::RunType::Initial)
at E3SM/components/eamxx/src/share/atm_process/atmosphere_process.cpp:75
#4 0x00000000019febcd in scream::AtmosphereProcessGroup::initialize_impl (this=<optimized out>, run_type=scream::RunType::Initial)
at E3SM/components/eamxx/src/share/atm_process/atmosphere_process_group.cpp:369
#5 0x00000000019c212e in scream::AtmosphereProcess::initialize (this=0x199c0bf0, t0=..., run_type=scream::RunType::Initial)
at E3SM/components/eamxx/src/share/atm_process/atmosphere_process.cpp:75
#6 0x00000000019febcd in scream::AtmosphereProcessGroup::initialize_impl (this=<optimized out>, run_type=scream::RunType::Initial)
at E3SM/components/eamxx/src/share/atm_process/atmosphere_process_group.cpp:369
#7 0x00000000019c212e in scream::AtmosphereProcess::initialize (this=0x18f02080, t0=..., run_type=scream::RunType::Initial)
at E3SM/components/eamxx/src/share/atm_process/atmosphere_process.cpp:75
#8 0x00000000019febcd in scream::AtmosphereProcessGroup::initialize_impl (this=<optimized out>, run_type=scream::RunType::Initial)
at E3SM/components/eamxx/src/share/atm_process/atmosphere_process_group.cpp:369
#9 0x00000000019c212e in scream::AtmosphereProcess::initialize (this=0x199903c0, t0=..., run_type=scream::RunType::Initial)
at E3SM/components/eamxx/src/share/atm_process/atmosphere_process.cpp:75
#10 0x0000000000e5cd99 in scream::control::AtmosphereDriver::initialize_atm_procs (this=0x17170aa0)
at E3SM/components/eamxx/src/control/atmosphere_driver.cpp:1578
#11 0x000000000051b7e4 in scream_init_atm::{lambda()#1}::operator()() const (this=<optimized out>)
at E3SM/components/eamxx/src/mct_coupling/eamxx_cxx_f90_interface.cpp:261
#12 (anonymous namespace)::fpe_guard_wrapper<scream_init_atm::{lambda()#1}>(scream_init_atm::{lambda()#1} const&) (f=...)
at E3SM/components/eamxx/src/mct_coupling/eamxx_cxx_f90_interface.cpp:58
#13 scream_init_atm () at E3SM/components/eamxx/src/mct_coupling/eamxx_cxx_f90_interface.cpp:255
#14 0x0000000000518554 in atm_comp_mct::atm_init_mct (eclock=..., cdata=..., x2a=..., a2x=..., nlfilename=..., _nlfilename=_nlfilename@entry=6)
at E3SM/components/eamxx/src/mct_coupling/atm_comp_mct.F90:282
#15 0x00000000004823ae in component_mod::component_init_cc (eclock=..., comp=..., comp_init=0x517450 <atm_comp_mct::atm_init_mct>, infodata=..., nlfilename=..., seq_flds_x2c_fluxes=...,
seq_flds_c2x_fluxes=..., _nlfilename=6, _seq_flds_x2c_fluxes=0, _seq_flds_c2x_fluxes=0) at E3SM/driver-mct/main/component_mod.F90:258
#16 0x000000000047126b in cime_comp_mod::cime_init () at E3SM/driver-mct/main/cime_comp_mod.F90:1496
#17 0x0000000000453ca6 in cime_driver () at E3SM/driver-mct/main/cime_driver.F90:122
#18 main (argc=<optimized out>, argv=<optimized out>) at E3SM/driver-mct/main/cime_driver.F90:23
#19 0x00007fffe4f02eec in __libc_start_call_main () from /lib64/libc.so.6
#20 0x00007fffe4f02fb5 in __libc_start_main_impl () from /lib64/libc.so.6
#21 0x0000000000462551 in _start () at ../sysdeps/x86_64/start.S:115
For a reason not yet identified, dim.length is likely a huge number at line 1172 of eamxx_scorpio_interface.cpp:
#0 scream::scorpio::get_all_times (filename=...) at E3SM/components/eamxx/src/share/io/eamxx_scorpio_interface.cpp:1172
1172 std::vector<double> times (dim.length);
The helper struct PIODim has the attribute length = -1 by default (to catch uninitialized errors). I should probably add some more checks before accessing dim.length. My guess is that some assumption regarding the time dim in the SPA data files is incorrect, which ultimately leaves this number at -1. Maybe the files used in unit testing are OK, but the ones used in production runs are slightly different? I will have to do some more in-depth debugging when I get back from travel.
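To illustrate the failure mode described above: if the length is a signed integer still holding the -1 default, converting it to the unsigned size type that std::vector expects wraps it to a very large value, which would be consistent with the "huge number" suspected at line 1172. The following is a minimal C++ sketch of the kind of check mentioned, not the actual EAMxx code. `DimSketch` and `make_times_buffer` are hypothetical names, and the `int` type of `length` is an assumption; only the -1 default comes from the comment above.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the PIODim helper struct.
// Assumption: length is a signed int that defaults to -1 when never set.
struct DimSketch {
  std::string name;
  int length = -1;
};

// Hypothetical guard: refuse to size a buffer from an uninitialized length.
// Without this check, -1 converted to std::size_t becomes the largest
// representable size, and the vector constructor fails at allocation.
std::vector<double> make_times_buffer(const DimSketch& dim) {
  if (dim.length < 0) {
    throw std::runtime_error(
        "Dimension '" + dim.name + "' has an uninitialized length (" +
        std::to_string(dim.length) + ")");
  }
  return std::vector<double>(static_cast<std::size_t>(dim.length));
}
```

With a check like this, an unset time dimension would produce a readable error message naming the dimension instead of a core dump inside the vector constructor.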
FYI, the stack trace shows that the SPA input file in this production run is /lustre/orion/cli115/world-shared/e3sm/inputdata/atm/scream/init/spa_v3.LR.F2010.2011-2025.c_20240405.nc.
The run script is based on https://github.com/E3SM-Project/eamxx-scripts/blob/b6cc8be5d471eac8cb48d5fdc6e9a9583bfac8aa/run_scripts/run.decadal-amip.sh (with STOP_N/REST_N changed to 1).