Segmentation fault in scream::DataInterpolation::setup_time_database() after SPA uses the DataInterpolation class #7066

Open
dqwu opened this issue Feb 27, 2025 · 2 comments · May be fixed by #7067
Labels: EAMxx (PRs focused on capabilities for EAMxx), Frontier

dqwu commented Feb 27, 2025

This issue is confirmed to be caused by PR #6920, which allows SPA to use the DataInterpolation class.

When I tested the ne1024 SCREAM decadal 1-day simulation on Frontier with the latest E3SM code, the error occurred in less than 3 minutes, and no output files were produced.

The run script is based on https://github.com/E3SM-Project/eamxx-scripts/blob/b6cc8be5d471eac8cb48d5fdc6e9a9583bfac8aa/run_scripts/run.decadal-amip.sh, with STOP_N and REST_N changed to 1.

  • Compiler: craygnu-hipcc
  • Run time: 00:02:48
  • Run directory: /lustre/orion/cli115/proj-shared/wuda/e3sm_scratch/F20TR-SCREAMv1_ne1024pg2_ne1024pg2_all_pnetcdf_1day_init_20250225/run

[Core dump stack trace]

#0  scream::scorpio::get_all_times (filename="/lustre/orion/cli115/world-shared/e3sm/inputdata/atm/scream/init/spa_v3.LR.F2010.2011-2025.c_20240405.nc")
    at E3SM/components/eamxx/src/share/io/eamxx_scorpio_interface.cpp:1172
#1  0x0000000001e3ffc5 in scream::DataInterpolation::setup_time_database (this=0x4d035d30, input_files=std::vector of length 1, capacity 1 = {...}, timeline=scream::util::TimeLine::YearlyPeriodic, 
    ref_ts=...) at E3SM/components/eamxx/src/share/util/eamxx_data_interpolation.cpp:223
#2  0x000000000186bec1 in scream::SPA::initialize_impl (this=0x199ada40) at E3SM/components/eamxx/src/physics/spa/eamxx_spa_process_interface.cpp:89
#3  0x00000000019c212e in scream::AtmosphereProcess::initialize (this=0x199ada40, t0=..., run_type=scream::RunType::Initial)
    at E3SM/components/eamxx/src/share/atm_process/atmosphere_process.cpp:75
#4  0x00000000019febcd in scream::AtmosphereProcessGroup::initialize_impl (this=<optimized out>, run_type=scream::RunType::Initial)
    at E3SM/components/eamxx/src/share/atm_process/atmosphere_process_group.cpp:369
#5  0x00000000019c212e in scream::AtmosphereProcess::initialize (this=0x199c0bf0, t0=..., run_type=scream::RunType::Initial)
    at E3SM/components/eamxx/src/share/atm_process/atmosphere_process.cpp:75
#6  0x00000000019febcd in scream::AtmosphereProcessGroup::initialize_impl (this=<optimized out>, run_type=scream::RunType::Initial)
    at E3SM/components/eamxx/src/share/atm_process/atmosphere_process_group.cpp:369
#7  0x00000000019c212e in scream::AtmosphereProcess::initialize (this=0x18f02080, t0=..., run_type=scream::RunType::Initial)
    at E3SM/components/eamxx/src/share/atm_process/atmosphere_process.cpp:75
#8  0x00000000019febcd in scream::AtmosphereProcessGroup::initialize_impl (this=<optimized out>, run_type=scream::RunType::Initial)
    at E3SM/components/eamxx/src/share/atm_process/atmosphere_process_group.cpp:369
#9  0x00000000019c212e in scream::AtmosphereProcess::initialize (this=0x199903c0, t0=..., run_type=scream::RunType::Initial)
    at E3SM/components/eamxx/src/share/atm_process/atmosphere_process.cpp:75
#10 0x0000000000e5cd99 in scream::control::AtmosphereDriver::initialize_atm_procs (this=0x17170aa0)
    at E3SM/components/eamxx/src/control/atmosphere_driver.cpp:1578
#11 0x000000000051b7e4 in scream_init_atm::{lambda()#1}::operator()() const (this=<optimized out>)
    at E3SM/components/eamxx/src/mct_coupling/eamxx_cxx_f90_interface.cpp:261
#12 (anonymous namespace)::fpe_guard_wrapper<scream_init_atm::{lambda()#1}>(scream_init_atm::{lambda()#1} const&) (f=...)
    at E3SM/components/eamxx/src/mct_coupling/eamxx_cxx_f90_interface.cpp:58
#13 scream_init_atm () at E3SM/components/eamxx/src/mct_coupling/eamxx_cxx_f90_interface.cpp:255
#14 0x0000000000518554 in atm_comp_mct::atm_init_mct (eclock=..., cdata=..., x2a=..., a2x=..., nlfilename=..., _nlfilename=_nlfilename@entry=6)
    at E3SM/components/eamxx/src/mct_coupling/atm_comp_mct.F90:282
#15 0x00000000004823ae in component_mod::component_init_cc (eclock=..., comp=..., comp_init=0x517450 <atm_comp_mct::atm_init_mct>, infodata=..., nlfilename=..., seq_flds_x2c_fluxes=..., 
    seq_flds_c2x_fluxes=..., _nlfilename=6, _seq_flds_x2c_fluxes=0, _seq_flds_c2x_fluxes=0) at E3SM/driver-mct/main/component_mod.F90:258
#16 0x000000000047126b in cime_comp_mod::cime_init () at E3SM/driver-mct/main/cime_comp_mod.F90:1496
#17 0x0000000000453ca6 in cime_driver () at E3SM/driver-mct/main/cime_driver.F90:122
#18 main (argc=<optimized out>, argv=<optimized out>) at E3SM/driver-mct/main/cime_driver.F90:23
#19 0x00007fffe4f02eec in __libc_start_call_main () from /lib64/libc.so.6
#20 0x00007fffe4f02fb5 in __libc_start_main_impl () from /lib64/libc.so.6
#21 0x0000000000462551 in _start () at ../sysdeps/x86_64/start.S:115

For reasons not yet known, dim.length is likely a huge number at line 1172 of eamxx_scorpio_interface.cpp:

#0  scream::scorpio::get_all_times (filename=...) at E3SM/components/eamxx/src/share/io/eamxx_scorpio_interface.cpp:1172
1172	  std::vector<double> times (dim.length);
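
One way a length can end up "huge" at a line like this is a negative signed value being converted to the unsigned size type that the std::vector constructor takes. The sketch below is only an illustration of that conversion, not the EAMxx code: it assumes the length is stored as an int, and the helper names as_vector_size and try_allocate_times are hypothetical.

```cpp
#include <cstddef>
#include <new>
#include <stdexcept>
#include <vector>

// Hypothetical helper: the size the std::vector constructor would see
// if it were handed a signed length (assumed here to be an int).
std::size_t as_vector_size(int length) {
  // A negative value wraps around to a very large unsigned number.
  return static_cast<std::size_t>(length);
}

// Hypothetical helper: try to build a vector of `length` doubles and
// report whether the construction succeeded.
bool try_allocate_times(int length) {
  try {
    std::vector<double> times(as_vector_size(length));
    return true;
  } catch (const std::length_error&) {
    // Requested size exceeds what the vector can ever hold.
    return false;
  } catch (const std::bad_alloc&) {
    // Requested size is representable but the allocation failed.
    return false;
  }
}
```

With these helpers, as_vector_size(-1) is larger than std::vector<double>::max_size(), so try_allocate_times(-1) returns false, while a small non-negative length such as 12 succeeds.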

bartgol commented Feb 27, 2025

The helper struct PIODim sets length = -1 by default (to catch uninitialized-use errors). I should probably add more checks before accessing dim.length. My guess is that some assumption about the time dim in the SPA data files is incorrect, which ultimately leaves this number at -1. Maybe the files used in unit testing are OK, but the ones used in production runs are slightly different? I will have to do more in-depth debugging when I get back from travel.
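
A check of the kind mentioned above could look roughly like the following. This is a minimal sketch under stated assumptions: Dim is a hypothetical stand-in for PIODim (only the default of length = -1 comes from this thread), checked_length is a hypothetical helper, and std::runtime_error stands in for whatever error mechanism the real code uses.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the PIODim helper struct. Only the
// default of length = -1 is taken from the discussion; the other
// member is an assumption for illustration.
struct Dim {
  std::string name;
  int length = -1;  // sentinel meaning "not yet read from the file"
};

// Hypothetical helper: return the length as a size, or fail loudly
// if the sentinel (or any other negative value) is still in place.
std::size_t checked_length(const Dim& dim) {
  if (dim.length < 0) {
    throw std::runtime_error("Dimension '" + dim.name +
                             "' has an uninitialized length.");
  }
  return static_cast<std::size_t>(dim.length);
}
```

Sizing a buffer with checked_length(dim) instead of dim.length directly would turn an unset length into an error message that names the dimension, rather than an attempt to allocate an enormous vector.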

dqwu commented Feb 27, 2025

FYI, the stack trace shows that the SPA input file in this production run is /lustre/orion/cli115/world-shared/e3sm/inputdata/atm/scream/init/spa_v3.LR.F2010.2011-2025.c_20240405.nc.
