C384 P7 coupled tests fail on cheyenne #698

DeniseWorthen · 2021-07-20T13:49:42Z

Description

This job fails to run at startup with an error: MPT: shepherd terminated: r5i4n4.ib0.cheyenne.ucar.edu - job aborting

To Reproduce:

Cheyenne.intel

Additional context

Found during testing for PR #639. The test was turned off for cheyenne.intel until the issue can be resolved.

Since the wave model cannot be compiled in debug mode, an equivalent test was created for a non-wave bmark_p7b configuration. The test can be accessed using rt.test in this branch.

When running in debug mode, the model fails with the following:

2:MPT:     header=header@entry=0x7ffdcfe45c10 "MPT ERROR: Rank 32(g:32) received signal SIGFPE(8).\n\tProcess ID: 62280, Host: r7i7n1, Program: /glade/scratch/worthen/FV3_RT/rt_13027/cpld_bmark_v16_p7b/fv3.exe\n\tMPT Version: HPE MPT 2.22  03/31/20 15"...) at sig.c:340
32:MPT: #3  0x00002b177c55e4ff in first_arriver_handler (signo=signo@entry=8,
32:MPT:     stack_trace_sem=stack_trace_sem@entry=0x2b1786c00080) at sig.c:489
32:MPT: #4  0x00002b177c55e793 in slave_sig_handler (signo=8, siginfo=<optimized out>,
32:MPT:     extra=<optimized out>) at sig.c:565
32:MPT: #5  <signal handler called>
32:MPT: #6  0x0000000008047c67 in module_sf_noahmplsm::energy (parameters=..., ice=0,
32:MPT:     vegtyp=17, ist=1, nsnow=3, nsoil=4, isnow=0, dt=300,
32:MPT:     rhoair=1.1787577636261917, sfcprs=101003.09822185335,
32:MPT:     qair=0.011852218429786931, sfctmp=296.38129134030879,
32:MPT:     thair=296.38129134030879, lwdn=345.52102043841046, uu=3.4295500034750659,
32:MPT:     vv=-3.4672281739002062, zref=10.812501434103201,
32:MPT:     co2air=39.896223797632075, o2air=21109.64752836735, solad=..., solai=...,
32:MPT:     cosz=-0.91289500682162039, igs=1, eair=1910.8519305175637,
32:MPT:     tbot=299.26776123046875, zsnso=..., zsoil=..., elai=0, esai=0, fwet=0,
32:MPT:     foln=1, fveg=0, pahv=0, pahg=0, pahb=0, qsnow=0, dzsnso=...,
32:MPT:     lat=0.21661703129464016, canliq=0, canice=0, iloc=6, jloc=-9999,
32:MPT:     z0wrf=nan(0x7baddadbaddad), imelt=..., snicev=..., snliqv=..., epore=...,
32:MPT:     t2m=nan(0x7baddadbaddad), fsno=0, sav=0, sag=0,
32:MPT:     qmelt=nan(0x7baddadbaddad), fsa=0, fsr=0, taux=nan(0x7baddadbaddad),
32:MPT:     tauy=nan(0x7baddadbaddad), fira=nan(0x7baddadbaddad),
32:MPT:     fsh=nan(0x7baddadbaddad), fcev=nan(0x7baddadbaddad),
32:MPT:     fgev=nan(0x7baddadbaddad), fctr=nan(0x7baddadbaddad),
32:MPT:     trad=nan(0x7baddadbaddad), psn=nan(0x7baddadbaddad),
32:MPT:     apar=nan(0x7baddadbaddad), ssoil=nan(0x7baddadbaddad), btrani=...,
32:MPT:     btran=9.9999999999999995e-07, ponding=nan(0x7baddadbaddad),
32:MPT:     ts=nan(0x7baddadbaddad), latheav=nan(0x7baddadbaddad),
32:MPT:     latheag=nan(0x7baddadbaddad), frozen_canopy=3435973836,
32:MPT:     frozen_ground=3435973836, tv=291.71072387695312, tg=291.71072387695312,
32:MPT:     stc=..., snowh=0, eah=2000, tah=291.71072387695312, sneqvo=0, sneqv=0,
32:MPT:     sh2o=..., smc=..., snice=..., snliq=..., albold=0.65000000000000002, cm=0,
32:MPT:     ch=0, dx=-9999, dz8w=-9999, q2=0.011852218429786931, tauss=0, laisun=0,
32:MPT:     laisha=0, rb=0, errmsg=..., errflg=0, qc=-9999, qsfc=9.99e+20,
32:MPT:     psfc=101128.33749723693, t2mv=0, t2mb=nan(0x7baddadbaddad), fsrv=0,
32:MPT:     fsrg=0, rssun=nan(0x7baddadbaddad), rssha=nan(0x7baddadbaddad), albd=...,
32:MPT:     albi=..., albsnd=..., albsni=..., bgap=0, wgap=0,
32:MPT:     tgv=nan(0x7baddadbaddad), tgb=nan(0x7baddadbaddad),
32:MPT:     q1=nan(0x7baddadbaddad), q2v=0, q2b=nan(0x7baddadbaddad),
32:MPT:     q2e=nan(0x7baddadbaddad), chv=0, chb=nan(0x7baddadbaddad),
32:MPT:     emissi=nan(0x7baddadbaddad), pah=0, shg=0, shc=0,
32:MPT:     shb=nan(0x7baddadbaddad), evg=0, evb=nan(0x7baddadbaddad), ghv=0,
32:MPT:     ghb=nan(0x7baddadbaddad), irg=0, irc=0, irb=nan(0x7baddadbaddad), tr=0,
32:MPT:     evc=0, chleaf=0, chuc=0, chv2=0, chb2=nan(0x7baddadbaddad),
32:MPT:     .tmp.ERRMSG.len_V$698=512)

climbfuji · 2021-07-20T13:56:37Z

There is an awful number of NaNs in that stack trace. Let's try to figure it out today between my meetings.

DeniseWorthen · 2021-08-20T09:57:23Z

While testing restarts (i.e., no waves) for the upcoming RTs for the coupled model with the P7 configuration, I have been able to run the C192 coupled P7 test (non-wave) on Cheyenne.intel. However, the C384 P7 case (non-wave) still fails with the MPT error.

DeniseWorthen · 2021-09-30T12:31:08Z

The C384 P7 coupled cases still fail on Cheyenne for PR #765. We now have a standalone P7 test so I will make a C384 version of this to see if the error is reproducible.

DeniseWorthen · 2021-09-30T19:11:46Z

I created a control_c384_p7 test and it runs to completion on Cheyenne.intel.

DeniseWorthen · 2021-10-07T16:29:15Z

Re-opening.

DeniseWorthen · 2021-10-07T18:52:25Z

I've been able to run the cpld_control_c384_p7 test on cheyenne.intel by turning off MERRA2. In input.nml:

iaer = 1011 -> iaer = 5111

junwang-noaa · 2021-10-07T18:55:53Z

Are the MERRA2 aerosol climatology files in the Cheyenne input directory?

DeniseWorthen · 2021-10-07T18:56:59Z

Yes, I have the aeroclim.m*.nc files in the run directory.

climbfuji · 2021-10-07T18:57:18Z

The input directories used by the regression tests are identical. If the data comes from elsewhere, then almost certainly Cheyenne doesn't have it.

DeniseWorthen · 2021-10-07T19:05:25Z

I went back and checked the commit history. The test first failed after we converted the v16 test to a v16_p7b test.

climbfuji · 2021-10-07T19:07:34Z

We should keep in mind that the Intel compiler on Cheyenne (2021.2) is significantly newer than on any of the other platforms. Does it run with GNU?

DeniseWorthen · 2021-10-07T19:08:25Z

I first found that turning it off w/ gnu-debug worked. I will test w/ the non-debug version.

DeniseWorthen · 2021-10-07T21:25:01Z

I used both intel and gnu in non-debug mode. First I ran the standard cpld_control_c384_p7 test and they both failed with the MPT error. I copied the run directories and changed iaer to 5111. Both cases ran. The run directories are on cheyenne:

/glade/scratch/worthen/FV3_RT/c384_test_gnu and c384_test_intel

The gnu case timed out, but it did run all the way to fh=6.

junwang-noaa · 2021-10-08T02:29:30Z

Can you copy a run directory from hera to cheyenne and change the executable/job_card/modulefile to see if it runs?

DeniseWorthen · 2021-10-08T20:43:30Z

It appears that increasing the memory available on Cheyenne allows the cpld_control_c384_p7 to run with MERRA2 turned on (iaer=1011).

I made this change in the job_card:

< #PBS -l select=27:ncpus=18:mpiprocs=18
---
> #PBS -l select=14:ncpus=36:mpiprocs=36

...

mpiexec_mpt -p %g: -np 480 ./fv3.exe

I'm not sure that is entirely the right way but it does complete the 6 hours for intel.
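As a rough illustration of why the node layout matters for memory, the sketch below compares the two `select` lines from the job_card for the 480 MPI tasks passed to mpiexec_mpt. The per-node memory value is a placeholder I chose for the arithmetic, not a documented Cheyenne figure, and the `layout` helper is hypothetical.

```python
# Sketch only: compare two PBS "select" layouts for a fixed MPI task count.
# NODE_MEM_GB is an assumed placeholder, not a documented Cheyenne value.

def layout(nodes, ranks_per_node, tasks, node_mem_gb):
    """Return slot capacity, unused slots, and memory share per rank."""
    capacity = nodes * ranks_per_node
    if capacity < tasks:
        raise ValueError("layout has fewer slots than MPI tasks")
    return {
        "capacity": capacity,
        "idle_slots": capacity - tasks,
        # Assumes every slot on a node is occupied and memory is shared evenly.
        "mem_per_rank_gb": node_mem_gb / ranks_per_node,
    }

TASKS = 480          # from "mpiexec_mpt -p %g: -np 480 ./fv3.exe"
NODE_MEM_GB = 64.0   # assumption for illustration

sparse = layout(27, 18, TASKS, NODE_MEM_GB)  # select=27:ncpus=18:mpiprocs=18
dense = layout(14, 36, TASKS, NODE_MEM_GB)   # select=14:ncpus=36:mpiprocs=36

ratio = sparse["mem_per_rank_gb"] / dense["mem_per_rank_gb"]
print(sparse, dense, ratio)
```

Whatever the real node memory is, placing 18 ranks on a node instead of 36 gives each rank about twice the share, at the cost of roughly twice as many nodes.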

DeniseWorthen · 2021-10-09T19:54:14Z

I believe the issue is the memory footprint of the ingested MERRA2 data. The values are stored in the files as float (r4), but when read in they are immediately promoted to double precision. They are then interpolated in time and space for use by the ATM. If the ingested values are kept as single precision and promoted to double precision only when they are interpolated, the model runs on Cheyenne with the default resources.
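For a sense of scale, here is a small sketch of the storage needed by the five-dimensional aerin array at 4 and 8 bytes per value. The level, tracer, and time counts are the levsaer, ntrcaerm, and timeaer parameters from aerclm_def.F; using timeaer as the last dimension, the horizontal extents, and an 8-byte kind_phys are assumptions made only for this estimate, and `aerin_bytes` is a hypothetical helper.

```python
# Sketch only: estimate the size of aerin(i, j, k, tracer, time).
LEVSAER, NTRCAERM, TIMEAER = 72, 15, 12  # parameters from aerclm_def.F

# Bytes per element; kind_phys is assumed to be 8-byte (double precision).
BYTES_PER_VALUE = {"kind_io4": 4, "kind_phys": 8}

def aerin_bytes(nlon, nlat, kind):
    """Bytes needed for an nlon x nlat x LEVSAER x NTRCAERM x TIMEAER array."""
    n_values = nlon * nlat * LEVSAER * NTRCAERM * TIMEAER
    return n_values * BYTES_PER_VALUE[kind]

# Placeholder horizontal extent; each task may hold a smaller sub-domain.
NLON, NLAT = 576, 361

single = aerin_bytes(NLON, NLAT, "kind_io4")
double = aerin_bytes(NLON, NLAT, "kind_phys")

print(f"kind_io4 : {single / 2**30:.2f} GiB")
print(f"kind_phys: {double / 2**30:.2f} GiB")
```

The absolute numbers depend on the extents each task actually reads, but the factor of two between the two kinds does not.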

I have tested the following change on both cheyenne.intel and hera.intel. On hera.intel, all baselines pass against develop-20211006.

diff --git a/physics/aerclm_def.F b/physics/aerclm_def.F
index 157c7b96..e6682527 100644
--- a/physics/aerclm_def.F
+++ b/physics/aerclm_def.F
@@ -1,5 +1,5 @@
       module aerclm_def
-      use machine , only : kind_phys
+      use machine , only : kind_phys, kind_io4
       implicit none

       integer, parameter   :: levsaer=72, ntrcaerm=15, timeaer=12
@@ -10,8 +10,8 @@

       real (kind=kind_phys), allocatable, dimension(:) :: aer_lat
       real (kind=kind_phys), allocatable, dimension(:) :: aer_lon
-      real (kind=kind_phys), allocatable, dimension(:,:,:,:) :: aer_pres
-      real (kind=kind_phys), allocatable, dimension(:,:,:,:,:) :: aerin
+      real (kind=kind_io4),  allocatable, dimension(:,:,:,:) :: aer_pres
+      real (kind=kind_io4),  allocatable, dimension(:,:,:,:,:) :: aerin

       data aer_time/15.5, 45.,  74.5,  105., 135.5, 166., 196.5,
      &             227.5, 258., 288.5, 319., 349.5, 380.5/
diff --git a/physics/aerinterp.F90 b/physics/aerinterp.F90
index dbcf7360..4b3232ab 100644
--- a/physics/aerinterp.F90
+++ b/physics/aerinterp.F90
@@ -181,7 +181,7 @@ contains
              endif
              do i = iamin, iamax
                aerin(i,j,k,ii,imon) = 1.d0*buffx(i,j,klev,1)
-               if(aerin(i,j,k,ii,imon) < 0 .or. aerin(i,j,k,ii,imon) > 1.)  then
+               if(aerin(i,j,k,ii,imon) < 0. .or. aerin(i,j,k,ii,imon) > 1.)  then
                  aerin(i,j,k,ii,imon) = 1.e-15
                endif
              enddo   !i-loop (lon)

While testing, I found that the diag_table_template for the coupled model was not updated correctly in PR #765. Fixing the diag_table_template is expected to break the P7 tests because of the added fields.

DeniseWorthen added the bug Something isn't working label Jul 20, 2021

climbfuji assigned climbfuji and DeniseWorthen Jul 20, 2021

DeniseWorthen changed the title ~~cpld_bmark_wave_v16_p7b fails on cheyenne.intel~~ C384 P7 coupled tests fail on cheyenne Sep 30, 2021

DeniseWorthen linked a pull request Sep 30, 2021 that will close this issue

Update coupled tests to use P7 configuration, add standalone P7 test suite #765

Merged

27 tasks

DeniseWorthen closed this as completed in #765 Oct 1, 2021

DeniseWorthen reopened this Oct 7, 2021

This was referenced Oct 10, 2021

Reduce memory required by MERRA2 option; fix diag_tables for P7 tests; adds C384 P7 tests to Cheyenne.intel #866

Closed

reduce memory required by MERRA2 data NCAR/ccpp-physics#757

Merged

reduce memory required by MERRA2 data NOAA-EMC/fv3atm#410

Closed

climbfuji closed this as completed in NCAR/ccpp-physics#743 Oct 29, 2021

zach1221 mentioned this issue Feb 7, 2024

New regression tests for V2 surface coldstart files #2005

Merged

42 tasks
