App=S2S does not compile with Cheyenne.gnu; failure is not reported in regression test #697

DeniseWorthen · 2021-07-20T13:34:04Z

Description

The coupled model does not compile on Cheyenne.gnu after the recent upgrade to a newer version. The compile failure is not reported, so the RT for Cheyenne.gnu appears to be successful.

To Reproduce:

Cheyenne.gnu
Check the compile_006/err file in the regression test run directory.
The RT log for PR #639 cheyenne.gnu shows only 5 compile jobs; the compile jobs for app=s2s are missing but the RT job is reported as being successful.

Additional context

Trying to compile the following in Cheyenne.gnu:

COMPILE | -DAPP=S2S -DCCPP_SUITES=FV3_GFS_2017_coupled,FV3_GFS_v16_coupled,FV3_GFS_v16_coupled_nsstNoahmpUGWPv1 -DDEBUG=ON        | - wcoss_cray                            | fv3 |

gives the attached err log file.

err.txt


DusanJovic-NOAA · 2021-07-20T16:01:01Z

Please run:

RT_COMPILER=gnu ./rt.sh -n cpld_debug -l rt_gnu.conf -e

on Cheyenne and see what the slurm status is at the end of log_cheyenne.gnu/compile_001.log

DeniseWorthen · 2021-07-20T16:31:41Z

TEST 001 compile is submitted
1 min. TEST 001 compile is waiting in a queue,  status: Q jobid 9550867
2 min. TEST 001 compile is waiting in a queue,  status: Q jobid 9550867
3 min. TEST 001 compile is waiting in a queue,  status: Q jobid 9550867
4 min. TEST 001 compile is waiting in a queue,  status: Q jobid 9550867
5 min. TEST 001 compile is waiting in a queue,  status: Q jobid 9550867
6 min. TEST 001 compile is running,  status: R jobid 9550867
qstat: 9550867.chadmin1.ib0.cheyenne.ucar.edu Job has finished, use -x or -H to obtain historical job information
qstat: 9550867.chadmin1.ib0.cheyenne.ucar.edu Job has finished, use -x or -H to obtain historical job information
7 min. TEST 001 compile is finished,  status: - jobid 9550867
+ cp '/glade/scratch/worthen/FV3_RT/rt_44866/compile_001/compile_*_time.log' /glade/work/worthen/ufs-weather-model-gnu/tests/log_cheyenne.gnu
cp: cannot stat '/glade/scratch/worthen/FV3_RT/rt_44866/compile_001/compile_*_time.log': No such file or directory

DusanJovic-NOAA · 2021-07-20T16:49:23Z

@DeniseWorthen Thanks. I see Cheyenne is using the PBS scheduler, not slurm.
Can you please make this change in rt_utils.sh and rerun:

$ git diff rt_utils.sh
diff --git a/tests/rt_utils.sh b/tests/rt_utils.sh
index cfe5e7c..6aa80f0 100755
--- a/tests/rt_utils.sh
+++ b/tests/rt_utils.sh
@@ -126,7 +126,7 @@ submit_and_wait() {
         status_label='held in a queue'
       elif [[ $status = 'R' ]];  then
         status_label='running'
-      elif [[ $status = 'E' ]] || [[ $status = 'C' ]];  then
+      elif [[ $status = 'E' ]] || [[ $status = 'C' ]] || [[ $status = '-' ]];  then
         status_label='finished'
         test_status='DONE'
         exit_status=$( qstat ${jobid} -x -f | grep Exit_status | awk '{print $3}')
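
The effect of the one-line change can be tried without a scheduler. Below is a minimal sketch, not the actual rt_utils.sh code: `label_for_status` and `parse_exit_status` are hypothetical helper names, the `H` state for held jobs is an assumption, and the sample `Exit_status` line is made up for illustration. It shows that once `-` is accepted alongside `E` and `C`, the job is labelled finished and its exit status can be read with the same grep/awk pipeline as in the diff.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a one-letter scheduler state to a label.
# E, C and '-' are all treated as finished, following the diff above.
# The Q and R labels follow the log output quoted in this thread;
# H for a held job is an assumption.
label_for_status() {
  local status=$1
  case "$status" in
    Q)     echo 'waiting in a queue' ;;
    H)     echo 'held in a queue' ;;
    R)     echo 'running' ;;
    E|C|-) echo 'finished' ;;
    *)     echo 'unknown' ;;
  esac
}

# Hypothetical helper: extract the value from a line such as
# "Exit_status = 1" using the same grep/awk pipeline as the diff.
# It reads from stdin so it can be exercised with sample text.
parse_exit_status() {
  grep Exit_status | awk '{print $3}'
}

# Made-up sample line, not real qstat output.
sample='    Exit_status = 1'

label_for_status '-'
printf '%s\n' "$sample" | parse_exit_status
```

Without the `-` branch, a job in that state falls through to the unknown case and its exit status is never checked, which is consistent with the compile failure going unreported.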

DeniseWorthen · 2021-07-20T17:28:04Z

This is the log_cheyenne.gnu/compile_001.log now:

TEST 001 compile is waiting to enter the queue
TEST 001 compile is submitted
1 min. TEST 001 compile is waiting in a queue,  status: Q jobid 9551700
2 min. TEST 001 compile is waiting in a queue,  status: Q jobid 9551700
3 min. TEST 001 compile is waiting in a queue,  status: Q jobid 9551700
4 min. TEST 001 compile is running,  status: R jobid 9551700
qstat: 9551700.chadmin1.ib0.cheyenne.ucar.edu Job has finished, use -x or -H to obtain historical job information
qstat: 9551700.chadmin1.ib0.cheyenne.ucar.edu Job has finished, use -x or -H to obtain historical job information
5 min. TEST 001 compile is finished,  status: - jobid 9551700


Test 001 compile FAIL

DusanJovic-NOAA · 2021-07-20T17:32:12Z

Now the status of the compile job is 'FAIL'. rt.sh should now cancel all RUN jobs that depend on this COMPILE job, and the final status of the regression test should be FAILED. Can you confirm this?

DeniseWorthen · 2021-07-20T17:34:17Z

Yes, I will try the standard rt_gnu.conf at the current develop.

DusanJovic-NOAA · 2021-07-20T17:34:57Z

And there should be a fail_test file in the tests directory.
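
As a rough illustration of how a fail_test file can drive the final result, here is a minimal sketch. It is not the rt.sh implementation: `summarize_rt` is a hypothetical function, the success message is invented, and the file is assumed to hold one failed test per line (for example `compile 006 failed`).

```shell
#!/usr/bin/env bash
# Hypothetical summary step: decide the overall result from a fail_test file.
# Assumes one failed test per line; an empty or missing file means no failures.
summarize_rt() {
  local fail_file=$1
  if [[ -s "$fail_file" ]]; then
    echo 'FAILED TESTS:'
    cat "$fail_file"
    echo 'REGRESSION TEST FAILED'
    return 1
  fi
  # Invented message, for illustration only.
  echo 'no failures recorded'
  return 0
}

# Try it on a temporary file standing in for tests/fail_test.
tmp=$(mktemp)
printf 'compile 006 failed\n' > "$tmp"
summarize_rt "$tmp" || echo "summarize_rt returned $?"
rm -f "$tmp"
```

A non-zero return code lets a calling script or CI job treat the whole regression test as failed rather than relying on someone reading the log.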

DeniseWorthen · 2021-07-21T11:57:05Z

On cheyenne.gnu, I now see the compile failures reported (in fail_test)

compile 007 failed
compile 008 failed
compile 006 failed

The Regression test log also now correctly reports failure:

FAILED TESTS:
Test compile 007 failed failed
Test compile 008 failed failed
Test compile 006 failed failed

REGRESSION TEST FAILED
Tue Jul 20 20:35:40 MDT 2021
Elapsed time: 08h:43m:59s. Have a nice day!

## DOCUMENTATION:
This PR removes the `FV3_CPT_v0`, `FV3_GSD_v0`, and `FV3_GSD_SAR` suites from the workflow. This consists of:

1. Removing these suites from ex-scripts, templates, and the set of valid values for the variable `CCPP_PHYS_SUITE`.
2. Removing the `diag_table_...` and `field_table_...` files for these suites.
3. Removing WE2E tests in the `grids_extrn_mdls_suites_community` category (which are tests to make sure that specific combinations of grids, external models, and suites work well together) that use these suites.
4. Modifying the three WE2E tests in the `wflow_features` category (`get_from_HPSS_ics_HRRR_lbcs_RAP`, `get_from_HPSS_ics_RAP_lbcs_RAP`, and `specify_DT_ATMOS_LAYOUT_XY_BLOCKSIZE`) that happen to use the `FV3_GSD_SAR` suite such that they now use the `FV3_HRRR` suite. (There are no such tests that use the `FV3_CPT_v0` and `FV3_GSD_v0` suites.) Note that we don't remove these tests because their purpose is not to test the suite but to test fetching of files from HPSS (`get_from_HPSS_ics_HRRR_lbcs_RAP` and `get_from_HPSS_ics_RAP_lbcs_RAP`) and to test that the experiment variables `DT_ATMOS`, `LAYOUT_X`, `LAYOUT_Y`, and `BLOCKSIZE` can be correctly specified in the user's experiment configuration file (`specify_DT_ATMOS_LAYOUT_XY_BLOCKSIZE`).
5. Updating comments in scripts that may refer to one of these three suites.

This PR also makes improvements to the `tests/get_expts_status.sh` script that is used to check the status of a set of experiments in a specified directory.

## DEPENDENCIES:
PR #[224](ufs-community/ufs-srweather-app#224) in the `ufs-srweather-app` repo.

## TESTS CONDUCTED:
Ran the following tests on Hera:
```
grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1alpha
grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
nco_grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_HRRR
get_from_HPSS_ics_HRRR_lbcs_RAP
get_from_HPSS_ics_RAP_lbcs_RAP
specify_DT_ATMOS_LAYOUT_XY_BLOCKSIZE
```
All succeeded. Also, since the modifications to the `FV3.input.yml` file affect the `FV3_RRFS_v1alpha`, `FV3_RRFS_v1beta`, and `FV3_HRRR` suites, the `input.nml` files for these suites generated using the (original) `develop` branch were compared to the ones generated using this branch/PR, and all were found to be identical.

## ISSUE (optional):
Resolves Issue #668.

## DESCRIPTION OF CHANGES:
Several paths in the machine-specific files point to locations in user paths or old locations of static data. This PR updates paths of static data in regional_workflow/ush/machine/ to point to the official, centralized locations on Cheyenne, Hera, and Jet.

## TESTS CONDUCTED:
Ran the following suite of end-to-end tests on Cheyenne and Jet prior to the latest ufs-weather-model hash update. All passed. This list of tests was chosen because all of these tests are known to succeed on all tested platforms, and this tests a variety of input and boundary condition types.

- grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
- grid_RRFS_CONUS_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
- grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
- grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
- grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
- grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
- grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_HRRR
- grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
- grid_RRFS_CONUS_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta

On Hera, I ran tests with the latest SRW hash, which included the updated weather model. Because of this, many tests could not be generated due to using old, removed CCPP suites (see issue #668). To get around this issue, I tested with the fixes from #697 incorporated into my branch. With those extra commits, all "get_extrn_ics" and "get_extrn_lbcs" tasks completed successfully, which indicates that all data is in its correct place.

## ISSUE (optional):
Will resolve a few issues in #673, many remain however.

DeniseWorthen added the bug Something isn't working label Jul 20, 2021

climbfuji assigned climbfuji, DeniseWorthen and DusanJovic-NOAA Jul 20, 2021

climbfuji mentioned this issue Jul 22, 2021

Wrapper PR for: Thompson inner loop, Thompson subcycling bugfix, remove snet from noah lsm, fix time dimension in restart files, rt.sh bugfix for PBS, and more! #702

Merged

13 tasks

junwang-noaa closed this as completed in #702 Jul 23, 2021
