App=S2S does not compile with Cheyenne.gnu; failure is not reported in regression test #697

Closed
DeniseWorthen opened this issue Jul 20, 2021 · 8 comments · Fixed by #702
Labels
bug Something isn't working

Comments

@DeniseWorthen
Collaborator

DeniseWorthen commented Jul 20, 2021

Description

The coupled model does not compile on Cheyenne.gnu after the recent upgrade to a newer version. The compile failure is not reported, so the RT for Cheyenne.gnu appears to be successful.

To Reproduce:

Cheyenne.gnu
Check the compile_006/err file in the regression test run directory.
The RT log for PR #639 cheyenne.gnu shows only 5 compile jobs; the compile jobs for app=s2s are missing but the RT job is reported as being successful.

Additional context

Trying to compile the following in Cheyenne.gnu:

COMPILE | -DAPP=S2S -DCCPP_SUITES=FV3_GFS_2017_coupled,FV3_GFS_v16_coupled,FV3_GFS_v16_coupled_nsstNoahmpUGWPv1 -DDEBUG=ON        | - wcoss_cray                            | fv3 |

gives the attached err log file.

err.txt

@DeniseWorthen DeniseWorthen added the bug Something isn't working label Jul 20, 2021
@DusanJovic-NOAA
Collaborator

DusanJovic-NOAA commented Jul 20, 2021

Please run:

RT_COMPILER=gnu ./rt.sh -n cpld_debug -l rt_gnu.conf -e

on Cheyenne and see what the slurm status is at the end of log_cheyenne.gnu/compile_001.log

@DeniseWorthen
Collaborator Author

TEST 001 compile is submitted
1 min. TEST 001 compile is waiting in a queue,  status: Q jobid 9550867
2 min. TEST 001 compile is waiting in a queue,  status: Q jobid 9550867
3 min. TEST 001 compile is waiting in a queue,  status: Q jobid 9550867
4 min. TEST 001 compile is waiting in a queue,  status: Q jobid 9550867
5 min. TEST 001 compile is waiting in a queue,  status: Q jobid 9550867
6 min. TEST 001 compile is running,  status: R jobid 9550867
qstat: 9550867.chadmin1.ib0.cheyenne.ucar.edu Job has finished, use -x or -H to obtain historical job information
qstat: 9550867.chadmin1.ib0.cheyenne.ucar.edu Job has finished, use -x or -H to obtain historical job information
7 min. TEST 001 compile is finished,  status: - jobid 9550867
+ cp '/glade/scratch/worthen/FV3_RT/rt_44866/compile_001/compile_*_time.log' /glade/work/worthen/ufs-weather-model-gnu/tests/log_cheyenne.gnu
cp: cannot stat '/glade/scratch/worthen/FV3_RT/rt_44866/compile_001/compile_*_time.log': No such file or directory

@DusanJovic-NOAA
Collaborator

@DeniseWorthen Thanks. I see Cheyenne uses the PBS scheduler, not Slurm.
Can you please make this change in rt_utils.sh and rerun:

$ git diff rt_utils.sh
diff --git a/tests/rt_utils.sh b/tests/rt_utils.sh
index cfe5e7c..6aa80f0 100755
--- a/tests/rt_utils.sh
+++ b/tests/rt_utils.sh
@@ -126,7 +126,7 @@ submit_and_wait() {
         status_label='held in a queue'
       elif [[ $status = 'R' ]];  then
         status_label='running'
-      elif [[ $status = 'E' ]] || [[ $status = 'C' ]];  then
+      elif [[ $status = 'E' ]] || [[ $status = 'C' ]] || [[ $status = '-' ]];  then
         status_label='finished'
         test_status='DONE'
         exit_status=$( qstat ${jobid} -x -f | grep Exit_status | awk '{print $3}')
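
The status handling touched by this diff can be sketched on its own as follows. This is a minimal illustration, not the actual code in rt_utils.sh: the function name `pbs_status_label` is hypothetical, the `Q` label is taken from the log lines above, and the `H` branch for the "held in a queue" label is an assumption, since the diff context does not show which status letter it belongs to.

```shell
# Hypothetical helper (not part of rt_utils.sh): map a PBS job status
# code, as printed in the RT log, to a human-readable label.
pbs_status_label() {
  local status=$1
  case "$status" in
    Q) echo 'waiting in a queue' ;;   # seen in the log: "status: Q"
    H) echo 'held in a queue' ;;      # assumption: H is the held state
    R) echo 'running' ;;
    # '-' is what the log shows once qstat no longer lists the job;
    # the diff treats it the same way as 'E' and 'C'.
    E|C|-) echo 'finished' ;;
    *) echo 'unknown' ;;
  esac
}

pbs_status_label -
```

Before the change, a status of `-` matched none of the "finished" conditions, so the branch that sets `test_status='DONE'` and reads `Exit_status` was never reached for that job.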

@DeniseWorthen
Collaborator Author

This is the log_cheyenne.gnu/compile_001.log now:

TEST 001 compile is waiting to enter the queue
TEST 001 compile is submitted
1 min. TEST 001 compile is waiting in a queue,  status: Q jobid 9551700
2 min. TEST 001 compile is waiting in a queue,  status: Q jobid 9551700
3 min. TEST 001 compile is waiting in a queue,  status: Q jobid 9551700
4 min. TEST 001 compile is running,  status: R jobid 9551700
qstat: 9551700.chadmin1.ib0.cheyenne.ucar.edu Job has finished, use -x or -H to obtain historical job information
qstat: 9551700.chadmin1.ib0.cheyenne.ucar.edu Job has finished, use -x or -H to obtain historical job information
5 min. TEST 001 compile is finished,  status: - jobid 9551700


Test 001 compile FAIL

@DusanJovic-NOAA
Collaborator

Now the status of the compile job is 'FAIL'. rt.sh should now cancel all RUN jobs that depend on this COMPILE job, and the final status of the regression test should be FAILED. Can you confirm this?
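
For reference, here is a rough sketch of how a pass/fail result could be derived from the `Exit_status` field, using the same `grep | awk` pipeline that appears in the rt_utils.sh diff context. The function name `compile_result_from_qstat`, the `PASS` label, and the sample text are assumptions made for illustration; the function reads text from stdin rather than calling `qstat`, so it can be tried without a scheduler.

```shell
# Hypothetical helper: read `qstat -x -f`-style text on stdin and print
# PASS if Exit_status is 0, FAIL otherwise (including when it is missing).
compile_result_from_qstat() {
  local exit_status
  # Same extraction as in the diff context: third field of the matching line.
  exit_status=$(grep Exit_status | awk '{print $3}')
  if [[ $exit_status == 0 ]]; then
    echo PASS
  else
    echo FAIL
  fi
}

# Illustrative input only, not a real job record.
sample='Job Id: 0000000.example
    Exit_status = 2'
printf '%s\n' "$sample" | compile_result_from_qstat
```

Treating a missing `Exit_status` as a failure is a design choice in this sketch: it avoids reporting success when the job record cannot be read.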

@DeniseWorthen
Collaborator Author

Yes, I will try the standard rt_gnu.conf at the current develop.

@DusanJovic-NOAA
Collaborator

And there should be a fail_test file in the tests directory.

@DeniseWorthen
Collaborator Author

On cheyenne.gnu, I now see the compile failures reported (in fail_test)

compile 007 failed
compile 008 failed
compile 006 failed

The Regression test log also now correctly reports failure:

FAILED TESTS:
Test compile 007 failed failed
Test compile 008 failed failed
Test compile 006 failed failed

REGRESSION TEST FAILED
Tue Jul 20 20:35:40 MDT 2021
Elapsed time: 08h:43m:59s. Have a nice day!
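
The relationship between the fail_test entries and the final report can be sketched as below. This is an assumption-laden illustration, not the logic in rt.sh: the function name `summarize_fail_file` and the success message are hypothetical, and only the failure lines are modeled on the output quoted above.

```shell
# Hypothetical helper: print a final verdict from a fail_test-style file
# in which each line names one failed job, e.g. "compile 007 failed".
summarize_fail_file() {
  local fail_file=$1
  if [[ -s $fail_file ]]; then
    echo 'FAILED TESTS:'
    while IFS= read -r line; do
      # Each entry already ends in "failed", which would explain the
      # doubled word in the report lines.
      echo "Test ${line} failed"
    done < "$fail_file"
    echo 'REGRESSION TEST FAILED'
    return 1
  fi
  echo 'REGRESSION TEST WAS SUCCESSFUL'   # assumed wording
}

# Demo with a temporary file standing in for tests/fail_test.
demo=$(mktemp)
printf 'compile 007 failed\n' > "$demo"
summarize_fail_file "$demo" || echo 'non-zero return: at least one job failed'
rm -f "$demo"
```

The non-zero return value lets a calling script or CI job detect the failure without parsing the log text.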

epic-cicd-jenkins pushed a commit that referenced this issue Apr 17, 2023

## DOCUMENTATION:
This PR removes the `FV3_CPT_v0`, `FV3_GSD_v0`, and `FV3_GSD_SAR` suites from the workflow.  This consists of:
1. Removing these suites from ex-scripts, templates, and the set of valid values for the variable `CCPP_PHYS_SUITE`.
2. Removing the `diag_table_...` and `field_table_...` files for these suites.
3. Removing WE2E tests in the `grids_extrn_mdls_suites_community` category (which are tests to make sure that specific combinations of grids, external models, and suites work well together) that use these suites.
4. Modifying the three WE2E tests in the `wflow_features` category (`get_from_HPSS_ics_HRRR_lbcs_RAP`, `get_from_HPSS_ics_RAP_lbcs_RAP`, and `specify_DT_ATMOS_LAYOUT_XY_BLOCKSIZE`) that happen to use the `FV3_GSD_SAR` suite such that they now use the `FV3_HRRR` suite. (There are no such tests that use the `FV3_CPT_v0` and `FV3_GSD_v0` suites.)  Note that we don't remove these tests because their purpose is not to test the suite but to test fetching of files from HPSS (`get_from_HPSS_ics_HRRR_lbcs_RAP` and `get_from_HPSS_ics_RAP_lbcs_RAP`) and to test that the experiment variables `DT_ATMOS`, `LAYOUT_X`, `LAYOUT_Y`, and `BLOCKSIZE` can be correctly specified in the user's experiment configuration file (`specify_DT_ATMOS_LAYOUT_XY_BLOCKSIZE`).
5. Updating comments in scripts that may refer to one of these three suites.

This PR also makes improvements to the `tests/get_expts_status.sh` script that is used to check the status of a set of experiments in a specified directory.

## DEPENDENCIES:
PR #[224](ufs-community/ufs-srweather-app#224) in the `ufs-srweather-app` repo.

## TESTS CONDUCTED:
Ran the following tests on Hera:
```
grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1alpha
grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
nco_grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_HRRR
get_from_HPSS_ics_HRRR_lbcs_RAP
get_from_HPSS_ics_RAP_lbcs_RAP
specify_DT_ATMOS_LAYOUT_XY_BLOCKSIZE
```
All succeeded.  Also, since the modifications to the `FV3.input.yml` file affect the `FV3_RRFS_v1alpha`, `FV3_RRFS_v1beta`, and `FV3_HRRR` suites, the `input.nml` files for these suites generated using the (original) `develop` branch were compared to the ones generated using this branch/PR, and all were found to be identical.

## ISSUE (optional): 
Resolves Issue #668.
epic-cicd-jenkins pushed a commit that referenced this issue Apr 17, 2023
## DESCRIPTION OF CHANGES: 
Several paths in the machine-specific files point to locations in user paths or old locations of static data. This PR updates paths of static data in regional_workflow/ush/machine/ to point to the official, centralized locations on Cheyenne, Hera, and Jet.

## TESTS CONDUCTED: 
Ran the following suite of end-to-end tests on Cheyenne and Jet prior to the latest ufs-weather-model hash update. All passed. This list of tests was chosen because all of these tests are known to succeed on all tested platforms, and this tests a variety of input and boundary condition types.

- grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
- grid_RRFS_CONUS_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
- grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
- grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
- grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
- grid_RRFS_CONUS_25km_ics_HRRR_lbcs_HRRR_suite_RRFS_v1beta
- grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_HRRR
- grid_RRFS_CONUS_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
- grid_RRFS_CONUS_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta


On Hera, I ran tests with the latest SRW hash, which included the updated weather model. Because of this, many tests could not be generated due to using old, removed CCPP suites (see issue #668). To get around this issue, I tested with the fixes from #697 incorporated into my branch. With those extra commits, all "get_extrn_ics" and "get_extrn_lbcs" tasks completed successfully, which indicates that all data is in its correct place.

## ISSUE (optional): 
Will resolve a few of the issues in #673; many remain, however.