Use spack-stack python on Hercules #1133

Open
Wants to merge 10 commits into base: develop

Conversation

@CoryMartin-NOAA (Contributor)

As the title says, this moves GDASApp entirely, and EVA mostly, off the python environment I maintain (only EVA and EMCPy are still hosted by me).

Note: I have not tested this yet, hence it is still a draft PR.
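
For readers following along, the sketch below shows the general shape of such a modulefile change: point MODULEPATH at the spack-stack installation and load python and the needed py-* packages from it, rather than prepending a personally maintained environment to PATH. This is a minimal sketch with hypothetical module names and versions, not the actual diff in this PR; only the spack-stack path matches the one referenced later in this thread.

-- Minimal sketch of the pattern (hypothetical module names/versions; not this PR's diff).
-- Point Lmod at the spack-stack unified environment on Hercules ...
prepend_path("MODULEPATH", '/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/envs/unified-env/install/modulefiles/Core')
-- ... then load python and packages from the stack instead of a personal env.
load("stack-intel")
load("python")     -- exact module name/version comes from the stack
load("py-numpy")
load("py-pyyaml")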

@aerorahul (Contributor) previously approved these changes May 28, 2024

looks good. just one comment.

@DavidHuber-NOAA (Contributor) left a comment

Looks good. Just one suggestion regarding python.

modulefiles/EVA/hercules.lua (review comment; outdated, resolved)
@CoryMartin-NOAA CoryMartin-NOAA marked this pull request as ready for review May 30, 2024 14:27
@RussTreadon-NOAA (Contributor)

Installed feature/hercules-modules on Hercules, merged develop into the working copy, and built the updated copy inside g-w develop at 67b833e. Also updated the working copy of g-w sorc/jcb to the current head of jcb develop at f62b9df.

Loaded the updated modulefiles/GDAS/hercules.intel.lua and ran test_gdasapp with the following results:

96% tests passed, 2 tests failed out of 47

Label Time Summary:
gdas-utils    =  16.14 sec*proc (9 tests)
script        =  16.14 sec*proc (9 tests)

Total Test time (real) = 1648.01 sec

The following tests FAILED:
        1759 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_ECEN (Failed)
        1763 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY (Failed)

A check of JGDAS_GLOBAL_OCEAN_ANALYSIS_ECEN.out shows that the variable APRUN_OCNANALECEN is not defined:

  File "/work2/noaa/da/rtreadon/git/global-workflow/hercules/sorc/gdas.cd/bundle/gdas/ush/soca/marine_recenter.py", line 182, in run
    exec_cmd_gridgen = Executable(self.config.APRUN_OCNANALECEN)
  File "/work2/noaa/da/python/gdasapp/wxflow/20240528/src/wxflow/attrdict.py", line 79, in __getattr__
    return self.__getitem__(item)
  File "/work2/noaa/da/python/gdasapp/wxflow/20240528/src/wxflow/attrdict.py", line 84, in __missing__
    raise KeyError(name)
KeyError: 'APRUN_OCNANALECEN'

This may reflect an inconsistency between GDASApp and g-w rather than a problem with this PR.

A check of JGDAS_GLOBAL_OCEAN_ANALYSIS_VRFY.out reveals a conda error:

+ slurm_script[52]: set +u
+ slurm_script[53]: conda activate eva
/var/spool/slurmd/job1408168/slurm_script: line 53: conda: command not found
+ slurm_script[1]: postamble slurm_script 1717592082 127

This error appears related to the PR and may require additional changes to modulefiles/EVA/hercules.lua or the soca test.
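
For reference, one way such a change could look (a minimal sketch under assumed paths and module names, not the actual fix) is for modulefiles/EVA/hercules.lua to put a conda installation on PATH so that "conda activate eva" can resolve:

-- Minimal sketch, assuming a site-provided conda installation; the module
-- name and paths below are illustrative, not the actual fix.
load("miniconda3")                                    -- if the site provides a conda module
-- or, lacking a site module, expose a known installation directly:
prepend_path("PATH", "/path/to/miniconda3/bin")
setenv("CONDA_EXE", "/path/to/miniconda3/bin/conda")

Alternatively, the soca VRFY job script could source the conda shell hook itself before calling conda activate.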

@RussTreadon-NOAA (Contributor)

A check of /work2/noaa/da/rtreadon/git/global-workflow/hercules/env shows that APRUN_OCNANALECEN is only defined in ORION.env and HERA.env. We should check HERCULES.env and WCOSS2.env to ensure everything we need for marine DA is defined.

@CoryMartin-NOAA (Contributor, Author)

@RussTreadon-NOAA we should probably combine this with the changes we will need for Orion Rocky9

@RussTreadon-NOAA (Contributor)

Hercules remains down due to system issues, so I cloned feature/hercules-modules on Orion inside a working copy of g-w PR #2700, copied modulefiles/GDAS/hercules.intel.lua to modulefiles/GDAS/orion.intel.lua, and replaced

prepend_path("MODULEPATH", '/work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/envs/unified-env/install/modulefiles/Core')

with

prepend_path("MODULEPATH", '/work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env-rocky9/install/modulefiles/Core')`

Built GDASApp and ran the ctests with the following results:

Test project /work/noaa/da/rtreadon/git/global-workflow/rename_atm/sorc/gdas.cd/build
      Start 1489: test_gdasapp_util_coding_norms
 1/48 Test #1489: test_gdasapp_util_coding_norms ........................   Passed    7.80 sec
      Start 1490: test_gdasapp_util_ioda_example
 2/48 Test #1490: test_gdasapp_util_ioda_example ........................   Passed    1.71 sec
      Start 1491: test_gdasapp_util_prepdata
 3/48 Test #1491: test_gdasapp_util_prepdata ............................   Passed    1.34 sec
      Start 1492: test_gdasapp_util_rads2ioda
 4/48 Test #1492: test_gdasapp_util_rads2ioda ...........................   Passed    0.15 sec
      Start 1493: test_gdasapp_util_ghrsst2ioda
 5/48 Test #1493: test_gdasapp_util_ghrsst2ioda .........................   Passed    0.15 sec
      Start 1494: test_gdasapp_util_rtofstmp
 6/48 Test #1494: test_gdasapp_util_rtofstmp ............................   Passed    1.00 sec
      Start 1495: test_gdasapp_util_rtofssal
 7/48 Test #1495: test_gdasapp_util_rtofssal ............................   Passed    0.88 sec
      Start 1496: test_gdasapp_util_smap2ioda
 8/48 Test #1496: test_gdasapp_util_smap2ioda ...........................   Passed    0.16 sec
      Start 1497: test_gdasapp_util_smos2ioda
 9/48 Test #1497: test_gdasapp_util_smos2ioda ...........................   Passed    0.15 sec
      Start 1498: test_gdasapp_util_viirsaod2ioda
10/48 Test #1498: test_gdasapp_util_viirsaod2ioda .......................   Passed    0.15 sec
      Start 1499: test_gdasapp_util_icecamsr2ioda
11/48 Test #1499: test_gdasapp_util_icecamsr2ioda .......................   Passed    0.14 sec
      Start 1834: test_gdasapp_check_python_norms
12/48 Test #1834: test_gdasapp_check_python_norms .......................   Passed    9.67 sec
      Start 1835: test_gdasapp_check_yaml_keys
13/48 Test #1835: test_gdasapp_check_yaml_keys ..........................   Passed    1.13 sec
      Start 1836: test_gdasapp_jedi_increment_to_fv3
14/48 Test #1836: test_gdasapp_jedi_increment_to_fv3 ....................   Passed    5.35 sec
      Start 1837: test_gdasapp_setup_cycled_exp
15/48 Test #1837: test_gdasapp_setup_cycled_exp .........................   Passed    3.75 sec

The ctests did not finish before the 1 pm EDT Orion downtime. Will rerun the tests once Orion returns to service.

@RussTreadon-NOAA (Contributor)

Reran ctests on Orion after it returned to service. Unfortunately, 15 tests failed:

69% tests passed, 15 tests failed out of 48

Label Time Summary:
gdas-utils    =  25.46 sec*proc (11 tests)
script        =  25.46 sec*proc (11 tests)

Total Test time (real) = 1164.81 sec

The following tests FAILED:
        1843 - test_gdasapp_soca_JGLOBAL_PREP_OCEAN_OBS (Failed)
        1846 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN (Failed)
        1847 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_ECEN (Failed)
        1849 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_CHKPT (Failed)
        1850 - test_gdasapp_soca_JGDAS_GLOBAL_OCEAN_ANALYSIS_POST (Failed)
        1852 - test_gdasapp_soca_incr_handler (Failed)
        1856 - test_gdasapp_snow_apply_jediincr (Failed)
        1862 - test_gdasapp_atm_jjob_var_init (Failed)
        1863 - test_gdasapp_atm_jjob_var_run (Failed)
        1864 - test_gdasapp_atm_jjob_var_inc (Failed)
        1865 - test_gdasapp_atm_jjob_var_final (Failed)
        1866 - test_gdasapp_atm_jjob_ens_init (Failed)
        1867 - test_gdasapp_atm_jjob_ens_run (Failed)
        1868 - test_gdasapp_atm_jjob_ens_inc (Failed)
        1869 - test_gdasapp_atm_jjob_ens_final (Failed)

Examining the soca and atm job log files reveals AttributeError failures. For example, JGDAS_GLOBAL_OCEAN_ANALYSIS_ECEN.out contains

  File "/work/noaa/da/rtreadon/git/global-workflow/rename_atm/sorc/gdas.cd/bundle/gdas/ush/soca/marine_recenter.py", line 48, in __init__
    PDY = self.task_config['PDY']
AttributeError: 'MarineRecenter' object has no attribute 'task_config'
+ JGDAS_GLOBAL_OCEAN_ANALYSIS_ECEN[1]: postamble JGDAS_GLOBAL_OCEAN_ANALYSIS_ECEN 1719531579 1

and atmanlinit-18297168.out contains

 File "/work/noaa/da/rtreadon/git/global-workflow/rename_atm/ush/python/pygfs/task/atm_analysis.py", line 29, in __init__
    super().__init__(config)
  File "/work/noaa/da/rtreadon/git/global-workflow/rename_atm/ush/python/pygfs/task/analysis.py", line 30, in __init__
    self.gdasapp_j2tmpl_dir = os.path.join(self.task_config.PARMgfs, 'gdas')
AttributeError: 'AtmAnalysis' object has no attribute 'task_config'
+ slurm_script[1]: postamble slurm_script 1719531925 1

These tracebacks suggest an issue with wxflow.

Other jobs failed for different reasons. JGDAS_GLOBAL_OCEAN_ANALYSIS_RUN.out contains

 4: Exception: Assertion failed: spaces_.size() >0 in ObsSpaces, line 84 of /work/noaa/da/rtreadon/git/global-workflow/rename_atm/sorc/gdas.cd/bundle/oops/src/oops/base/ObsSpaces.h
10: Exception: Assertion failed: spaces_.size() >0 in ObsSpaces, line 84 of /work/noaa/da/rtreadon/git/global-workflow/rename_atm/sorc/gdas.cd/bundle/oops/src/oops/base/ObsSpaces.h
 6: Exception: Assertion failed: spaces_.size() >0 in ObsSpaces, line 84 of /work/noaa/da/rtreadon/git/global-workflow/rename_atm/sorc/gdas.cd/bundle/oops/src/oops/base/ObsSpaces.h

@aerorahul (Contributor)

Any chance this can be resurrected? Orion is now also fully spack-stack compliant, so updating it as well would be appreciated.

@CoryMartin-NOAA (Contributor, Author)

@aerorahul Yes, I'll build fresh copies on both Orion and Hercules now and try to debug and get this working next week.

@CoryMartin-NOAA (Contributor, Author)

@aerorahul do you know of any issues building UPP in the workflow on Orion?

Building for machine orion, compiler intel
Lmod has detected the following error:  Cannot load module "stack-intel/2022.0.2" without these module(s) loaded:
   intel/2022.1.2

While processing the following module(s):
    Module fullname       Module Filename
    ---------------       ---------------
    stack-intel/2022.0.2  /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env/install/modulefiles/Core/stack-intel/2022.0.2.lua
    orion                 /work2/noaa/da/cmartin/work_aug2024_orion/global-workflow/sorc/ufs_model.fd/FV3/upp/modulefiles/orion.lua
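
For context, this error reflects Lmod's module hierarchy: the stack-intel wrapper in the spack-stack Core modulepath declares the site compiler module as a prerequisite, so a machine modulefile must load the compiler before the wrapper. Below is a minimal sketch of that load order, using the names from the error message above (an updated UPP orion.lua would instead reference the Rocky 9 stack and its matching versions):

-- Minimal sketch of the required load order; not the actual UPP fix.
prepend_path("MODULEPATH", '/work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.6.0/envs/unified-env/install/modulefiles/Core')
load("intel/2022.1.2")        -- site compiler module; must be loaded first
load("stack-intel/2022.0.2")  -- spack-stack wrapper that requires it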

@CoryMartin-NOAA (Contributor, Author)

NOAA-EMC/global-workflow#2694 is still open, so I assume the workflow does not work out of the box on Orion?

@aerorahul (Contributor)

@aerorahul do you know of any issues building UPP in the workflow on Orion?

Yes. The UPP submodule in the ufs-weather-model does not have the Rocky 9 updates yet. That PR is in progress.

@aerorahul (Contributor)

NOAA-EMC/global-workflow#2694 is still open, so I assume the workflow does not work out of the box on Orion?

Correct.
Cycling with GSI is also 2x slower on Orion.

@RussTreadon-NOAA (Contributor)

@CoryMartin-NOAA, should we roll this PR into the GDASApp upgrade to spack-stack/1.8 (GDASApp issue #1283)?

@CoryMartin-NOAA (Contributor, Author)

@RussTreadon-NOAA yes, I think so at this point.
