[rrfs-mpas-jedi] Updates for running rrfs-workflow on WCOSS2 #803


Merged

Conversation

@SamuelDegelia-NOAA commented May 23, 2025

DESCRIPTION OF CHANGES:

This PR adds config files and various other updates to allow running rrfs-workflow (version 2) on WCOSS2. Results from the 2024052700 retro are compared against results from Hera in #773. The workflow appears to be working as expected on WCOSS2.

A few notes:

  • I had to update MPAS-Model and MPASSIT to build with the Cray-specific wrapper compilers (cc, CC, and ftn), which are needed to correctly handle MPI on WCOSS2. For MPASSIT, this is handled by a hash update. For MPAS-Model, since we only plan to update the model at certain times, I instead replace the Makefile using the _workaround_ method (see the compiler sketch after this list).
  • There were some issues with the sourcing of versions/build.ver, which overwrote the module versions loaded for UPP. To solve this, I added versions/unset.ver, which clears these variables before sorc/UPP/modulefiles/wcoss2.lua is loaded (see the unset sketch after this list).
  • The MPI_RUN_CMD for WCOSS2 needs job configuration info (e.g., NTASKS, PPN) and is thus defined in workflow/sideload/launch.sh instead of in the exp.setup file (see the launch sketch after this list). This means we can probably remove MPI_RUN_CMD from the exp.setup files in a future update.
  • The spack-stack 1.6.0 installation on WCOSS2 does not include some of the Python modules we need for the offline domain check in ioda_bufr, so we instead source a Python virtual environment used for RRFSv1.
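For reference, a minimal sketch of what building with the Cray wrapper compilers amounts to. The actual change replaces the MPAS-Model Makefile, so the environment-variable form below is illustrative only:

```bash
# Illustrative sketch: on WCOSS2 the Cray wrappers resolve MPI through
# the loaded programming environment, so the build should call them
# instead of the bare compilers or the usual mpi* wrappers.
export CC=cc     # C compiler wrapper
export CXX=CC    # C++ compiler wrapper
export FC=ftn    # Fortran compiler wrapper
```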
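Similarly, a sketch of the versions/unset.ver fix; the variable names below are placeholders for whatever versions/build.ver actually sets:

```bash
# versions/unset.ver (sketch): clear version variables exported by
# versions/build.ver so they cannot override the module versions that
# sorc/UPP/modulefiles/wcoss2.lua wants to load.
unset netcdf_ver   # placeholder variable name
unset hdf5_ver     # placeholder variable name

# With the slate clean, load UPP's own modulefile:
module use sorc/UPP/modulefiles
module load wcoss2
```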
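And a sketch of how MPI_RUN_CMD can be assembled in workflow/sideload/launch.sh once the job's resources are known. The mpiexec flags are the usual PALS ones on WCOSS2, but treat the exact form as an assumption:

```bash
# workflow/sideload/launch.sh (sketch): NTASKS and PPN come from the
# job configuration, so the launch command is built here rather than
# hardcoded in exp.setup.
export MPI_RUN_CMD="mpiexec -n ${NTASKS} -ppn ${PPN}"
${MPI_RUN_CMD} "${EXEC}"   # EXEC is an illustrative executable variable
```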

TESTS CONDUCTED:

Ran 24 h of cycling with the default exp/exp.conus12km configuration file. Results are shown in issue #773.

Machines/Platforms:

  • WCOSS2
    • Cactus/Dogwood
    • Acorn
  • RDHPCS
    • Hera
    • Jet
    • Orion
    • Hercules

ISSUE:

Resolves #773

@guoqing-noaa (Contributor)

README.md can be updated to include WCOSS2 now. :)

@guoqing-noaa (Contributor)

@SamuelDegelia-NOAA I agree that MPI_RUN_CMD would be better moved into config_resources/config.${machine}. We should also add a "SCHEDULER" variable to config_resources/config.${machine}.

But that can be done in a separate PR. Thanks!

@SamuelDegelia-NOAA (Author)

> README.md can be updated to include WCOSS2 now. :)

Done!

@guoqing-noaa (Contributor) left a review

LGTM.
Thanks a lot for completing this heavy-lifting work and addressing my comments!

Contributor

I feel like these definitions of things like NDATE and FSYNC shouldn't be needed for WCOSS2. If the prod_util module is loaded, they should be available. Or is something different with the spack libraries?

Author

I just kept the same method for loading the prod_util commands that we used for the other machines: this modulefile manually defines these variables. But it does look like we could just load prod_util through spack-stack (or as a default module, as is available on WCOSS2) and these commands and variables would be available without needing this extra modulefile/prod_util directory.

@guoqing-noaa Do you know why we went with this method that manually defines the NDATE etc. variables instead of just loading prod_util through spack-stack?
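For context, a rough shell sketch of the two approaches being weighed here (paths are placeholders, not the real install locations):

```bash
# Approach 1 (current): a local modulefile manually defines the
# prod_util commands; in shell terms it amounts to something like
export NDATE=/path/to/prod_util/ndate        # placeholder path
export FSYNC=/path/to/prod_util/fsync_file   # placeholder path

# Approach 2 (suggested): let the module system resolve everything,
# pulling in whatever dependencies prod_util declares.
module load prod_util
```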

Contributor

@MatthewPyle-NOAA and @SamuelDegelia-NOAA
As @SamuelDegelia-NOAA mentioned, the wcoss2.lua is essentially a copy of the original prod_util Lua file, whichever is available on each platform.
I don't recall all the details, but I think the reason to load this separately is that we only want to load modules as needed. The workflow only uses err_exit, err_chk, NDATE, and cpreq from prod_util, so we don't need any of the module dependencies of the original prod_util. Also, I think we would load a few extra modules if we loaded prod_util directly from spack-stack.
We may revisit this solution in the future. Thanks!

Contributor

Thanks for the explanations, @SamuelDegelia-NOAA and @guoqing-noaa.

@MatthewPyle-NOAA merged commit 02b4219 into NOAA-EMC:rrfs-mpas-jedi May 27, 2025
4 checks passed
@SamuelDegelia-NOAA deleted the feature/wcoss2_run branch May 27, 2025 16:35