Add module files for building SCM with spack-stack on Derecho, Hera, Jet, Orion #406

Merged
merged 16 commits into from
Dec 2, 2023


mkavulich
Collaborator

@mkavulich mkavulich commented Nov 8, 2023

This PR introduces modulefiles for building SCM for Derecho (Intel) and Hera (Intel, GNU). It should be fairly easy to add analogous modulefiles for other EPIC-supported platforms, so let me know if that's desired.

I ran the regression test suite and there were some differences on Hera, as expected. These differences were almost entirely at the precision noise level (<1e-10), except for a few tests that had isolated significant differences. The vast majority of diffs across all fields and all tests were exactly 0. Differences from the baseline (compiled from the top of develop with the old shell environment files) can be found in the files in the following directories if anyone wants to take a closer look:

  • /scratch1/BMC/gmtb/CCPP_regression_testing/NCAR_scm/modulefile_updates/ccpp-scm/test/artifact-release_intel
  • /scratch1/BMC/gmtb/CCPP_regression_testing/NCAR_scm/modulefile_updates/gnu/ccpp-scm/test/artifact-release_gnu

Documentation has been updated in the .tex files, but I haven't been able to re-build the PDF yet. For now instructions for building on Derecho are here:

https://docs.google.com/document/d/1Wg5dBIzwhjoYf6BhgmsUJczPEfxD3dA2yTRS5ftSICk/edit

load(pathJoin("intel-classic", os.getenv("intel_classic_ver") or "2023.0.0"))
load(pathJoin("cray-mpich", os.getenv("cray_mpich_ver") or "8.1.25"))

prepend_path("MODULEPATH","/glade/work/epicufsrt/contrib/derecho/hpc-stack/intel-classic-2023.0.0/modulefiles/stack")
Collaborator

I hope you are aware that hpc-stack is essentially frozen and that EPIC, EMC and the UFS community have moved on to spack-stack. spack-stack modules are available on all platforms (please correct me if I am wrong) and I am pretty sure they can be used as drop-in replacements for hpc-stack modules. On Derecho, you'd also have a gnu stack available, by the way: https://spack-stack.readthedocs.io/en/1.5.1/PreConfiguredSites.html#ncar-wyoming-derecho

What's more, you wouldn't need conda to build Python envs on top of the software stack, since everything should be available (I am happy to try this for you).

Collaborator

I think that there is still a reliance on some non-standard python packages, like f90nml. Would we still need to create a python environment on top of the one from spack-stack in this case?

Collaborator

@climbfuji climbfuji Nov 8, 2023

Give me a few minutes to try this and answer your question please

Collaborator Author

@climbfuji Thanks for the link; last I had checked spack-stack was still not supported on Derecho so I stuck with the libraries I knew would work. I will see if those work for the SCM build.

Collaborator Author

@climbfuji I wasn't able to get spack-stack to work on Derecho. I receive the following error when attempting to load the spack-stack modules:

   libfabric/1.15.2.0

While processing the following module(s):
    Module fullname          Module Filename
    ---------------          ---------------
    stack-cray-mpich/8.1.25  /glade/work/epicufsrt/contrib/spack-stack/derecho/spack-stack-1.5.1/envs/unified-env/install/modulefiles/intel/2021.10.0/stack-cray-mpich/8.1.25.lua
    derecho_intel            /glade/derecho/scratch/kavulich/SCM/PR_406/ccpp-scm/scm/etc/modules/derecho_intel.lua

Which doesn't make sense to me, because libfabric/1.15.2.0 is supposed to be loaded inside the stack-cray-mpich/8.1.25 module (and trying to manually load it right before that step doesn't work either). I'll admit I'm a little unclear on the finer details of these modules though, is it possible I'm not loading these in the right order or something? The modulefile I'm using is below:

help([[
This module loads libraries for building the CCPP Single-Column Model on
the CISL machine Derecho (Cray) using Intel-classic-2023.0.0
]])

whatis([===[Loads libraries needed for building the CCPP SCM on Derecho ]===])

load(pathJoin("cmake", os.getenv("cmake_ver") or "3.26.3"))
load(pathJoin("ncarenv", os.getenv("ncarenv_ver") or "23.06"))
load(pathJoin("craype", os.getenv("craype_ver") or "2.7.20"))

prepend_path("MODULEPATH","/glade/work/epicufsrt/contrib/spack-stack/derecho/spack-stack-1.5.1/envs/unified-env/install/modulefiles/Core")
load("stack-intel/2021.10.0")
load("stack-cray-mpich/8.1.25")
load("stack-python/3.10.8")

load("bacio/2.4.1")
load("sp/2.3.3")
load("w3emc/2.9.2")

setenv("CC","cc")
setenv("FC","ftn")
setenv("CXX","CC")

setenv("CMAKE_C_COMPILER","cc")
setenv("CMAKE_CXX_COMPILER","CC")
setenv("CMAKE_Fortran_COMPILER","ftn")
setenv("CMAKE_Platform","derecho.intel")

Collaborator Author

I think that there is still a reliance on some non-standard python packages, like f90nml. Would we still need to create a python environment on top of the one from spack-stack in this case?

@grantfirl It looks like we do need to keep our own python environment, at least for now (the spack-stack environment does not contain f90nml as you expected).

Collaborator

It does, module load py-f90nml

Collaborator Author

Thanks again @climbfuji. So all the python packages in spack-stack are loaded via modules? I didn't think to look there, it looks like a lot of good packages 👍

Collaborator

Yes, all of them are modules and all python packages start with py-.

Collaborator

You can extend these spack python environments in case packages are missing: load whatever you can as modules from spack-stack, then create a virtual environment (python3 -m venv venv), and then install the missing packages via pip (python3 -m pip install NAME). This way, all the spack-stack Python utilities are used unless there are version conflicts.
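
One way to find out which packages still need a pip install after the spack-stack modules are loaded is to ask Python which ones it can already import. This is a minimal sketch, assuming it is run with the interpreter from the loaded Python module and the py-* modules already loaded; the function name and the package list are illustrative and not part of the SCM:

```python
import importlib.util


def missing_packages(import_names):
    """Return the names from import_names that this interpreter cannot import."""
    missing = []
    for name in import_names:
        # find_spec returns None when a top-level module is not on the search path
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical list of import names; adjust to what the scripts actually use.
    wanted = ["f90nml", "netCDF4"]
    for name in missing_packages(wanted):
        print(f"not provided by the loaded modules: {name}")
```

Anything this reports would be a candidate for `python3 -m pip install NAME` inside the virtual environment; everything else keeps coming from spack-stack.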

prepend_path("MODULEPATH","/contrib/sutils/modulefiles")
load("sutils")

prepend_path("MODULEPATH", "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/modulefiles/Core")
Collaborator

Can we use the same version of spack-stack as in ufs-weather-model: https://github.com/ufs-community/ufs-weather-model/blob/develop/modulefiles/ufs_hera.intel.lua?

Collaborator

Yes - ufs will move to 1.5.1 shortly. Do you want me to create a PR and update to 1.5.1? This way you don't waste your time if something goes less smooth than I was bragging.

Collaborator

@mkavulich Can you update this PR to at least spack-stack-1.5.0? Then, once ufs moves to 1.5.1, we can try to follow suit. I'd like to keep the SCM using the same stack as ufs-wm going forward; we just haven't been paying attention for a while due to not having a release.

@climbfuji
Collaborator

I'll push a branch with an example spack-stack file for hera/intel shortly. The only thing that's missing, I think, is the optional doxygen. We can consider adding that to spack-stack (but I remember it's a bit of a tricky package). We can also consider adding a separate template for the scm so that users who just want that (and not the full ufs-weather-model) only have to build a few libraries. But then again, the ufs-weather-model template should be good enough.

@grantfirl grantfirl changed the title Add module files for building SCM on Derecho, Hera Add module files for building SCM on Derecho, Hera + spack-stack Nov 16, 2023
@mkavulich
Collaborator Author

@DomHeinzeller @grantfirl I have updated the new modules to all use spack-stack 1.5.1, and also added one for Derecho GNU. I did re-run the regression tests for Hera Intel/GNU and they all passed but I have not done a comparison with the main branch baseline for Hera to ensure close-ish results; I can do that if you'd like but I just haven't had time yet.

I also rebased my branch on the latest develop to fix the CI tests, all now seem to be passing.

@grantfirl
Collaborator

@mkavulich I've tried loading hera_intel on Hera with this code, and it works fine for me. I'm guessing that we'll need to tell folks to manually set the SCM_ROOT variable or does it make any sense to try to set it via the lua file?

@grantfirl
Collaborator

grantfirl commented Nov 16, 2023

@mkavulich @dustinswales How could we use spack-stack for the CI tests? I'm guessing that if we want to switch that over too, we'll do that in a separate PR?

For example, see https://github.com/JCSDA/spack-stack/blob/release/1.5.1/.github/workflows/ubuntu-ci-x86_64.yaml for setting up the environment?

@mkavulich
Collaborator Author

@mkavulich I've tried loading hera_intel on Hera with this code, and it works fine for me. I'm guessing that we'll need to tell folks to manually set the SCM_ROOT variable or does it make any sense to try to set it via the lua file?

I've been meaning to talk to you about that. It seems to me like using this variable is unnecessary complexity, if this is set automatically through a setup script or modulefile why don't we just set it directly in the python script?

@grantfirl
Collaborator

@mkavulich I've tried loading hera_intel on Hera with this code, and it works fine for me. I'm guessing that we'll need to tell folks to manually set the SCM_ROOT variable or does it make any sense to try to set it via the lua file?

I've been meaning to talk to you about that. It seems to me like using this variable is unnecessary complexity, if this is set automatically through a setup script or modulefile why don't we just set it directly in the python script?

Ya, that should work fine. The whole idea of having SCM_ROOT in the first place was to allow for flexibility with respect to where executables are stored and where the output goes. In the run script, we could check if the SCM_ROOT environment variable exists. If so, use it, if not, find the top level ccpp-scm directory above where the run script is being called and use that.
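
The fallback described above could be sketched roughly as follows. This is only an illustration, not the code that was merged: the function name is made up, and it assumes the top-level checkout directory is literally named ccpp-scm, which will not hold for a clone made under a different name.

```python
import os
from pathlib import Path


def resolve_scm_root(script_path, environ=None, marker="ccpp-scm"):
    """Pick the SCM root: the SCM_ROOT variable if set, else a parent directory.

    `marker` is an assumed name for the top-level checkout directory.
    """
    environ = os.environ if environ is None else environ
    root = environ.get("SCM_ROOT")
    if root:
        return Path(root)
    # Walk upward from the script's location until a directory named `marker` is found.
    here = Path(script_path).resolve()
    for parent in here.parents:
        if parent.name == marker:
            return parent
    raise RuntimeError(f"SCM_ROOT is not set and no '{marker}' directory is above {here}")
```

A run script could call `resolve_scm_root(__file__)` once at startup, so users who set SCM_ROOT keep the flexibility and everyone else gets a working default.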

@climbfuji
Collaborator

@mkavulich @dustinswales How could we use spack-stack for the CI tests? I'm guessing that if we want to switch that over too, we'll do that in a separate PR?

For example, see https://github.com/JCSDA/spack-stack/blob/release/1.5.1/.github/workflows/ubuntu-ci-x86_64.yaml for setting up the environment?

You could try to pull the containers we create for JEDI CI, they should have all the dependencies you need (but I agree that making this or any other solution a separate PR is better)

@grantfirl
Collaborator

grantfirl commented Nov 16, 2023

@mkavulich I'm running into issues on Derecho. It apparently can't find NetCDF-fortran. You don't get this error?

CMake Error at /glade/work/grantf/ccpp-scm/CMakeModules/Modules/FindNetCDF.cmake:246 (message):
Unable to properly find NetCDF. Found static libraries at:
/glade/work/grantf/ccpp-scm/scm/src/NetCDF_Fortran_LIBRARY-NOTFOUND but
could not run nc-config:
Call Stack (most recent call first):
CMakeLists.txt:67 (find_package)

CMake Error at /glade/u/apps/derecho/23.09/spack/opt/spack/cmake/3.26.3/gcc/7.5.0/k34x/share/cmake-3.26/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
Could NOT find NetCDF (missing: Fortran) (found version "4.9.2")
Call Stack (most recent call first):
/glade/u/apps/derecho/23.09/spack/opt/spack/cmake/3.26.3/gcc/7.5.0/k34x/share/cmake-3.26/Modules/FindPackageHandleStandardArgs.cmake:600 (_FPHSA_FAILURE_MESSAGE)
/glade/work/grantf/ccpp-scm/CMakeModules/Modules/FindNetCDF.cmake:312 (find_package_handle_standard_args)
CMakeLists.txt:67 (find_package)

I see that the Hera module files have:
load("netcdf-c/4.9.2")
load("netcdf-fortran/4.6.0")

but the Derecho ones do not. Is there a reason?

@grantfirl
Collaborator

@mkavulich FYI, if I add the netCDF load commands to the Derecho lua files, everything works fine for me.

Collaborator

@grantfirl grantfirl left a comment

I'm fine with everything except NetCDF on Derecho (see comments).

scm/etc/modules/derecho_intel.lua (review thread resolved)
@mkavulich
Collaborator Author

@grantfirl Thanks for testing this out, I have had some testing frustrations because module purge does not seem to actually fully purge my environment when running different tests (or maybe there's some cmake caching going on that I don't understand? I'd say 50/50 a platform error or user error). I fully logged out and logged back in and started with a fresh clone and purged environment, and was able to replicate your issue. I pushed those changes for the Derecho Intel and GNU modulefiles, and also added a default value for SCM_ROOT per our other conversation.

@climbfuji
Collaborator

@grantfirl Thanks for testing this out, I have had some testing frustrations because module purge does not seem to actually fully purge my environment when running different tests (or maybe there's some cmake caching going on that I don't understand? I'd say 50/50 a platform error or user error). I fully logged out and logged back in and started with a fresh clone and purged environment, and was able to replicate your issue. I pushed those changes for the Derecho Intel and GNU modulefiles, and also added a default value for SCM_ROOT per our other conversation.

To me this is a bit suspicious. The spack py-netcdf4 does depend on netcdf-c, so if that module doesn't get loaded automatically when py-netcdf4 is loaded, then something is off. Loading netcdf-fortran should also automatically load netcdf-c if it isn't loaded yet. But of course it doesn't harm to list the version explicitly.

scm/etc/modules/hera_gnu.lua (outdated, review thread resolved)
scm/etc/modules/hera_intel.lua (outdated, review thread resolved)
@mkavulich
Collaborator Author

mkavulich commented Nov 17, 2023

To me this is a bit suspicious. The spack py-netcdf4 does depend on netcdf-c, so if that module doesn't get loaded automatically when py-netcdf4 is loaded, then something is off. Loading netcdf-fortran should also automatically load netcdf-c if it isn't loaded yet. But of course it doesn't harm to list the version explicitly.

@climbfuji I agree that it is suspicious that this issue is occurring. There appears to be something going on with different hdf5 versions compared to the system default. When you don't run a module purge first, this is the result of loading the current derecho_gnu.lua module:

> module load derecho_gnu

Lmod is automatically replacing "intel/2023.0.0" with "gcc/12.2.0".

Lmod Warning: 
------------------------------------------------------------------------------------------------------
The following dependent module(s) are not currently loaded: hdf5/1.14.0 (required by:
py-netcdf4/1.5.8, netcdf-c/4.9.2)
------------------------------------------------------------------------------------------------------




Due to MODULEPATH changes, the following have been reloaded:
  1) cray-mpich/8.1.25     2) craype/2.7.20     3) hdf5/1.12.2     4) ncarcompilers/1.0.0     5) netcdf/4.9.2

The following have been reloaded with a version change:
  1) ncarenv/23.06 => ncarenv/23.09

Running module purge prior to loading makes the load go much more smoothly:

> module load derecho_intel

The following have been reloaded with a version change:
  1) ncarenv/23.06 => ncarenv/23.09

Now, both of those do work, but maybe the warning gives some hint as to why those netcdf modules need to be explicitly loaded.

@climbfuji
Collaborator

You have to follow exactly the steps in https://spack-stack.readthedocs.io/en/latest/PreConfiguredSites.html#ncar-wyoming-derecho unless you want to set yourself up for trouble:

module purge
# ignore that the sticky module ncarenv/... is not unloaded
export LMOD_TMOD_FIND_FIRST=yes
module load ncarenv/23.09
module use /glade/work/epicufsrt/contrib/spack-stack/derecho/modulefiles
module load ecflow/5.8.4
module load mysql/8.0.33

@mkavulich
Collaborator Author

@climbfuji so I guess that means module purge is required for using spack-stack?

I have omitted ecflow and mysql because we don't use those applications. The new modulefiles appear to be working much better (along with doing a purge first); I pushed the updated Derecho files, and I'll make and test those changes for Hera later. @grantfirl can you try again with the latest files on Derecho (remembering to module purge first)?

@climbfuji
Collaborator

@climbfuji so I guess that means module purge is required for using spack-stack?

I have omitted ecflow and mysql because we don't use those applications. The new modulefiles appear to be working much better (along with doing a purge first); I pushed the updated Derecho files, and I'll make and test those changes for Hera later. @grantfirl can you try again with the latest files on Derecho (remembering to module purge first)?

Yes - module purge comes first.
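
A quick way to check whether an environment really is clean before loading the SCM modulefile is to inspect LOADEDMODULES, the colon-separated list of loaded modules that Lmod keeps in the environment. The sketch below is only an illustration: the function name is hypothetical, and treating ncarenv as the one acceptable leftover is an assumption based on the sticky-module note in the Derecho instructions above.

```python
import os


def unexpected_modules(environ=None, allowed=("ncarenv",)):
    """List loaded modules whose name is not in `allowed`.

    Reads LOADEDMODULES, the colon-separated list kept in the environment.
    """
    environ = os.environ if environ is None else environ
    loaded = [m for m in environ.get("LOADEDMODULES", "").split(":") if m]
    # Compare on the module name only, ignoring the "/version" suffix.
    return [m for m in loaded if m.split("/")[0] not in allowed]


if __name__ == "__main__":
    leftovers = unexpected_modules()
    if leftovers:
        print("Modules still loaded; consider running module purge first:")
        for name in leftovers:
            print(f"  {name}")
```

An empty result after module purge suggests that any remaining build trouble comes from somewhere else, such as a stale CMake cache.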

@mkavulich mkavulich changed the title Add module files for building SCM on Derecho, Hera + spack-stack Add module files for building SCM with spack-stack on Derecho, Hera, Jet, Orion Nov 21, 2023
@mkavulich
Collaborator Author

@climbfuji @grantfirl I am still waiting on help installing LaTeX tools for updating the users guide, but aside from that I think this PR is ready for re-review. I also added modulefiles for Jet and Orion while I was at it since it was simple to add based on the spack-stack instructions Dom sent (I don't have access to any of the other machines).

@DomHeinzeller
Contributor

DomHeinzeller commented Nov 28, 2023 via email

@grantfirl
Collaborator

@mkavulich Here is the PDF of the updated docs if you want to include it in this PR:
main.pdf

@grantfirl
Collaborator

@mkavulich I'd like to re-test this on Hera and Derecho so that we can maybe get this merged today.

@grantfirl
Collaborator

@mkavulich Can you merge in the latest NCAR/main commit: 1d8894f

grantfirl
grantfirl previously approved these changes Dec 1, 2023
@grantfirl grantfirl self-requested a review December 1, 2023 18:07
@grantfirl grantfirl dismissed their stale review December 1, 2023 18:08

Accidentally approved

@grantfirl
Collaborator

Everything works with Intel/GNU on Hera/Derecho. @mkavulich I'll approve/merge once this is updated to the latest NCAR/main commit.

@mkavulich
Collaborator Author

@grantfirl The branch should now be updated, and I tested one more time on Derecho with Intel. I think it's ready to go 👍

@grantfirl grantfirl merged commit b27c55d into NCAR:main Dec 2, 2023
17 checks passed
@climbfuji
Collaborator

Yay! Welcome to the spack-stack user community :-)
