Initial setup of Payu environment #1
base: main
Conversation
The branch was force-pushed from 06b8986 to 58da917, then from 58da917 to 74a8e02. Commit messages (truncated):

- …d payu environment
- …ctions
  - workflows/pull_requent.yml: Split up setup and move build base image to separate job
  - workflows/get_changed_env.yml: Remove deleted environment from matrix
  - Update workflows to source environment variables from install_config.sh
- …se launcher script
- … base: Removed the modified micromamba, as at this stage we might not need compatibility with nb_conda_kernals

The branch was later force-pushed from d017414 to 5580a84.
This is what I've tried to do with
Just some comments, I'll do a more full review later :)
`environments/payu/deploy.sh` (outdated):

```diff
 NEXT_STABLE="${ENVIRONMENT}-${STABLE_VERSION}"
-CURRENT_UNSTABLE=$( get_aliased_module "${MODULE_NAME}"/analysis3-unstable "${CONDA_MODULE_PATH}" )
+CURRENT_UNSTABLE=$( get_aliased_module "${MODULE_NAME}"/payu-unstable "${CONDA_MODULE_PATH}" )
```
Is this `payu-dev` rather than `payu-unstable`..?
Good point. I might actually remove the `payu-unstable` module alias, as the separate `payu-dev` environment and module is a way to test the payu environment.
```yaml
- name: Checkout repository
  uses: actions/checkout@v4
```
One of the things that I'm not too sure about when using the relative `uses: ./.github/workflows/thingo.yml`, as opposed to the version-specific `uses: access-nri/model-release-condaenv/.github/workflows/thingo.yml@main`, is that in the case where it is `on.pull_request`, the checkout that you do here means that the version of the deploy workflow that you use here would be the pull request version, rather than the `@main` version, for example.
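For illustration, the two reference styles being compared (workflow filename as in the comment above; both lines are sketches, not copied from the repo):

```yaml
# Relative: runs the workflow file from the checked-out ref - on a
# pull_request event, that is the PR branch's copy of the file.
uses: ./.github/workflows/thingo.yml

# Pinned: always runs the copy of the workflow as it exists on main.
uses: access-nri/model-release-condaenv/.github/workflows/thingo.yml@main
```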
I think the relative paths were useful to test the changes in this PR. But maybe, if the `build` and `test` workflows work, the workflow changes should be separated into a different PR and merged first, with the versioned `@main` values?
Yeah that will work :)
Just some more quick details on how it's been tested. I manually ran all the commands below.

Setup command:

```shell
bash << 'EOF'
set -e
REPO_PATH=/home/189/jb4202/model-release-condaenv
export ADMIN_DIR="/g/data/tm70/jb4202/tmp-conda/admin/conda_containers"
export CONDA_BASE="/g/data/tm70/jb4202/tmp-conda/prerelease"
export APPS_USERS_GROUP="tm70"
export APPS_OWNERS_GROUP="tm70"
source "$REPO_PATH/scripts/install_config.sh"
source "$REPO_PATH/scripts/functions.sh"
mkdir -p "${ADMIN_DIR}" "${JOB_LOG_DIR}" "${BUILD_STAGE_DIR}"
set_admin_perms "${ADMIN_DIR}" "${JOB_LOG_DIR}" "${BUILD_STAGE_DIR}"
echo "${ADMIN_DIR}" "${CONDA_BASE}" "${JOB_LOG_DIR}"
echo "Finished setup!"
EOF
```

Build command:

```shell
bash << 'EOF'
set -e
REPO_PATH=/home/189/jb4202/model-release-condaenv
export SCRIPT_DIR="$REPO_PATH/scripts"
export CONDA_ENVIRONMENT="payu"
export ADMIN_DIR="/g/data/tm70/jb4202/tmp-conda/admin/conda_containers"
export CONDA_BASE="/g/data/tm70/jb4202/tmp-conda/prerelease"
export APPS_USERS_GROUP="tm70"
export APPS_OWNERS_GROUP="tm70"
PROJECT="tm70"
STORAGE="gdata/tm70"
source "${SCRIPT_DIR}"/install_config.sh
cd "${JOB_LOG_DIR}"
qsub -N build_"${CONDA_ENVIRONMENT}" -lncpus=1,mem=20GB,walltime=2:00:00,jobfs=50GB,storage="${STORAGE}" \
-v SCRIPT_DIR,CONDA_ENVIRONMENT,ADMIN_DIR,CONDA_BASE,APPS_USERS_GROUP,APPS_OWNERS_GROUP \
-P "${PROJECT}" -q copyq -Wblock=true -Wumask=037 \
"${SCRIPT_DIR}"/build.sh
echo "Finished Build!"
EOF
```

Test command:

```shell
bash << 'EOF'
set -e
REPO_PATH=/home/189/jb4202/model-release-condaenv
export SCRIPT_DIR="$REPO_PATH/scripts"
export CONDA_ENVIRONMENT="payu"
export ADMIN_DIR="/g/data/tm70/jb4202/tmp-conda/admin/conda_containers"
export CONDA_BASE="/g/data/tm70/jb4202/tmp-conda/prerelease"
export APPS_USERS_GROUP="tm70"
export APPS_OWNERS_GROUP="tm70"
PROJECT="tm70"
STORAGE="gdata/tm70"
source "${SCRIPT_DIR}"/install_config.sh
cd "${JOB_LOG_DIR}"
qsub -N test_"${CONDA_ENVIRONMENT}" -lncpus=4,mem=20GB,walltime=0:20:00,jobfs=50GB,storage="${STORAGE}" \
-v SCRIPT_DIR,CONDA_ENVIRONMENT,ADMIN_DIR,CONDA_BASE,APPS_USERS_GROUP,APPS_OWNERS_GROUP \
-P "${PROJECT}" -Wblock=true -Wumask=037 \
"${SCRIPT_DIR}"/test.sh
echo "Finished Test!"
EOF
```

Deploy command:

```shell
bash << 'EOF'
set -e
REPO_PATH=/home/189/jb4202/model-release-condaenv
export SCRIPT_DIR="$REPO_PATH/scripts"
export CONDA_ENVIRONMENT="payu"
export ADMIN_DIR="/g/data/tm70/jb4202/tmp-conda/admin/conda_containers"
export CONDA_BASE="/g/data/tm70/jb4202/tmp-conda/prerelease"
export APPS_USERS_GROUP="tm70"
export APPS_OWNERS_GROUP="tm70"
source "${SCRIPT_DIR}"/install_config.sh
"${SCRIPT_DIR}"/deploy.sh
echo "Finished Deploy!"
EOF
```

Once everything was deployed, I tested the modules by manually running the configuration repro tests (instructions here: https://github.com/ACCESS-NRI/model-config-tests/?tab=readme-ov-file#how-to-run-pytests-manually-on-nci), with `module load`

With the workflows, in

One thing that should be edited if deployed to Gadi, should be the
This PR has some initial work setting up containerized `squashfs` conda environments for payu, using the work done by Dale for the `hh5` analysis conda containerised environments (https://github.com/coecms/cms-conda-singularity). As a quick overview, this PR adds a `payu` environment (installed from the main payu branch on Github).

What's cool with the `cms-conda-singularity` scripts is that they already work out of the box for building Python virtual environments on top of the squashfs conda environments. This is useful for the Repro CI tests run using payu in `model-config-tests` (https://github.com/ACCESS-NRI/model-config-tests/). So I was able to use virtual environments to run the reproducibility tests for an ACCESS-OM2 configuration (tag: `release-1deg_jra55_ryf-2.0`) and an ACCESS-ESM1.5 configuration (tag: `release-historical+concentrations-1.1`) using payu and payu-dev as the base conda environments, and everything passed.

I've manually run the scripts in the workflows for building, testing and deploying environments (the latest installs are in `/g/data/tm70/jb4202/tmp-conda/`). I am holding off running any CI deployment-to-Gadi workflows until installation paths and variables are finalised.

Notes:
Base installation paths need to be in `/g/data/`:
Initially, I was running into errors when manually running the build scripts, with directories not existing and the squashfs image not being set up correctly. The reason was that I was using `/scratch` directories as base directories (e.g. `CONDA_BASE`), rather than `/g/data/` - the build scripts assume the base directories where the environments will eventually be deployed start with `/g`.
Pip installed packages:
Existing payu development environments install payu from the main branch. Pip-installed packages had incorrect shebang headers pointing to a directory on `/jobfs/` where the environment was initially built. There is already an issue for this: Issue with deployment of pip installed python packages with command line tools MED-condaenv#78. I used Romain's solution here: https://github.com/ACCESS-NRI/MED-condaenv/blob/2c0f730b54cfa6a19b6df4300f8dd27cf3b877d0/environments/esmvaltool/build_inner.sh#L9
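A minimal sketch of that kind of shebang rewrite (illustrative only, assuming a `prefix` argument for the deployed environment path; this is not the actual `build_inner.sh` code):

```shell
# Illustrative only: rewrite shebang lines that point into the /jobfs
# build location so they use the deployed environment prefix instead.
fix_shebangs() {
    local prefix="$1"   # final install prefix of the environment
    local f
    for f in "${prefix}/bin/"*; do
        [ -f "$f" ] || continue
        # Replace a first line like "#!/jobfs/<jobid>/.../bin/python"
        # with the deployed interpreter path.
        sed -i "1s|^#!.*/jobfs/.*/bin/python|#!${prefix}/bin/python|" "$f"
    done
}
```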
Payu PBS qsub calls:
Payu submits jobs similar to `qsub -- path/to/env/python path/to/env/payu-run` (when running the command `payu run`). This `path/to/env/python` would point to a Python executable only accessible inside the container. Each of the environment commands in the container has a corresponding script outside the container (a symlink to `launcher.sh`) that launches the container and then runs the command inside it. I noticed when testing the `conda_concept/analysis` modules in `/g/data/hh5/public/modules/` that running the launcher python script with a payu command would have a `sys.executable` that points back to the launcher python script. So running `/g/data/hh5/public/apps/cms_conda_scripts/analysis3-24.04.d/bin/python /g/data/hh5/public/apps/cms_conda/envs/analysis3-24.04/bin/payu run` would pass the launcher python script along to subsequent payu qsub submits. So, for a somewhat hacky fix, I modified the Python shebang for the payu command to use the outside Python launcher script. (Why does `sys.executable` point to the Python launcher script? I think because `launcher.sh` preserves the original `argv[0]` by using `exec -a`, e.g. `exec -a /path/to/outside/python /path/to/inner-env/python /path/to/inner-env/payu-run`.)

An alternative solution to the above would be to modify the payu source code to add the launcher script to the qsub commands. E.g.
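A hypothetical sketch of that idea (the function and the command-building logic are illustrative only; the real change would live in payu's Python source, with `LAUNCHER_SCRIPT` being the custom environment variable discussed below):

```shell
# Hypothetical sketch: payu would prepend a launcher script, taken from a
# LAUNCHER_SCRIPT environment variable if set, to the command it hands to
# qsub. This function only builds and prints the command for illustration.
build_payu_qsub_cmd() {
    local python_exe="$1" payu_run="$2"
    if [ -n "${LAUNCHER_SCRIPT:-}" ]; then
        # Launcher starts the container, then runs the env's python inside it.
        echo "qsub -- ${LAUNCHER_SCRIPT} ${python_exe} ${payu_run}"
    else
        echo "qsub -- ${python_exe} ${payu_run}"
    fi
}
```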
This approach hard-codes a custom environment variable into payu - though it might make it easier for others to run payu inside a container, as they will only need the `LAUNCHER_SCRIPT` environment variable to be defined. However, I am not sure how to guarantee this variable points to the correct script that launches the container containing the payu environment.

After chatting with Aidan, another solution would be if (when) payu ends up using HPCPY (https://github.com/ACCESS-NRI/hpcpy) and payu had a templated script that runs qsub calls. The build scripts in this repository could modify that template to add in the launcher script. There are also existing override command scripts in this repository, so there is probably another solution to this problem. In the meantime, while I am testing, I'm using the modified shebang header for payu commands as it doesn't require changes to payu.
Github Environment Variables:
@aidanheerdegen suggested moving the project-specific installation paths to Github, where they can be set via Github Environment Variables. This is so paths can be changed without modifying the source code. Initially, I moved just `ADMIN_DIR` (base directory for logs and staging environment tar files) and `CONDA_BASE` (base directory which will contain the `apps/` and `modules/` subdirectories). The paths may also impact other configuration settings, e.g. the project and storage flags passed to build qsub calls, and the groups used for configuring file permissions of admin and deployed directories (`APPS_USERS_GROUP` and `APPS_OWNERS_GROUP`), so I moved those to Github Variables as well.

Proposed Github Variable settings for Gadi environment:
- `CONDA_BASE`: `/g/data/vk83/prerelease` (the directory that contains the `apps/` and `modules/` subdirectories)
- `ADMIN_DIR`: `/g/data/vk83/admin/conda_containers/prerelease` (directory to store staging and log files, tar files of conda environments, and backups of old environment squashfs files)
- `APPS_USERS_GROUP`: `vk83` (read and execute permissions for files installed to apps and modules)
- `APPS_OWNERS_GROUP`: `vk83_w`? (read/write/execute permissions for installed files)
- `PROJECT`: `tm70` (project for build and test PBS jobs)
- `STORAGE`: `gdata/vk83` (storage directives for build and test PBS jobs)
- `secrets.REPO_PATH`: ? (the path that this repository is rsynced to and that all the scripts are run from)

The above settings, the `install_config.sh` settings, and the current conda environments would add the following to `/g/data/vk83/prerelease/`:

So loading the modules would be
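A rough sketch of how a deploy job could consume these settings (the job layout and step are hypothetical; the variable names are from the list above):

```yaml
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: Gadi   # gates the job on the environment's required reviewers
    env:
      CONDA_BASE: ${{ vars.CONDA_BASE }}
      ADMIN_DIR: ${{ vars.ADMIN_DIR }}
      APPS_USERS_GROUP: ${{ vars.APPS_USERS_GROUP }}
      APPS_OWNERS_GROUP: ${{ vars.APPS_OWNERS_GROUP }}
    steps:
      - run: echo "Deploying to ${CONDA_BASE} (admin files in ${ADMIN_DIR})"
```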
I've named the micromamba install directory `base_conda` and the module name `container_container` so it does not clash with existing `conda/` directories in `vk83`.

Issues: (TODO: split off into separate Github Issues)

- `payu-dev` environment #2
- `environment/config.sh` removes "openssh-clients", "openssh-server" and "openssh" from the environment, and includes an outside "ssh" command. The cms documentation for the conda environments (https://climate-cms.org/cms-wiki/resources/resources-conda-setup.html#technical-details) has: "As a part of the installation process, the openssh packages are removed from the conda installation, which forces use of the system ssh and, more importantly, its configuration." So I am wondering if I will accidentally break something by removing those.
- Setup, Build and Test jobs: as the settings for the Gadi environment require reviewers, this will require many sign-offs in a Pull Request. This is fine for the testing stage, as I can run through the logs and manually check things between each step, but it might be unnecessary later on. Could the jobs be moved into one job so it only requires one sign-off to deploy to Gadi?