Initial setup of Payu environment #1

Open
jo-basevi wants to merge 13 commits into main

Conversation

@jo-basevi (Collaborator) commented Sep 26, 2024

This PR contains some initial work setting up containerised squashfs conda environments for payu, using the work done by Dale for hh5's containerised analysis conda environments (https://github.com/coecms/cms-conda-singularity). As a quick overview, this PR:

  • Adds payu and payu-dev environment configuration (payu-dev is similar, except payu is installed with pip from the main payu branch on GitHub).
  • Replaces existing ssh-related third-party actions in workflows with ACCESS-NRI/actions.
  • Moves some project-specific configuration out of the scripts into GitHub variables.

What's nice about the cms-conda-singularity scripts is that they already work out of the box for building Python virtual environments on top of the squashfs conda environments. This is useful for the Repro CI tests run using payu in model-config-tests (https://github.com/ACCESS-NRI/model-config-tests/). So I was able to use virtual environments to run the reproducibility tests for an ACCESS-OM2 configuration (tag: release-1deg_jra55_ryf-2.0) and an ACCESS-ESM1.5 configuration (tag: release-historical+concentrations-1.1), using payu and payu-dev as the base conda environments, and everything passed.

I've manually run the scripts in the workflows for building, testing and deploying environments (the latest installs are in /g/data/tm70/jb4202/tmp-conda/). I am holding off on running any CI deploy-to-Gadi workflows until installation paths and variables are finalised.

Notes:

  • Base installation paths need to be in /g/data/:
    Initially, I was running into errors when manually running the build scripts: directories did not exist and the squashfs image was not being set up correctly. The cause was that I was using /scratch directories as the base directories (e.g. CONDA_BASE) rather than /g/data/ - the build scripts assume the base directories, where the environments will eventually be deployed to, start with /g. A simple guard could make this assumption explicit (see the sketch below).
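
    For example, a guard at the top of the build scripts could fail early when the path is wrong (a suggestion only, not something the scripts currently do):

      # suggestion only: fail early if the deployment base directory is not under /g
      if [[ "${CONDA_BASE}" != /g/* ]]; then
          echo "Error: CONDA_BASE must be a /g/data path (got: ${CONDA_BASE})" >&2
          exit 1
      fi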

  • Pip installed packages:
    Existing payu development environments install payu from the main branch. Pip-installed packages had incorrect shebang headers pointing to a directory on /jobfs/ where the environment was initially built. There is already an issue for this: Issue with deployment of pip installed python packages with command line tools MED-condaenv#78. I used Romain's solution here: https://github.com/ACCESS-NRI/MED-condaenv/blob/2c0f730b54cfa6a19b6df4300f8dd27cf3b877d0/environments/esmvaltool/build_inner.sh#L9
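
    Roughly, the fix rewrites the shebang of each pip-installed entry point so it no longer points at the /jobfs build path; a hedged sketch of the idea (paths are placeholders - see the linked build_inner.sh for the actual fix):

      # illustration only: point entry-point shebangs at the environment's final python
      ENV_BIN="/opt/conda/payu-1.1.5/bin"   # placeholder for the installed env's bin/
      for f in "${ENV_BIN}"/*; do
          # only touch scripts whose shebang still references the /jobfs build path
          if head -n 1 "$f" | grep -q '^#!.*/jobfs/.*python'; then
              sed -i "1s|^#!.*python.*$|#!${ENV_BIN}/python|" "$f"
          fi
      done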

  • Payu PBS qsub calls:
    When running the command payu run, payu submits jobs with something like qsub -- path/to/env/python path/to/env/payu-run. This path/to/env/python points to a Python executable that is only accessible inside the container. Each of the environment commands in the container has a corresponding script outside the container (a symlink to launcher.sh) that launches the container and then runs the command inside it.

    When testing the conda_concept/analysis modules in /g/data/hh5/public/modules/, I noticed that running the launcher python script with a payu command gives a sys.executable that points back to the launcher python script. So running /g/data/hh5/public/apps/cms_conda_scripts/analysis3-24.04.d/bin/python /g/data/hh5/public/apps/cms_conda/envs/analysis3-24.04/bin/payu run passes the launcher python script along to subsequent payu qsub submits. As a somewhat hacky fix, I modified the Python shebang for the payu command to use the outside Python launcher script. (Why does sys.executable point to the Python launcher script? I think it is because launcher.sh preserves the original argv[0] by using exec -a, e.g. exec -a /path/to/outside/python /path/to/inner-env/python /path/to/inner-env/payu-run - see the sketch below.)
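
    A minimal sketch of that mechanism (simplified for illustration; the paths are placeholders and this is not the actual launcher.sh):

      # illustration only - not the real launcher.sh
      # OUTSIDE_CMD is the wrapper symlink the user invoked,
      # e.g. .../conda_scripts/payu-1.1.5.d/bin/python
      OUTSIDE_CMD="$0"
      # hypothetical path of the matching binary inside the container
      INNER_CMD="/opt/conda/payu-1.1.5/bin/$(basename "$0")"

      # Inside the container: exec the inner binary but keep the outside path
      # as argv[0]. Python derives sys.executable from argv[0], so it reports
      # the launcher wrapper rather than the container-internal python.
      exec -a "${OUTSIDE_CMD}" "${INNER_CMD}" "$@"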

    An alternative solution to the above would be to modify the payu source code to add the launcher script to the qsub commands. E.g.

    • Check if running inside a container (e.g. whether a SINGULARITY environment variable such as `SINGULARITY_ENVIRONMENT` or `SINGULARITY_NAME` is set)
    • Check if there is a `LAUNCHER_SCRIPT` environment variable (this might be specific to these environments)
    • Run `qsub -- $LAUNCHER_SCRIPT path/to/env/bin/python path/to/env/bin/payu-run`
    

    This approach hard-codes a custom environment variable into payu, though it might make it easier for others to run payu inside a container, as they would only need the LAUNCHER_SCRIPT environment variable to be defined. However, I am not sure how to guarantee that this variable points to the correct script, i.e. the one that launches the container containing the payu environment.
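
    A rough sketch of that logic (shown in shell for illustration; payu itself would do the equivalent check in Python, and the variable names are only examples):

      # illustration only
      if [[ -n "${SINGULARITY_NAME:-}${SINGULARITY_ENVIRONMENT:-}" && -n "${LAUNCHER_SCRIPT:-}" ]]; then
          # running inside the container with a launcher defined: prepend it so
          # the resubmitted job starts the container before running payu-run
          qsub -- "${LAUNCHER_SCRIPT}" path/to/env/bin/python path/to/env/bin/payu-run
      else
          qsub -- path/to/env/bin/python path/to/env/bin/payu-run
      fi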

    After chatting with Aidan, another solution would become available if (when) payu ends up using hpcpy (https://github.com/ACCESS-NRI/hpcpy) and has a templated script for its qsub calls. The build scripts in this repository could then modify that template to add in the launcher script. There are also existing override command scripts in this repository, so there is probably another solution to this problem there. In the meantime, while I am testing, I'm using the modified shebang header for payu commands, as it doesn't require changes to payu.

  • GitHub Environment Variables:
    @aidanheerdegen suggested moving the project-specific installation paths to GitHub, where they can be set via GitHub Environment Variables. This way, paths can be changed without modifying the source code. Initially, I moved just ADMIN_DIR (the base directory for logs and staged environment tar files) and CONDA_BASE (the base directory that will contain the apps/ and modules/ subdirectories). The paths may also impact other configuration settings, e.g. the project and storage flags passed to build qsub calls, and the groups used for configuring file permissions of the admin and deployed directories (APPS_USERS_GROUP and APPS_OWNERS_GROUP), so I moved those to GitHub Variables as well.

    Proposed GitHub Variable settings for the Gadi environment:

    • CONDA_BASE: /g/data/vk83/prerelease (the directory that contains apps/ and modules/ subdirectories)
    • ADMIN_DIR: /g/data/vk83/admin/conda_containers/prerelease (directory to store staging and log files, tar files of conda environments, and backups of old environment squashfs files)
    • APPS_USERS_GROUP: vk83 (Permissions of read and execute for files installed to apps and modules)
    • APPS_OWNERS_GROUP: vk83_w ? (Read/write/executable permissions for installed files)
    • PROJECT: tm70 (Build and test PBS jobs project)
    • STORAGE: gdata/vk83 (Build and test PBS jobs storage directives)
    • secrets.REPO_PATH: ? (the path where this repository is rsynced to and from which all the scripts are run)
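
    For reference, these proposed values would correspond to the following exports when running the scripts manually (the same pattern as the test commands further down, just with the vk83 paths; REPO_PATH is still to be decided):

      export CONDA_BASE="/g/data/vk83/prerelease"
      export ADMIN_DIR="/g/data/vk83/admin/conda_containers/prerelease"
      export APPS_USERS_GROUP="vk83"
      export APPS_OWNERS_GROUP="vk83_w"   # still marked with a '?' above
      PROJECT="tm70"
      STORAGE="gdata/vk83"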

    The above settings, install_config.sh settings, and the current conda environments would add the following to /g/data/vk83/prerelease/:

    ├── apps
    │   ├── base_conda
    │   │   ├── bin
    │   │   │   └── micromamba
    │   │   ├── envs
    │   │   │   ├── payu -> payu-1.1.5
    │   │   │   ├── payu-1.1.5 -> /opt/conda/payu-1.1.5
    │   │   │   ├── payu-1.1.5.sqsh
    │   │   │   ├── payu-dev -> /opt/conda/payu-dev
    │   │   │   ├── payu-dev.sqsh
    │   │   │   └── payu-unstable -> payu-1.1.5
    │   │   └── etc
    │   │       └── base.sif
    │   └── conda_scripts
    │       ├── launcher_conf.sh
    │       ├── launcher.sh
    │       ├── overrides
    │       │   ├── functions.sh
    │       │   ├── jupyter.config.sh
    │       │   ├── mpicc.config.sh
    │       │   ├── pbs_tmrsh.sh
    │       │   └── ssh.sh
    │       ├── payu-1.1.5.d
    │       │   ├── bin
    │       │   │   ├── ...              # launch-script symlinks omitted, e.g. payu -> launcher.sh, python3 -> launcher.sh
    │       │   │   ├── launcher_conf.sh
    │       │   │   ├── launcher.sh
    │       │   └── overrides
    │       │       ├── functions.sh -> ../../overrides/functions.sh
    │       │       ├── jupyter.config.sh -> ../../overrides/jupyter.config.sh
    │       │       ├── mpicc.config.sh -> ../../overrides/mpicc.config.sh
    │       │       ├── pbs_tmrsh.sh -> ../../overrides/pbs_tmrsh.sh
    │       │       └── ssh.sh -> ../../overrides/ssh.sh
    │       ├── payu.d -> payu-1.1.5.d
    │       ├── payu-dev.d
    │       │   ├── bin
    │       │   │   ├── ...              # launch-script symlinks omitted, e.g. payu -> launcher.sh
    │       │   │   ├── launcher_conf.sh
    │       │   │   ├── launcher.sh
    │       │   └── overrides
    │       │       ├── functions.sh -> ../../overrides/functions.sh
    │       │       ├── jupyter.config.sh -> ../../overrides/jupyter.config.sh
    │       │       ├── mpicc.config.sh -> ../../overrides/mpicc.config.sh
    │       │       ├── pbs_tmrsh.sh -> ../../overrides/pbs_tmrsh.sh
    │       │       └── ssh.sh -> ../../overrides/ssh.sh
    │       └── payu-unstable.d -> payu-1.1.5.d
    └── modules
        └── conda_container
            ├── payu-1.1.5 -> .common_v3
            └── payu-dev -> .common_v3
    

    So loading the modules would be:

    module use /g/data/vk83/prerelease/modules
    module load conda_container/payu   # or conda_container/payu-1.1.5 or conda_container/payu-dev
    

    I've named the micromamba install directory base_conda and the module directory conda_container so they do not clash with the existing conda/ directories in vk83.

Issues (TODO: split off into separate GitHub Issues):

  • Different locations for release and pre-release environments and automatic payu-dev updates (see Automatic updates to payu-dev environment #2)
  • Investigate using conda-pack environments similarly to workflows in https://github.com/ACCESS-NRI/payu-condaenv. Would using conda-pack environments simplify things or not?
  • Process for deprecating and deleting old payu environments?
  • Switched to installing an official micromamba when a pre-existing environment does not exist (see Using official Micromamba install #3)
  • To get git signing working, I removed the settings in environment/config.sh that removed "openssh-clients", "openssh-server" and "openssh" from the environment and included an outside "ssh" command. The cms documentation for the conda environments (https://climate-cms.org/cms-wiki/resources/resources-conda-setup.html#technical-details) says: "As a part of the installation process, the openssh packages are removed from the conda installation, which forces use of the system ssh and, more importantly, its configuration." So I am wondering if I will accidentally break something by removing those settings.
  • Workflows: GitHub deployment to the Gadi environment is triggered by the Setup, Build and Test jobs. As the settings for the Gadi environment require reviewers, this will require many sign-offs in a Pull Request. This is fine for the testing stage, as one can run through the logs and manually check things between each step, but it might be unnecessary later on. Could the jobs be moved into one job so that deploying to Gadi only requires one sign-off?

@jo-basevi jo-basevi marked this pull request as ready for review October 7, 2024 06:28
@CodeGat CodeGat self-requested a review October 8, 2024 22:38
@CodeGat commented Oct 10, 2024

[The deployment] will require many signoffs in a Pull Request [...] Could move jobs into one job so it only requires one sign off to deploy to Gadi?

This is what I've tried to do with build-cd - it is annoying having so much logic contained in the job, but what can ya do...

@CodeGat left a review comment

Just some comments, I'll do a fuller review later :)

environments/payu-dev/build_inner.sh (resolved review thread)
environments/payu-dev/config.sh (resolved review thread)
  NEXT_STABLE="${ENVIRONMENT}-${STABLE_VERSION}"
- CURRENT_UNSTABLE=$( get_aliased_module "${MODULE_NAME}"/analysis3-unstable "${CONDA_MODULE_PATH}" )
+ CURRENT_UNSTABLE=$( get_aliased_module "${MODULE_NAME}"/payu-unstable "${CONDA_MODULE_PATH}" )
@CodeGat:

Is this payu-dev rather than payu-unstable..?

@jo-basevi (Collaborator, Author):

Good point. I might actually remove the payu-unstable module alias, as the separate payu-dev environment and module is a way to test the payu environment.

Comment on lines +87 to +88
- name: Checkout repository
  uses: actions/checkout@v4
@CodeGat:

One of the things that I'm not too sure about when using the relative uses: ./.github/workflows/thingo.yml (as opposed to the version-specific uses: access-nri/model-release-condaenv/.github/workflows/thingo.yml@main) is that in the on.pull_request case, the checkout you do here means the version of the deploy workflow used would be the pull-request version rather than the @main version, for example.

@jo-basevi (Collaborator, Author):

I think the relative paths were useful for testing the changes in this PR. But maybe if the build and test workflows work, the workflow changes should be separated into a different PR and merged first, with the versioned @main values?

@CodeGat:

Yeah that will work :)

@jo-basevi (Collaborator, Author) commented:

Just some more quick details on how it's been tested. I manually ran all the build/test/deploy commands in the workflows, and have added the commands here for reference. The PBS logs for the build/test scripts are in $JOB_LOG_DIR (/g/data/tm70/jb4202/tmp-conda/admin/conda_containers/logs for the manual tests). In the tests I used a REPO_PATH in my home directory which contains a built base container file (container/base.sif); I've rsynced it to /g/data/tm70/jb4202/tmp-conda/model-release-condaenv for reference. To run the commands for payu-dev, use CONDA_ENVIRONMENT="payu-dev".

Setup command
bash << 'EOF'
set -e
REPO_PATH=/home/189/jb4202/model-release-condaenv
export ADMIN_DIR="/g/data/tm70/jb4202/tmp-conda/admin/conda_containers"
export CONDA_BASE="/g/data/tm70/jb4202/tmp-conda/prerelease"
export APPS_USERS_GROUP="tm70"
export APPS_OWNERS_GROUP="tm70"

source "$REPO_PATH/scripts/install_config.sh"
source "$REPO_PATH/scripts/functions.sh"
mkdir -p "${ADMIN_DIR}" "${JOB_LOG_DIR}" "${BUILD_STAGE_DIR}"
set_admin_perms "${ADMIN_DIR}" "${JOB_LOG_DIR}" "${BUILD_STAGE_DIR}"

echo "${ADMIN_DIR}" "${CONDA_BASE}" "${JOB_LOG_DIR}"
echo "Finished setup!"
EOF
Build command
bash << 'EOF'
set -e
REPO_PATH=/home/189/jb4202/model-release-condaenv
export SCRIPT_DIR="$REPO_PATH/scripts"
export CONDA_ENVIRONMENT="payu"
export ADMIN_DIR="/g/data/tm70/jb4202/tmp-conda/admin/conda_containers"
export CONDA_BASE="/g/data/tm70/jb4202/tmp-conda/prerelease"
export APPS_USERS_GROUP="tm70"
export APPS_OWNERS_GROUP="tm70"
PROJECT="tm70"
STORAGE="gdata/tm70"

source "${SCRIPT_DIR}"/install_config.sh
cd "${JOB_LOG_DIR}"

qsub -N build_"${CONDA_ENVIRONMENT}" -lncpus=1,mem=20GB,walltime=2:00:00,jobfs=50GB,storage="${STORAGE}" \
           -v SCRIPT_DIR,CONDA_ENVIRONMENT,ADMIN_DIR,CONDA_BASE,APPS_USERS_GROUP,APPS_OWNERS_GROUP \
           -P "${PROJECT}" -q copyq -Wblock=true -Wumask=037 \
           "${SCRIPT_DIR}"/build.sh

echo "Finished Build!"
EOF
Test command
bash << 'EOF'
set -e
REPO_PATH=/home/189/jb4202/model-release-condaenv
export SCRIPT_DIR="$REPO_PATH/scripts"
export CONDA_ENVIRONMENT="payu"
export ADMIN_DIR="/g/data/tm70/jb4202/tmp-conda/admin/conda_containers"
export CONDA_BASE="/g/data/tm70/jb4202/tmp-conda/prerelease"
export APPS_USERS_GROUP="tm70"
export APPS_OWNERS_GROUP="tm70"
PROJECT="tm70"
STORAGE="gdata/tm70"

source "${SCRIPT_DIR}"/install_config.sh
cd "${JOB_LOG_DIR}"

qsub -N test_"${CONDA_ENVIRONMENT}" -lncpus=4,mem=20GB,walltime=0:20:00,jobfs=50GB,storage="${STORAGE}" \
           -v SCRIPT_DIR,CONDA_ENVIRONMENT,ADMIN_DIR,CONDA_BASE,APPS_USERS_GROUP,APPS_OWNERS_GROUP \
           -P "${PROJECT}" -Wblock=true -Wumask=037 \
           "${SCRIPT_DIR}"/test.sh

echo "Finished Test!"
EOF
Deploy command
bash << 'EOF'
set -e
REPO_PATH=/home/189/jb4202/model-release-condaenv
export SCRIPT_DIR="$REPO_PATH/scripts"
export CONDA_ENVIRONMENT="payu"
export ADMIN_DIR="/g/data/tm70/jb4202/tmp-conda/admin/conda_containers"
export CONDA_BASE="/g/data/tm70/jb4202/tmp-conda/prerelease"
export APPS_USERS_GROUP="tm70"
export APPS_OWNERS_GROUP="tm70"

source "${SCRIPT_DIR}"/install_config.sh

"${SCRIPT_DIR}"/deploy.sh

echo "Finished Deploy!"
EOF

Once everything was deployed, I tested the modules by manually running the configuration repro tests (instructions here: https://github.com/ACCESS-NRI/model-config-tests/?tab=readme-ov-file#how-to-run-pytests-manually-on-nci) with module load conda/payu-1.1.5. (I tested an ACCESS-OM2 configuration (tag: release-1deg_jra55_ryf-2.0) and an ACCESS-ESM1.5 configuration (tag: release-historical+concentrations-1.1).) I also tested that the payu commands run fine when run directly, and did similar testing for the payu-dev environment.

With the workflows, in pull_request.yml, I've tested the build_base_image job, which builds the container .sif and uploads/downloads the artefact, on a private test repository. What has not really been tested is whether the GitHub vars are all correctly set and used. A test organisation probably wouldn't be a bad idea for checking those.

One thing that should be edited if deploying to Gadi is secrets.REPO_PATH, which should probably be some temporary directory in the CI user's home directory or scratch (if scratch, the storage flags (vars.STORAGE) for the PBS jobs might need to include it).
