Merge branch 'main' into feature/acronyms

Showing 9 changed files with 684 additions and 801 deletions.

# ALCF Users Slack Workspace

The ALCF Users Slack workspace is a platform intended for current, active ALCF users[^1] where the user community can interact, collaborate, and help one another. This workspace is a pilot program valid for 1 year.

This workspace is a platform for the user community to interact and collaborate. ALCF staff may have a limited presence and may not be available at all times to assist users with their queries. **The official mechanism for requesting support is to email [[email protected]](mailto:[email protected]) for technical issues and [[email protected]](mailto:[email protected]) for account and access-related issues.**

## Getting access

ALCF Users Slack access is automatically provided to members of INCITE, ALCC, and ESP projects. Certain DD projects (**see footnote**)[^1] can request Slack access for their teams by submitting a support ticket to [[email protected]](mailto:[email protected]).

[^1]: ALCF users working on scientific or research campaigns across various allocation types are eligible for access to the ALCF Users Slack workspace. Users participating in training events, instructional courses, or lighthouse projects are **not** eligible for Users Slack access.

## Logging into Slack

ALCF Users Slack uses ALCF credentials for access. Active ALCF users who have been granted access should log in here: [https://alcf-users.slack.com](https://alcf-users.slack.com).

## Using Slack Channels

Once you are logged into the Slack workspace, your default channels will show up in the navigation pane. All system-specific announcements will be published in the `#announcements` channel. You can browse and join existing public channels or create private channels for your project, and you can send direct messages to your collaborators. While you cannot create public channels yourself, you can email [[email protected]](mailto:[email protected]) to request one.

The ALCF Users Slack workspace should not be used to discuss, store, or operate any NDA/RSNDA, Official Use Only (OUO), or Business Sensitive information.

Argonne's **[code of conduct](https://www.alcf.anl.gov/about/code-of-conduct)** governs the use of all ALCF resources. By joining and using the workspace, you agree to these terms. Users violating Argonne's code of conduct may be removed from the workspace.

# Containers on Polaris

Polaris, equipped with NVIDIA A100 GPUs, leverages container-based workloads for seamless compatibility across NVIDIA systems. This guide provides detailed instructions on using containers on Polaris, including setup, container creation, large-scale execution, and troubleshooting common issues.

## Apptainer Setup

Polaris uses Apptainer (formerly Singularity) for container management. Request a compute node as follows:

```bash
qsub -I -A <PROJECT_NAME> -q debug -l select=1 -l walltime=01:00:00 -l filesystems=home:grand:eagle -l singularity_fakeroot=true # Debug queue for 1 hour
```

After connecting to the compute node, load Apptainer and the necessary modules:

```bash
ml use /soft/modulefiles
ml spack-pe-base/0.8.1
ml use /soft/spack/testing/0.8.1/modulefiles
ml apptainer/main
ml load e2fsprogs

export BASE_SCRATCH_DIR=/local/scratch/ # For Polaris
export APPTAINER_TMPDIR=$BASE_SCRATCH_DIR/apptainer-tmpdir
mkdir -p $APPTAINER_TMPDIR

export APPTAINER_CACHEDIR=$BASE_SCRATCH_DIR/apptainer-cachedir
mkdir -p $APPTAINER_CACHEDIR

# For internet access
export HTTP_PROXY=http://proxy.alcf.anl.gov:3128
export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128
export http_proxy=http://proxy.alcf.anl.gov:3128
export https_proxy=http://proxy.alcf.anl.gov:3128

apptainer version # should return 1.4.0-rc.1+24-g6ae1a25f2
```

Detailed Apptainer documentation is available [here](https://apptainer.org/docs/user/latest/).

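As a quick check that the setup works, you can pull and run a small public image. This is a minimal sketch; it assumes the proxy variables above are exported so the compute node can reach Docker Hub, and `alpine` is just an arbitrary small test image:

```bash
# Pull a small image from Docker Hub and print its OS release from inside the container
apptainer exec docker://alpine:latest cat /etc/os-release
```
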
## Building Containers from Docker or Argonne GitHub Container Registry

Containers can be built by:

- Creating Dockerfiles locally, publishing them to DockerHub, and then converting them to Apptainer.
- Building directly on ALCF compute nodes using Apptainer recipe files.

Since Docker requires root privileges, which users do not have on Polaris, existing Docker containers must be converted to Apptainer. To convert a Docker container to Apptainer on Polaris, use:

```bash
apptainer build --fakeroot pytorch:22.06-py3.sing docker://nvcr.io/nvidia/pytorch:22.06-py3
```

Find prebuilt NVIDIA PyTorch containers [here](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch). TensorFlow containers are [here](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow). Search the full container registry [here](https://catalog.ngc.nvidia.com/containers). Custom containers tailored for Polaris are available in [ALCF's GitHub container registry](https://github.com/argonne-lcf/container-registry/tree/main).

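Images from the ALCF registry can be pulled directly with the `oras` protocol. A hedged sketch, assuming the `tf2-mpich-nvidia-gpu:latest` image referenced in the recipe note later in this guide is still published under that name:

```bash
# Pull a prebuilt TensorFlow image from ALCF's GitHub container registry into a local SIF file
apptainer pull tf2-mpich-nvidia-gpu.sif oras://ghcr.io/argonne-lcf/tf2-mpich-nvidia-gpu:latest
```
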
> **Note:** Container build and execution are currently only supported on Polaris compute nodes.

## Running Containers on Polaris

To run a container on Polaris, use the submission script described [here](https://github.com/argonne-lcf/container-registry/blob/main/containers/mpi/Polaris/job_submission.sh). The job submission script is explained step by step below.

```bash
#!/bin/sh
#PBS -l filesystems=home:eagle
#PBS -A <project_name>
cd ${PBS_O_WORKDIR}
echo $CONTAINER
```

We move to the current working directory, load Apptainer and its dependencies, set up Apptainer's temporary and cache directories on local scratch, and enable network access at runtime by setting the proxy variables.

```bash
ml use /soft/modulefiles
ml spack-pe-base/0.8.1
ml use /soft/spack/testing/0.8.1/modulefiles
ml apptainer/main
ml load e2fsprogs

export BASE_SCRATCH_DIR=/local/scratch/
export APPTAINER_TMPDIR=$BASE_SCRATCH_DIR/apptainer-tmpdir
mkdir -p $APPTAINER_TMPDIR
export APPTAINER_CACHEDIR=$BASE_SCRATCH_DIR/apptainer-cachedir
mkdir -p $APPTAINER_CACHEDIR

# Proxy setup for internet access
export HTTP_PROXY=http://proxy.alcf.anl.gov:3128
export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128
export http_proxy=http://proxy.alcf.anl.gov:3128
export https_proxy=http://proxy.alcf.anl.gov:3128
```

These settings are important for the system MPICH (Cray MPICH on Polaris) to bind to the container's MPICH. Set the following environment variables:

```bash
# Environment variables for MPI
export ADDITIONAL_PATH=/opt/cray/pe/pals/1.2.12/lib
module load cray-mpich-abi
export APPTAINERENV_LD_LIBRARY_PATH="$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH:$ADDITIONAL_PATH"
```

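To confirm that the host MPI libraries are actually visible inside the container, a quick hedged check (assuming `$CONTAINER` is the example MPICH image and `/usr/source/mpi_hello_world` is the binary it ships, as in the launch commands below) is to inspect the binary's shared-library resolution:

```bash
# Show which MPI library the containerized binary resolves to after the bind mounts
apptainer exec -B /opt -B /var/run/palsd/ $CONTAINER ldd /usr/source/mpi_hello_world | grep -i mpi
```
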
Set the number of ranks per node according to your scaling requirements:

```bash
# Set MPI ranks: 16 ranks per node, spread evenly across cores
NODES=$(wc -l < $PBS_NODEFILE)
PPN=16
PROCS=$((NODES * PPN))
echo "NUM_OF_NODES=${NODES}, TOTAL_NUM_RANKS=${PROCS}, RANKS_PER_NODE=${PPN}"

# Launch the C++ MPI example inside the container
mpiexec -hostfile $PBS_NODEFILE -n $PROCS -ppn $PPN apptainer exec --fakeroot -B /opt -B /var/run/palsd/ $CONTAINER /usr/source/mpi_hello_world

# Launch the Python MPI example inside the container
mpiexec -hostfile $PBS_NODEFILE -n $PROCS -ppn $PPN apptainer exec --fakeroot -B /opt -B /var/run/palsd/ $CONTAINER python3 /usr/source/mpi_hello_world.py
```

Submit the job using:

```bash
qsub -v CONTAINER=mpich-4_latest.sif job_submission.sh
```

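The same script can be reused with other images by changing the `CONTAINER` variable. A hedged example with the PyTorch image built earlier (you would adjust the commands inside the job script to match what that image actually provides):

```bash
# Hypothetical reuse of the same job script with the PyTorch image built above
qsub -v CONTAINER=pytorch:22.06-py3.sing job_submission.sh
```
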
<!-- --8<-- [start:commoncontainerdoc] -->

## Recipe-Based Container Building

You can also build Apptainer containers from recipe (definition) files. Instructions are available [here](https://apptainer.org/docs/user/1.2/build_a_container.html#building-containers-from-apptainer-definition-files). See [Available Containers](#available-containers) below for more recipes.

> Note: You can also build custom recipes by bootstrapping from prebuilt images. For example, the first two lines in a recipe that uses our custom TensorFlow implementation would be `Bootstrap: oras` followed by `From: ghcr.io/argonne-lcf/tf2-mpich-nvidia-gpu:latest`.

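For illustration, the following sketch writes a minimal recipe built on that idea and builds it on a compute node. The file name `my_tf2.def` and the extra `pip install` line are assumptions for the example, not an official ALCF recipe:

```bash
# Write a minimal definition file that bootstraps from the prebuilt ALCF TensorFlow image
# (the extra Python package is just an assumed example of customization)
cat > my_tf2.def <<'EOF'
Bootstrap: oras
From: ghcr.io/argonne-lcf/tf2-mpich-nvidia-gpu:latest

%post
    pip install --no-cache-dir scikit-learn
EOF

# Build the image from the recipe (run on a compute node with the modules loaded as above)
apptainer build --fakeroot my_tf2.sif my_tf2.def
```
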
## Available Containers

- Examples for running MPICH containers can be found [here](https://github.com/argonne-lcf/container-registry/tree/main/containers/mpi/Polaris).
- Examples for running databases can be found [here](https://github.com/argonne-lcf/container-registry/tree/main/containers/databases).
- Examples for using SHPC (running containers as modules) can be found [here](https://github.com/argonne-lcf/container-registry/tree/main/containers/shpc).

The latest containers are updated periodically. If you have trouble using containers or need a newer or different container, please contact ALCF support at `[email protected]`.

## Troubleshooting Common Issues

**Permission Denied Error**: If you encounter permission errors during the build:

* Check your quota and delete any unnecessary files.
* Clean up the Apptainer cache (`~/.apptainer/cache`) and set the Apptainer tmp and cache directories as shown below. If your home directory is full and you are building your container on a compute node, set the tmpdir and cachedir to local scratch:

```bash
export BASE_SCRATCH_DIR=/local/scratch/  # for Polaris
#export BASE_SCRATCH_DIR=/raid/scratch/  # for Sophia
export APPTAINER_TMPDIR=$BASE_SCRATCH_DIR/apptainer-tmpdir
mkdir -p $APPTAINER_TMPDIR
export APPTAINER_CACHEDIR=$BASE_SCRATCH_DIR/apptainer-cachedir
mkdir -p $APPTAINER_CACHEDIR
```

* Make sure you are not in a directory accessed through a symbolic link, i.e., check that `pwd` and `pwd -P` return the same path.
* If none of the above works, try running the build in your home directory.

**Mapping to rank 0 on all nodes**: Ensure that the container's MPI aligns with the system MPI. For example, follow the additional steps outlined in the [container registry documentation for MPI on Polaris](https://github.com/argonne-lcf/container-registry/tree/main/containers/mpi/Polaris).

**libmpi.so.40 not found**: This can happen if the container's application depends on OpenMPI, which is not currently supported on Polaris. It can also occur if the container's base image is not Debian-based (such as Ubuntu). Ensure the application has an MPICH implementation as well. Also, try removing the `.conda/`, `.cache/`, and `.local/` folders from your home directory and rebuilding the container.

**Disabled port mapping, user namespaces, and network virtualization**: [Network virtualization](https://apptainer.org/docs/user/main/networking.html) is disabled for containers due to security constraints. See issue [#2553](https://github.com/apptainer/apptainer/issues/2553).

**Starter-suid error**: Always use the `--fakeroot` flag when building and running containers on Polaris compute nodes.

!!! bug "Apptainer instance errors with version 1.3.2"

    Use `nohup` and `&` as an alternative if you want to run Apptainer as a background process. See below for an example of running Postgres as a background process:

    ```bash linenums="1"
    # 1) Start Postgres in the background inside the container
    nohup apptainer run \
        -B pgrun:/var/run/postgresql \
        -B pgdata:/var/lib/postgresql/data \
        --env-file pg.env \
        postgres.sing postgres &

    # 2) Capture its PID so we can kill it later
    echo $! > postgres_pid.txt
    echo "Started Postgres in the background with PID $(cat postgres_pid.txt)"

    # 3) Perform whatever work you need while Postgres is running.
    # In this demo, we just sleep for 30 minutes (1800 seconds).
    sleep 1800

    # 4) Kill the background process at the end of the job
    kill "$(cat postgres_pid.txt)"
    rm postgres_pid.txt
    ```

For further assistance, contact ALCF support: `[email protected]`.

<!-- --8<-- [end:commoncontainerdoc] -->