Merge branch 'main' into feature/acronyms
felker authored Mar 6, 2025
2 parents 971c1d0 + 0cb0e38 commit 5b6fbc6
Showing 9 changed files with 684 additions and 801 deletions.
@@ -6,19 +6,19 @@ PIs of INCITE, ALCC, and ADSP projects are required to complete quarterly report

## Due dates

### Due dates for the 2024 INCITE quarterly, EOY, and the EOP reports:
### Due dates for the 2025 INCITE quarterly, EOY, and the EOP reports:

- April 1, 2024 (CY2024 - Q1)
- July 1, 2024 (CY2024 - Q2)
- October 1, 2024 (CY2024 - Q3)
- January 1, 2025 (CY2025 - EOY) or February 15, 2025 (entire allocation period - EOP)
- April 1, 2025 (CY2025 - Q1)
- July 1, 2025 (CY2025 - Q2)
- October 1, 2025 (CY2025 - Q3)
- January 1, 2026 (CY2025 - EOY) or February 15, 2026 (entire allocation period - EOP)

### Due dates for the 2023-2024 ALCC quarterly and the EOP reports:

- October 1, 2023 (CY2023 - Q3)
- January 1, 2024 (CY2024 - Q4)
- April 1, 2024 (CY2024 - Q1)
- August 15, 2024 (CY2024 - EOP)
- October 1, 2024 (CY2024 - Q3)
- January 1, 2025 (CY2025 - Q4)
- April 1, 2025 (CY2025 - Q1)
- August 15, 2025 (CY2025 - EOP)

## Penalties

4 changes: 2 additions & 2 deletions docs/aurora/programming-models/hip-aurora.md
@@ -1,6 +1,6 @@
# HIP on Aurora

Applications which use the HIP programming model are supported on Aurora via [chipStar](https://github.com/CHIP-SPV/chipStar).
ALCF experimentally supports applications which use the HIP programming model via [chipStar](https://github.com/CHIP-SPV/chipStar).

## Example

@@ -77,4 +77,4 @@ int main(void)
Max error: 0.000000
```

There are additional details in the [chipStar user documentation](https://github.com/CHIP-SPV/chipStar/blob/main/docs/Using.md).
There are additional details in the [chipStar user documentation](https://github.com/CHIP-SPV/chipStar/blob/main/docs/Using.md).
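For orientation, the example above can be compiled with chipStar's `hipcc` wrapper once the appropriate modules are loaded; the commands below are a hedged sketch (the source file name is a placeholder and no Aurora-specific module names are shown), not official build instructions:

```bash
# Hypothetical sketch: compile and run a HIP source with chipStar's hipcc wrapper
hipcc vector_add.hip -o vector_add
./vector_add   # the example above reports "Max error: 0.000000" on success
```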
34 changes: 0 additions & 34 deletions docs/aurora/services/gitlab-ci.md
@@ -2,37 +2,3 @@

### As of February 25, 2025, no changes from the [general documentation](https://docs.alcf.anl.gov/services/gitlab-ci/) for GitLab-CI are required for Aurora.


### The information below is *only* for users whose projects were created prior to that date and still use `gitlab-sunspot.alcf.anl.gov`:


Currently, [https://gitlab-sunspot.alcf.anl.gov](https://gitlab-sunspot.alcf.anl.gov) must be accessed via a proxy.


The following command will connect to an Aurora login node from your local system and establish the required proxy:[^1]
```bash linenums="1"
ssh aurora.alcf.anl.gov -D localhost:25565
```

## Instructions for Firefox Browser

In order to use the proxy, you must configure your local web browser to use a SOCKS proxy. The instructions for other browsers are similar.

1. Open Firefox settings
2. Navigate to "General" > "Network Settings" > "Settings"
<small>(at the bottom of the General settings page.)</small>
3. Ensure "Manual proxy configuration" is selected
4. Fill the "SOCKS Host" field with `localhost`
5. Fill the associated port field with `25565` (or the alternate port you specified in your SSH command)
6. Ensure "SOCKS v5" is selected
7. Ensure "Proxy DNS when using SOCKS v5" is selected
8. Select "OK"

!!! warning

You will not have internet access in Firefox while using the proxy. Select "No proxy" to re-enable internet access.

For ease of use, many users have had success using extensions like FoxyProxy, or using a separate web browser for accessing resources that require the proxied connection.
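To verify the tunnel without configuring a browser, you can send a single request through the SOCKS proxy from another terminal on your local system; a minimal sketch, assuming the default port `25565`:

```bash
# Route one request through the SOCKS5 tunnel opened by the ssh command above
curl --socks5-hostname localhost:25565 -I https://gitlab-sunspot.alcf.anl.gov
```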


[^1]: `25565` is the proxy port; it may be changed as needed.
22 changes: 22 additions & 0 deletions docs/contacting-support/alcf-users-slack.md
@@ -0,0 +1,22 @@
# ALCF Users Slack Workspace

The ALCF Users Slack workspace is a platform intended for current, active ALCF users[^1] where the user community can interact, collaborate, and help one another. This workspace is a pilot program valid for 1 year.
ALCF staff presence in the workspace is limited, and staff may not always be available to assist users with their queries. **The official mechanism for requesting support is to email [[email protected]](mailto:[email protected]) for technical issues and [[email protected]](mailto:[email protected]) for account and access-related issues.**

## Getting access

ALCF Users Slack access is automatically provided to members of INCITE, ALCC, and ESP projects. Certain DD projects[^1] can request Slack access for their teams by submitting a support ticket to [[email protected]](mailto:[email protected]).

[^1]: ALCF users working on scientific or research campaigns across various allocation types are eligible for access to the ALCF Users Slack workspace. Users participating in training events, instructional courses, or lighthouse projects are **not** eligible for Users Slack access.

## Logging into Slack

ALCF Users Slack uses ALCF credentials for access. Active ALCF users who have been granted access should log in here: [https://alcf-users.slack.com](https://alcf-users.slack.com).

## Using Slack Channels

Once you are logged into the Slack workspace, your default channels will show up in the navigation pane. All system-specific announcements will be published on the `#announcements` channel. You can browse and join existing public channels or create private channels for your project. You can send direct messages to your collaborators. While you cannot create public channels, you can email [[email protected]](mailto:[email protected]) to request one.

The ALCF Users Slack workspace should not be used to discuss, store, or operate any NDA/RSNDA, Official Use Only (OUO), or Business Sensitive information.

Argonne's **[code of conduct](https://www.alcf.anl.gov/about/code-of-conduct)** governs the use of all ALCF resources. By joining and using the workspace, you agree to these terms. Users violating Argonne's code of conduct may be removed from the workspace.
176 changes: 66 additions & 110 deletions docs/polaris/containers/containers.md
@@ -1,46 +1,61 @@
# Containers on Polaris
Polaris, powered by NVIDIA A100 GPUs, benefits from container-based workloads for seamless compatibility across NVIDIA systems. This guide details the use of containers on Polaris, including custom container creation, large-scale execution, and common pitfalls.

Polaris, equipped with NVIDIA A100 GPUs, leverages container-based workloads for seamless compatibility across NVIDIA systems. This guide provides detailed instructions on using containers on Polaris, including setup, container creation, large-scale execution, and troubleshooting common issues.

## Apptainer Setup

Polaris employs Apptainer (formerly known as Singularity) for container management. To set up Apptainer, run:
Polaris uses Apptainer (formerly Singularity) for container management. Request a compute node as follows:

```bash
ml use /soft/modulefiles
ml load spack-pe-base/0.8.1
ml load apptainer
ml load e2fsprogs
apptainer version #1.2.2
qsub -I -A <PROJECT_NAME> -q debug -l select=1 -l walltime=01:00:00 -l filesystems=home:grand:eagle -l singularity_fakeroot=true # Debug queue for 1 hour
```

The Apptainer version on Polaris is 1.2.2. Detailed user documentation is available [here](https://apptainer.org/docs/user/1.2/).
After connecting to the compute node, load Apptainer and necessary modules:

## Building from Docker or Argonne GitHub Container Registry
```bash
ml use /soft/modulefiles
ml spack-pe-base/0.8.1
ml use /soft/spack/testing/0.8.1/modulefiles
ml apptainer/main
ml load e2fsprogs

Containers on Polaris can be built by writing Dockerfiles on a local machine and then publishing the container to DockerHub, or by directly building them on an ALCF compute node by writing an Apptainer recipe file. If you prefer to use existing containers, you can pull them from various registries like DockerHub and run them on Polaris.
export BASE_SCRATCH_DIR=/local/scratch/ # For Polaris
export APPTAINER_TMPDIR=$BASE_SCRATCH_DIR/apptainer-tmpdir
mkdir -p $APPTAINER_TMPDIR

Since Docker requires root privileges, which users do not have on Polaris, existing Docker containers must be converted to Apptainer. To build a Docker-based container on Polaris, use the following as an example:
export APPTAINER_CACHEDIR=$BASE_SCRATCH_DIR/apptainer-cachedir
mkdir -p $APPTAINER_CACHEDIR

```bash
qsub -I -A <Project> -q debug -l select=1 -l walltime=01:00:00 -l filesystems=home:eagle -l singularity_fakeroot=true
# For internet access
export HTTP_PROXY=http://proxy.alcf.anl.gov:3128
export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128
export http_proxy=http://proxy.alcf.anl.gov:3128
export https_proxy=http://proxy.alcf.anl.gov:3128
ml use /soft/modulefiles
ml load spack-pe-base/0.8.1
ml load apptainer
ml load e2fsprogs

apptainer version # should return 1.4.0-rc.1+24-g6ae1a25f2
```

Detailed Apptainer documentation is available [here](https://apptainer.org/docs/user/latest/).
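As a quick sanity check of the setup (and of the proxy variables above), you can pull and run a small public image; the image choice below is only an example:

```bash
# Pull a small image from DockerHub and run a trivial command inside it
apptainer pull alpine.sif docker://alpine:3.19
apptainer exec alpine.sif cat /etc/os-release
```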

## Building Containers from Docker or Argonne GitHub Container Registry

Containers can be built by:
- Creating Dockerfiles locally and publishing to DockerHub, then converting to Apptainer.
- Building directly on ALCF nodes using Apptainer recipe files.

To convert a Docker container to Apptainer on Polaris, use:

```bash
apptainer build --fakeroot pytorch:22.06-py3.sing docker://nvcr.io/nvidia/pytorch:22.06-py3
```

You can find the latest prebuilt Nvidia PyTorch containers [here](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch). The TensorFlow containers are [here](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow) (though note that LCF doesn't typically prebuild the TF-1 containers). You can search the full container registry [here](https://catalog.ngc.nvidia.com/containers). For custom containers tailored for Polaris, visit [ALCF's GitHub container registry](https://github.com/argonne-lcf/container-registry/tree/main).
Find prebuilt NVIDIA PyTorch containers [here](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch). TensorFlow containers are [here](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow). Search the full container registry [here](https://catalog.ngc.nvidia.com/containers). Custom containers tailored for Polaris are available in [ALCF's GitHub container registry](https://github.com/argonne-lcf/container-registry/tree/main).

> **Note:** Currently, container build and executions are only supported on the Polaris compute nodes.
> **Note:** Container build and execution are only supported on Polaris compute nodes.
## Running Containers on Polaris

To run a container on Polaris, you can use the submission script described [here](https://github.com/argonne-lcf/container-registry/blob/main/containers/mpi/Polaris/job_submission.sh). Below is an explanation of the job submission script.
Use the submission script detailed [here](https://github.com/argonne-lcf/container-registry/blob/main/containers/mpi/Polaris/job_submission.sh). Example job script:

```bash
#!/bin/sh
@@ -51,124 +66,65 @@ To run a container on Polaris, you can use the submission script described [here
#PBS -l filesystems=home:eagle
#PBS -A <project_name>
cd ${PBS_O_WORKDIR}
echo $CONTAINER
```

We change to the job's working directory, load Apptainer, and set the proxy variables to enable network access at runtime.

```bash
# SET proxy for internet access
ml use /soft/modulefiles
ml load spack-pe-base/0.8.1
ml load apptainer
ml spack-pe-base/0.8.1
ml use /soft/spack/testing/0.8.1/modulefiles
ml apptainer/main
ml load e2fsprogs

export BASE_SCRATCH_DIR=/local/scratch/
export APPTAINER_TMPDIR=$BASE_SCRATCH_DIR/apptainer-tmpdir
mkdir -p $APPTAINER_TMPDIR
export APPTAINER_CACHEDIR=$BASE_SCRATCH_DIR/apptainer-cachedir
mkdir -p $APPTAINER_CACHEDIR

# Proxy setup for internet access
export HTTP_PROXY=http://proxy.alcf.anl.gov:3128
export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128
export http_proxy=http://proxy.alcf.anl.gov:3128
export https_proxy=http://proxy.alcf.anl.gov:3128
```

The following environment variables are needed so that the system MPICH (Cray MPICH on Polaris) binds to the container's MPICH:

```bash
ADDITIONAL_PATH=/opt/cray/pe/pals/1.2.12/lib
# Environment variables for MPI
export ADDITIONAL_PATH=/opt/cray/pe/pals/1.2.12/lib
module load cray-mpich-abi
export APPTAINERENV_LD_LIBRARY_PATH="$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH:$ADDITIONAL_PATH"
```
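One way to confirm that the container picks up the host's Cray MPICH at run time is to inspect the dynamic linkage of the example binary; a hedged sketch, assuming the hello-world binary path used in the launch commands below:

```bash
# Optional check: the MPI libraries should resolve to host paths under /opt/cray
apptainer exec -B /opt -B /var/run/palsd/ $CONTAINER ldd /usr/source/mpi_hello_world | grep -i mpi
```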

Set the number of ranks per node according to your scaling requirements:

```bash
# MPI example w/ 16 MPI ranks per node spread evenly across cores
NODES=`wc -l < $PBS_NODEFILE`
# Set MPI ranks
NODES=$(wc -l < $PBS_NODEFILE)
PPN=16
PROCS=$((NODES * PPN))
echo "NUM_OF_NODES= ${NODES} TOTAL_NUM_RANKS= ${PROCS} RANKS_PER_NODE= ${PPN}"
```
echo "NUM_OF_NODES=${NODES}, TOTAL_NUM_RANKS=${PROCS}, RANKS_PER_NODE=${PPN}"

Finally, launch your script:
# Launch the container
mpiexec -hostfile $PBS_NODEFILE -n $PROCS -ppn $PPN apptainer exec --fakeroot -B /opt -B /var/run/palsd/ $CONTAINER /usr/source/mpi_hello_world

```bash
echo C++ MPI
mpiexec -hostfile $PBS_NODEFILE -n $PROCS -ppn $PPN apptainer exec -B /opt -B /var/run/palsd/ $CONTAINER /usr/source/mpi_hello_world

echo Python MPI
mpiexec -hostfile $PBS_NODEFILE -n $PROCS -ppn $PPN apptainer exec -B /opt -B /var/run/palsd/ $CONTAINER python3 /usr/source/mpi_hello_world.py
# Python example
mpiexec -hostfile $PBS_NODEFILE -n $PROCS -ppn $PPN apptainer exec --fakeroot -B /opt -B /var/run/palsd/ $CONTAINER python3 /usr/source/mpi_hello_world.py
```

The job can be submitted using:
Submit jobs using:

```bash
qsub -v CONTAINER=mpich-4_latest.sif job_submission.sh
```

<!-- --8<-- [start:commoncontainerdoc] -->

## Recipe-Based Container Building
## Available Containers

As mentioned earlier, you can build Apptainer containers from recipe files. Instructions are available [here](https://apptainer.org/docs/user/1.2/build_a_container.html#building-containers-from-apptainer-definition-files). See [available containers](#available-containers) for more recipes.
- Examples for running MPICH containers can be found [here](https://github.com/argonne-lcf/container-registry/tree/main/containers/mpi/Polaris).
- Examples for running databases can be found [here](https://github.com/argonne-lcf/container-registry/tree/main/containers/databases).
- Examples for using SHPC (containers as modules) can be found [here](https://github.com/argonne-lcf/container-registry/tree/main/containers/shpc).

> Note: You can also build custom recipes by bootstrapping from prebuilt images. For example, the first two lines in a recipe to use our custom TensorFlow implementation would be `Bootstrap: oras` followed by `From: ghcr.io/argonne-lcf/tf2-mpich-nvidia-gpu:latest`.
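For illustration, a minimal recipe of that form can be written and built directly on a compute node; the definition file name, output name, and `%post` contents below are examples only (and assume `pip` exists in the base image):

```bash
# Hypothetical recipe bootstrapping from the prebuilt TensorFlow image mentioned above
cat > my-tf2.def <<'EOF'
Bootstrap: oras
From: ghcr.io/argonne-lcf/tf2-mpich-nvidia-gpu:latest

%post
    # Add project-specific Python packages on top of the base image
    pip install --no-cache-dir rich
EOF

apptainer build --fakeroot my-tf2.sif my-tf2.def
```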
## Available containers

If you just want to know what containers are available, here you go:

* Examples for running MPICH containers can be found [here](https://github.com/argonne-lcf/container-registry/tree/main/containers/mpi/Polaris).

* Examples for running databases can be found [here](https://github.com/argonne-lcf/container-registry/tree/main/containers/databases).

* Examples for using SHPC, which allows running containers as modules, can be found [here](https://github.com/argonne-lcf/container-registry/tree/main/containers/shpc).

The latest containers are updated periodically. If you have trouble using containers or want to request a newer or different container, please contact ALCF support at `[email protected]`.

## Troubleshooting Common Issues

**Permission Denied Error**: If you encounter permission errors during the build:

* Check your quota and delete any unnecessary files.

* Clean up the Apptainer cache, `~/.apptainer/cache`, and set the Apptainer tmp and cache directories as shown below. If your home directory is full and you are building your container on a compute node, set the tmpdir and cachedir to local scratch:

```bash
export BASE_SCRATCH_DIR=/local/scratch/ # FOR POLARIS
#export BASE_SCRATCH_DIR=/raid/scratch/ # FOR SOPHIA
export APPTAINER_TMPDIR=$BASE_SCRATCH_DIR/apptainer-tmpdir
mkdir $APPTAINER_TMPDIR
export APPTAINER_CACHEDIR=$BASE_SCRATCH_DIR/apptainer-cachedir/
mkdir $APPTAINER_CACHEDIR
```

* Make sure you are not in a directory accessed with a symbolic link, i.e., check if `pwd` and `pwd -P` return the same path.

* If any of the above doesn't work, try running the build in your home directory.

**Mapping to rank 0 on all nodes**: Ensure that the container's MPI aligns with the system MPI. For example, follow the additional steps outlined in the [container registry documentation for MPI on Polaris](https://github.com/argonne-lcf/container-registry/tree/main/containers/mpi/Polaris).

**libmpi.so.40 not found**: This can happen if the container's application has an OpenMPI dependency, which is not currently supported on Polaris. It can also occur if the container's base image is not a Debian-based distribution such as Ubuntu. Ensure the application has an MPICH implementation as well. Also, try removing the `.conda/`, `.cache/`, and `.local/` folders from your home directory and rebuilding the container.

**Disabled port mapping, user namespace, and network virtualization**: [Network virtualization](https://apptainer.org/docs/user/main/networking.html) is disabled for the container due to security constraints. See issue [#2553](https://github.com/apptainer/apptainer/issues/2553).

!!! bug "Apptainer instance errors with version 1.3.2"

Use `nohup` and `&` as an alternative if you want to run Apptainer as a background process. See below for an example of running Postgres as a background process:
```bash linenums="1"
nohup apptainer run \
-B pgrun:/var/run/postgresql \
-B pgdata:/var/lib/postgresql/data \
--env-file pg.env \
postgres.sing postgres &

# 3) Capture its PID so we can kill it later
echo $! > postgres_pid.txt
echo "Started Postgres in the background with PID $(cat postgres_pid.txt)"
- **Permission Denied:** Check your quota, clean Apptainer cache (`~/.apptainer/cache`), or set directories to local scratch (`/local/scratch/`).
- **MPI Issues:** Ensure MPI compatibility by following [MPI container registry docs](https://github.com/argonne-lcf/container-registry/tree/main).
- **libmpi.so.40 not found:** Use MPICH-compatible base images.
- **Disabled Network Virtualization:** Network virtualization is disabled due to security constraints ([details](https://apptainer.org/docs/user/main/networking.html)).
- **Starter-suid Error:** Always use the `--fakeroot` flag on Polaris compute nodes.

# 4) Perform whatever work you need while Postgres is running
# In this demo, we just sleep for 30 minutes (1800 seconds).
sleep 1800
For further assistance, contact ALCF support: `[email protected]`.

# 5) Kill the background process at the end of the job
kill "$(cat postgres_pid.txt)"
rm postgres_pid.txt
```
<!-- --8<-- [end:commoncontainerdoc] -->

