Merge branch 'main' into feature/acronyms

Showing 9 changed files with 684 additions and 801 deletions.

# ALCF Users Slack Workspace

The ALCF Users Slack workspace is a platform intended for current, active ALCF users[^1] where the user community can interact, collaborate, and help one another. This workspace is a pilot program valid for 1 year.

This workspace is a platform for the user community to interact and collaborate. ALCF staff may have a limited presence and may not be available at all times to assist users with their queries. **The official mechanism for requesting support is to email [[email protected]](mailto:[email protected]) for technical issues and [[email protected]](mailto:[email protected]) for account and access-related issues.**

## Getting access

ALCF Users Slack access is automatically provided to members of INCITE, ALCC, and ESP projects. Certain DD projects (**see footnote**)[^1] can request Slack access for their teams by submitting a support ticket to [[email protected]](mailto:[email protected]).

[^1]: ALCF users working on scientific or research campaigns across various allocation types are eligible for access to the ALCF Users Slack workspace. Users participating in training events, instructional courses, or lighthouse projects are **not** eligible for Users Slack access.

## Logging into Slack

ALCF Users Slack uses ALCF credentials for access. Active ALCF users who have been granted access should log in here: [https://alcf-users.slack.com](https://alcf-users.slack.com).

## Using Slack Channels

Once you are logged into the Slack workspace, your default channels will show up in the navigation pane. All system-specific announcements will be published in the `#announcements` channel. You can browse and join existing public channels or create private channels for your project, and you can send direct messages to your collaborators. While you cannot create public channels yourself, you can email [[email protected]](mailto:[email protected]) to request one.

The ALCF Users Slack workspace should not be used to discuss, store, or operate any NDA/RSNDA, Official Use Only (OUO), or Business Sensitive information.

Argonne's **[code of conduct](https://www.alcf.anl.gov/about/code-of-conduct)** governs the use of all ALCF resources. By joining and using the workspace, you agree to these terms. Users violating Argonne's code of conduct may be removed from the workspace.

# Containers on Polaris

Polaris, equipped with NVIDIA A100 GPUs, leverages container-based workloads for seamless compatibility across NVIDIA systems. This guide provides detailed instructions on using containers on Polaris, including setup, container creation, large-scale execution, and troubleshooting common issues.

## Apptainer Setup

Polaris uses Apptainer (formerly Singularity) for container management. Request a compute node as follows:

```bash
qsub -I -A <PROJECT_NAME> -q debug -l select=1 -l walltime=01:00:00 -l filesystems=home:grand:eagle -l singularity_fakeroot=true # Debug queue for 1 hour
```

After connecting to the compute node, load Apptainer and the necessary modules:

```bash
ml use /soft/modulefiles
ml spack-pe-base/0.8.1
ml use /soft/spack/testing/0.8.1/modulefiles
ml apptainer/main
ml load e2fsprogs

export BASE_SCRATCH_DIR=/local/scratch/ # For Polaris
export APPTAINER_TMPDIR=$BASE_SCRATCH_DIR/apptainer-tmpdir
mkdir -p $APPTAINER_TMPDIR

export APPTAINER_CACHEDIR=$BASE_SCRATCH_DIR/apptainer-cachedir
mkdir -p $APPTAINER_CACHEDIR

# For internet access
export HTTP_PROXY=http://proxy.alcf.anl.gov:3128
export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128
export http_proxy=http://proxy.alcf.anl.gov:3128
export https_proxy=http://proxy.alcf.anl.gov:3128

apptainer version # should return 1.4.0-rc.1+24-g6ae1a25f2
```

Detailed Apptainer documentation is available [here](https://apptainer.org/docs/user/latest/).

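As a quick check that the setup works, you can pull and run a small public image. This is a minimal sketch; it assumes the proxy variables above are exported so the compute node can reach Docker Hub, and `alpine` is just an arbitrary small test image:

```bash
# Pull a small image from Docker Hub and print its OS release from inside the container
apptainer exec docker://alpine:latest cat /etc/os-release
```
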
## Building Containers from Docker or Argonne GitHub Container Registry

Containers can be built by:

- Creating Dockerfiles locally, publishing them to DockerHub, and then converting them to Apptainer.
- Building directly on ALCF compute nodes using Apptainer recipe files.

Since Docker requires root privileges, which users do not have on Polaris, existing Docker containers must be converted to Apptainer. To convert a Docker container to Apptainer on Polaris, use:

```bash
apptainer build --fakeroot pytorch:22.06-py3.sing docker://nvcr.io/nvidia/pytorch:22.06-py3
```

Find prebuilt NVIDIA PyTorch containers [here](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch). TensorFlow containers are [here](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow). Search the full container registry [here](https://catalog.ngc.nvidia.com/containers). Custom containers tailored for Polaris are available in [ALCF's GitHub container registry](https://github.com/argonne-lcf/container-registry/tree/main).

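Images from the ALCF registry can be pulled directly with the `oras` protocol. A hedged sketch, assuming the `tf2-mpich-nvidia-gpu:latest` image referenced in the recipe note later in this guide is still published under that name:

```bash
# Pull a prebuilt TensorFlow image from ALCF's GitHub container registry into a local SIF file
apptainer pull tf2-mpich-nvidia-gpu.sif oras://ghcr.io/argonne-lcf/tf2-mpich-nvidia-gpu:latest
```
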
> **Note:** Container build and execution are currently only supported on Polaris compute nodes.

## Running Containers on Polaris

To run a container on Polaris, use the submission script described [here](https://github.com/argonne-lcf/container-registry/blob/main/containers/mpi/Polaris/job_submission.sh). The job submission script is explained step by step below.

```bash
#!/bin/sh
#PBS -l filesystems=home:eagle
#PBS -A <project_name>
cd ${PBS_O_WORKDIR}
echo $CONTAINER
```

We move to the current working directory, load Apptainer and its dependencies, set up Apptainer's temporary and cache directories on local scratch, and enable network access at runtime by setting the proxy variables.

```bash
ml use /soft/modulefiles
ml spack-pe-base/0.8.1
ml use /soft/spack/testing/0.8.1/modulefiles
ml apptainer/main
ml load e2fsprogs

export BASE_SCRATCH_DIR=/local/scratch/
export APPTAINER_TMPDIR=$BASE_SCRATCH_DIR/apptainer-tmpdir
mkdir -p $APPTAINER_TMPDIR
export APPTAINER_CACHEDIR=$BASE_SCRATCH_DIR/apptainer-cachedir
mkdir -p $APPTAINER_CACHEDIR

# Proxy setup for internet access
export HTTP_PROXY=http://proxy.alcf.anl.gov:3128
export HTTPS_PROXY=http://proxy.alcf.anl.gov:3128
export http_proxy=http://proxy.alcf.anl.gov:3128
export https_proxy=http://proxy.alcf.anl.gov:3128
```

These settings are important for the system MPICH (Cray MPICH on Polaris) to bind to the container's MPICH. Set the following environment variables:

```bash
# Environment variables for MPI
export ADDITIONAL_PATH=/opt/cray/pe/pals/1.2.12/lib
module load cray-mpich-abi
export APPTAINERENV_LD_LIBRARY_PATH="$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH:$ADDITIONAL_PATH"
```

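To confirm that the host MPI libraries are actually visible inside the container, a quick hedged check (assuming `$CONTAINER` is the example MPICH image and `/usr/source/mpi_hello_world` is the binary it ships, as in the launch commands below) is to inspect the binary's shared-library resolution:

```bash
# Show which MPI library the containerized binary resolves to after the bind mounts
apptainer exec -B /opt -B /var/run/palsd/ $CONTAINER ldd /usr/source/mpi_hello_world | grep -i mpi
```
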
Set the number of ranks per node according to your scaling requirements:

```bash
# Set MPI ranks: 16 ranks per node, spread evenly across cores
NODES=$(wc -l < $PBS_NODEFILE)
PPN=16
PROCS=$((NODES * PPN))
echo "NUM_OF_NODES=${NODES}, TOTAL_NUM_RANKS=${PROCS}, RANKS_PER_NODE=${PPN}"

# Launch the C++ MPI example inside the container
mpiexec -hostfile $PBS_NODEFILE -n $PROCS -ppn $PPN apptainer exec --fakeroot -B /opt -B /var/run/palsd/ $CONTAINER /usr/source/mpi_hello_world

# Launch the Python MPI example inside the container
mpiexec -hostfile $PBS_NODEFILE -n $PROCS -ppn $PPN apptainer exec --fakeroot -B /opt -B /var/run/palsd/ $CONTAINER python3 /usr/source/mpi_hello_world.py
```

Submit the job using:

```bash
qsub -v CONTAINER=mpich-4_latest.sif job_submission.sh
```

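The same script can be reused with other images by changing the `CONTAINER` variable. A hedged example with the PyTorch image built earlier (you would adjust the commands inside the job script to match what that image actually provides):

```bash
# Hypothetical reuse of the same job script with the PyTorch image built above
qsub -v CONTAINER=pytorch:22.06-py3.sing job_submission.sh
```
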
<!-- --8<-- [start:commoncontainerdoc] -->

## Recipe-Based Container Building

You can also build Apptainer containers from recipe (definition) files. Instructions are available [here](https://apptainer.org/docs/user/1.2/build_a_container.html#building-containers-from-apptainer-definition-files). See [Available Containers](#available-containers) below for more recipes.

> Note: You can also build custom recipes by bootstrapping from prebuilt images. For example, the first two lines in a recipe that uses our custom TensorFlow implementation would be `Bootstrap: oras` followed by `From: ghcr.io/argonne-lcf/tf2-mpich-nvidia-gpu:latest`.

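For illustration, the following sketch writes a minimal recipe built on that idea and builds it on a compute node. The file name `my_tf2.def` and the extra `pip install` line are assumptions for the example, not an official ALCF recipe:

```bash
# Write a minimal definition file that bootstraps from the prebuilt ALCF TensorFlow image
# (the extra Python package is just an assumed example of customization)
cat > my_tf2.def <<'EOF'
Bootstrap: oras
From: ghcr.io/argonne-lcf/tf2-mpich-nvidia-gpu:latest

%post
    pip install --no-cache-dir scikit-learn
EOF

# Build the image from the recipe (run on a compute node with the modules loaded as above)
apptainer build --fakeroot my_tf2.sif my_tf2.def
```
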
## Available Containers

- Examples for running MPICH containers can be found [here](https://github.com/argonne-lcf/container-registry/tree/main/containers/mpi/Polaris).
- Examples for running databases can be found [here](https://github.com/argonne-lcf/container-registry/tree/main/containers/databases).
- Examples for using SHPC (running containers as modules) can be found [here](https://github.com/argonne-lcf/container-registry/tree/main/containers/shpc).

The latest containers are updated periodically. If you have trouble using containers or need a newer or different container, please contact ALCF support at `[email protected]`.

## Troubleshooting Common Issues

**Permission Denied Error**: If you encounter permission errors during the build:

* Check your quota and delete any unnecessary files.
* Clean up the Apptainer cache (`~/.apptainer/cache`) and set the Apptainer tmp and cache directories as shown below. If your home directory is full and you are building your container on a compute node, set the tmpdir and cachedir to local scratch:

```bash
export BASE_SCRATCH_DIR=/local/scratch/  # for Polaris
#export BASE_SCRATCH_DIR=/raid/scratch/  # for Sophia
export APPTAINER_TMPDIR=$BASE_SCRATCH_DIR/apptainer-tmpdir
mkdir -p $APPTAINER_TMPDIR
export APPTAINER_CACHEDIR=$BASE_SCRATCH_DIR/apptainer-cachedir
mkdir -p $APPTAINER_CACHEDIR
```

* Make sure you are not in a directory accessed through a symbolic link, i.e., check that `pwd` and `pwd -P` return the same path.
* If none of the above works, try running the build in your home directory.

**Mapping to rank 0 on all nodes**: Ensure that the container's MPI aligns with the system MPI. For example, follow the additional steps outlined in the [container registry documentation for MPI on Polaris](https://github.com/argonne-lcf/container-registry/tree/main/containers/mpi/Polaris).

**libmpi.so.40 not found**: This can happen if the container's application depends on OpenMPI, which is not currently supported on Polaris. It can also occur if the container's base image is not Debian-based (such as Ubuntu). Ensure the application has an MPICH implementation as well. Also, try removing the `.conda/`, `.cache/`, and `.local/` folders from your home directory and rebuilding the container.

**Disabled port mapping, user namespaces, and network virtualization**: [Network virtualization](https://apptainer.org/docs/user/main/networking.html) is disabled for containers due to security constraints. See issue [#2553](https://github.com/apptainer/apptainer/issues/2553).

**Starter-suid error**: Always use the `--fakeroot` flag when building and running containers on Polaris compute nodes.

!!! bug "Apptainer instance errors with version 1.3.2"

    Use `nohup` and `&` as an alternative if you want to run Apptainer as a background process. See below for an example of running Postgres as a background process:

    ```bash linenums="1"
    # 1) Start Postgres in the background inside the container
    nohup apptainer run \
        -B pgrun:/var/run/postgresql \
        -B pgdata:/var/lib/postgresql/data \
        --env-file pg.env \
        postgres.sing postgres &

    # 2) Capture its PID so we can kill it later
    echo $! > postgres_pid.txt
    echo "Started Postgres in the background with PID $(cat postgres_pid.txt)"

    # 3) Perform whatever work you need while Postgres is running.
    # In this demo, we just sleep for 30 minutes (1800 seconds).
    sleep 1800

    # 4) Kill the background process at the end of the job
    kill "$(cat postgres_pid.txt)"
    rm postgres_pid.txt
    ```

For further assistance, contact ALCF support: `[email protected]`.

<!-- --8<-- [end:commoncontainerdoc] -->