Merge pull request #241 from mwestphall/SOFTWARE-6086-container-gpu-support

SOFTWARE-6086: Container GPU Support
mwestphall authored Mar 3, 2025
2 parents 764d49f + d7e2ed1 commit 715d806
Showing 1 changed file with 107 additions and 59 deletions: docs/resource-sharing/os-backfill-containers.md
In order to configure the container, you will need:

1. An authentication token, obtained through the [OSPool Token Registry](https://os-registry.opensciencegrid.org/)
1. An [HTTP caching proxy](../data/run-frontier-squid-container.md) at or near your site.

Running the Container via RPM
---------------------------------
On EL hosts, the pilot container can also be managed via a systemctl service provided by the `ospool-ep` RPM.

- If your site has a [Squid HTTP Caching Proxy](https://osg-htc.org/docs/data/run-frontier-squid-container/) configured,
set `OSG_SQUID_LOCATION` to that proxy's HTTP address.

- If providing NVIDIA GPU resources, set `PROVIDE_NVIDIA_GPU=true`.
    - This automatically sets the variables described in the [Providing GPU Resources](#providing-gpu-resources) section (an illustrative configuration sketch follows these steps).

!!! note "GPU Configuration Minimum Version"
    GPU configuration support was added in version 24-2 of the `ospool-ep` RPM.
    Ensure your installation of the `ospool-ep` package is up to date with `yum update ospool-ep`.

1. Start the OSPool EP container service:

        :::console
        root@host # systemctl start ospool-ep

1. _Optional:_ Follow the OSPool EP container service logs:

        :::console
        root@host # journalctl -f -u ospool-ep


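As a rough illustration only, assuming your `ospool-ep` installation reads its settings from an environment file (the path `/etc/sysconfig/ospool-ep` below is an assumption; consult the RPM's own documentation for the actual location and recognized variables), a GPU-enabled configuration might look like:

```
# Hypothetical environment file for the ospool-ep service; the path and the
# exact set of recognized variables depend on your ospool-ep RPM version.
GLIDEIN_Site="..."
GLIDEIN_ResourceName="..."
OSG_SQUID_LOCATION="..."
# Enables the GPU-related settings described in "Providing GPU Resources" below
PROVIDE_NVIDIA_GPU=true
```
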
Running the Container with Docker
---------------------------------

The Docker image is hosted on [Docker Hub](https://hub.docker.com/r/opensciencegrid/osgvo-docker-pilot).
In order to successfully start payload jobs:


1. **Configure authentication:**
Authentication with the OSPool is performed using a token retrieved from the
[OSPool Token Registry](https://os-registry.opensciencegrid.org/),
which you then pass to the container by volume mounting it as a file under `/etc/condor/tokens-orig.d/`.
If you are using Docker to launch the container, this is done with the command line flag
`-v /path/to/token:/etc/condor/tokens-orig.d/flock.opensciencegrid.org`.
Replace `/path/to/token` with the full path to the token you obtained from the OSPool Token Registry.
1. Set `GLIDEIN_Site` and `GLIDEIN_ResourceName` to match the resource group name and resource name that you registered
in Topology, respectively.
1. Set the `OSG_SQUID_LOCATION` environment variable to the HTTP address of your preferred Squid instance.
1. _If providing NVIDIA GPU resources:_ see the [Providing GPU Resources](#providing-gpu-resources) section below.
1. _Strongly recommended:_ Enable [CVMFS](#recommended-cvmfs) via one of the mechanisms described below.
1. _Strongly recommended:_ If you want job I/O to be done in a separate directory outside of the container,
volume mount the desired directory on the host to `/pilot` inside the container.

Without this, user jobs may compete for disk space with other containers on your system.

If you are using Docker to launch the container, this is done with the command line flag
`-v /worker-temp-dir:/pilot`.
Replace `/worker-temp-dir` with a directory you created for jobs to write into.
Make sure the user you run the container as has write access to this directory
(a directory-preparation sketch follows the example below).

1. _Optional:_ set the `GLIDEIN_Start_Extra` environment variable to an expression that is appended to the
[HTCondor `START` expression](https://htcondor.readthedocs.io/en/latest/admin-manual/ep-policy-configuration.html#the-start-expression);
this limits the pilot to running only certain jobs (an illustrative expression follows this list).

1. _Optional:_ [limit OSG pilot container resource usage](#limiting-resource-usage)

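For example, a hypothetical extra `START` clause that makes the pilot match only jobs requesting at least one GPU could be passed as follows (add it to the `docker run` invocation shown below; the expression is illustrative and should be adapted to your site's policy):

```
# Illustrative flag only: appends a ClassAd clause so the pilot accepts
# GPU-requesting jobs exclusively. TARGET.RequestGPUs refers to the job ad.
-e GLIDEIN_Start_Extra='TARGET.RequestGPUs >= 1'
```
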
Here is an example invocation using `docker run` by hand:

```
docker run -it --rm --user osg \
--pull=always \
--privileged \
-v /path/to/token:/etc/condor/tokens-orig.d/flock.opensciencegrid.org \
-v /worker-temp-dir:/pilot \
-e GLIDEIN_Site="..." \
-e GLIDEIN_ResourceName="..." \
-e GLIDEIN_Start_Extra="True" \
-e OSG_SQUID_LOCATION="..." \
-e CVMFSEXEC_REPOS=" \
oasis.opensciencegrid.org \
singularity.opensciencegrid.org" \
hub.opensciencegrid.org/osg-htc/ospool-ep:24-release
```

Replace `/path/to/token` with the location where you saved the token obtained from the OSPool Token Registry.
Privileged mode (`--privileged`), requested in the `docker run` above, allows the container
to mount [CVMFS using cvmfsexec](#cvmfsexec) and to invoke `singularity` for user jobs.
Singularity (now known as Apptainer) allows OSPool users to run their jobs inside their own container (a common use case for GPU jobs).
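
A minimal sketch of preparing the host directory mounted at `/pilot` (the path `/worker-temp-dir` is illustrative; the sticky, world-writable mode guarantees the container's unprivileged user can write there, and you may prefer tighter ownership that matches your site's policy):

```
# Create a dedicated scratch directory for pilot job I/O (path is illustrative)
mkdir -p /worker-temp-dir
chmod 1777 /worker-temp-dir
```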


Optional Configuration
----------------------

Fill in the values for `/path/to/token`, `/worker-temp-dir`, `GLIDEIN_Site`, `GLIDEIN_ResourceName`, and `OSG_SQUID_LOCATION` [as above](#running-the-container-with-docker).


### Providing GPU Resources

By default, the container will not detect NVIDIA GPU resources available on its host. To give
the container access to its host's GPU resources, do the following:

1. Replace the default `24-release` Docker image tag with the [CUDA-enabled](https://developer.nvidia.com/cuda-toolkit)
`24-cuda_11_8_0-release` tag.

1. Bind-mount `/etc/OpenCL/vendors`, read-only. If you are using Docker to launch the container,
this is done with the command line flag `-v /etc/OpenCL/vendors:/etc/OpenCL/vendors:ro`.


1. Ensure the `singularity.opensciencegrid.org` CVMFS repository is enabled by following one of the methods
described in [CVMFS](#recommended-cvmfs).

1. The NVIDIA runtime is known to conflict with Singularity [PID namespaces](https://man7.org/linux/man-pages/man7/pid_namespaces.7.html).
Disable PID namespaces by adding the flag `-e SINGULARITY_DISABLE_PID_NAMESPACES=True`.

This is the [`docker run` example above](#running-the-container-with-docker), modified
to provide NVIDIA GPU resources:

```hl_lines="6 11 14 15"
docker run -it --rm --user osg \
--pull=always \
--privileged \
-v /path/to/token:/etc/condor/tokens-orig.d/flock.opensciencegrid.org \
-v /worker-temp-dir:/pilot \
-v /etc/OpenCL/vendors:/etc/OpenCL/vendors:ro \
-e GLIDEIN_Site="..." \
-e GLIDEIN_ResourceName="..." \
-e GLIDEIN_Start_Extra="True" \
-e OSG_SQUID_LOCATION="..." \
-e SINGULARITY_DISABLE_PID_NAMESPACES=True \
-e CVMFSEXEC_REPOS=" \
oasis.opensciencegrid.org \
singularity.opensciencegrid.org" \
hub.opensciencegrid.org/osg-htc/ospool-ep:24-cuda_11_8_0-release
```
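
To sanity-check that the GPU is visible inside the running pilot container, one option (assuming the NVIDIA driver and the NVIDIA Container Toolkit are set up on the host so that `nvidia-smi` is injected into the container) is to exec into it:

```
# Find the pilot container's name or ID, then run nvidia-smi inside it.
docker ps
docker exec -it <container-name-or-id> nvidia-smi
```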

### Limiting resource usage

By default, the container allows jobs to utilize the entire node's resources (CPUs, memory).
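
Independently of any pilot-level knobs, a generic way to cap the container itself is to pass Docker's standard resource flags to the `docker run` invocation shown earlier (the values below are placeholders, not recommendations):

```
# Standard Docker flags that cap the container's cgroup; the pilot may still
# advertise the full node to the OSPool unless it also detects or is told
# about the smaller allocation.
--cpus=8 \
--memory=16g \
```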
