This repo contains installation instructions, tutorials, and examples for two core CS technologies from our PSAAPIII center:
- PyKokkos: A Python-to-Kokkos interface & JIT compiler
- Parla: Python Library for task parallel programming for heterogeneous single-node development
We provide a Docker container at wlruys/nuwest for easy deployment, as Kokkos may take over 30 minutes to build. Visitors are welcome to build from source using the instructions below if their machine requires it.
- Docker or Apptainer
- TODO: Test Podman on Lassen
- For GPU support: NVIDIA Container Toolkit
- 10GB Free HDD Space (for the container)
If running on a TACC system, we use the Apptainer container runtime. This can be loaded with the following command:
module load tacc-apptainer
We provide scripts for both the apptainer and docker container runtimes at runner/apptainer and runner/docker, respectively. Below we show instructions with the docker runtime, but the apptainer runtime can be used by replacing docker with apptainer in the commands below.
chmod +x runner/docker/*.sh
./runner/docker/install.sh <container name>
Use the container that best matches your system:
System | Container |
---|---|
CPU-only | wlruys/nuwest:cpu |
CUDA 11.3 (SM70) | wlruys/nuwest:volta-multi |
CUDA 11.3 (SM75) | wlruys/nuwest:turing-multi |
CUDA 11.3 (SM80) | wlruys/nuwest:ampere-multi |
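The table above can also be written as a small lookup. This is only a sketch: `pick_container` is a hypothetical helper that is not part of the runner scripts, and it assumes you already know your GPU's compute capability (or pass `cpu`).

```shell
# Hypothetical helper (not part of this repo): map a system type to the
# container tag listed in the table above.
pick_container() {
    case "$1" in
        cpu)      echo "wlruys/nuwest:cpu" ;;
        7.0|sm70) echo "wlruys/nuwest:volta-multi" ;;
        7.5|sm75) echo "wlruys/nuwest:turing-multi" ;;
        8.0|sm80) echo "wlruys/nuwest:ampere-multi" ;;
        *)        echo "no matching container for '$1'" >&2; return 1 ;;
    esac
}

# Example: choose the Turing container and print the matching install command.
CONTAINER=$(pick_container 7.5)
echo "./runner/docker/install.sh $CONTAINER"
```

The command is printed rather than executed so it can be checked before pulling a multi-gigabyte image.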
./runner/docker/run.sh lessons/parla/scripts/01_hello.py
This will run the script in the container and print the output to the terminal.
The --use-gpu flag is available to run on a GPU-enabled container.
./runner/docker/notebook.sh
This will launch a Jupyter Notebook Server on port 8888 with password NUWEST2024.
The --use-gpu flag is available to run on a GPU-enabled container.
If you are running the container on a remote machine, you can connect to the Jupyter Notebook Server via SSH tunneling.
ssh -L <local_port>:localhost:8888 <username>@<remote machine>
Then, open a browser and navigate to localhost:<local_port> to access the Jupyter Notebook Server.
If you are running the container on a SLURM cluster, you must start the Jupyter Notebook Server on a compute node. You will need to request an allocation with 1 node and at least 1 GPU (if using a GPU-enabled container) and run the ./runner/docker/notebook.sh script on the compute node itself.
The connection will need to be forwarded back to your local machine via SSH tunneling. This can be done via a ProxyJump to the compute node through the login node.
While we have tested on TACC, note that the firewall settings on your cluster may prevent you from opening and forwarding the necessary ports.
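The ProxyJump forwarding described above can be sketched as follows. The host names are placeholders and `tunnel_cmd` is a hypothetical helper; the snippet only prints the command so it can be adapted to your cluster. It assumes the notebook server listens on port 8888 on the compute node and that your OpenSSH client supports `-J`.

```shell
# Placeholders: replace with your own values before running the printed command.
LOCAL_PORT=8888
USERNAME="<username>"
LOGIN_NODE="<login node>"
COMPUTE_NODE="<compute node>"

# Build the tunnel command. -J (ProxyJump) hops through the login node;
# -L forwards the local port to port 8888 on the compute node.
tunnel_cmd() {
    echo "ssh -J $USERNAME@$LOGIN_NODE -L $LOCAL_PORT:localhost:8888 $USERNAME@$COMPUTE_NODE"
}

# Print the command instead of running it, so it can be inspected first.
tunnel_cmd
```

Once the tunnel is open, the notebook is reachable at localhost:<local_port> in a local browser, as in the single-hop case.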
- conda or mamba (recommended) Python package manager
- cmake >= 3.28, git
- C++17 compatible compiler (e.g. gcc >= 7.3.0)
- CUDA 11.3 (optional, for GPU support)
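Before building from source, the tool versions listed above can be checked with a short script. This is a minimal sketch, not part of the repo: `version_ge` and `check_tool` are hypothetical helpers, the comparison relies on GNU `sort -V`, and the version is taken to be the first dotted number printed by `--version`, which may not hold for every compiler build.

```shell
# version_ge A B: succeeds when version A >= version B.
# Relies on GNU `sort -V` (version sort), which is an assumption about the system.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_tool NAME MINIMUM: report whether NAME is on the PATH and new enough.
# The version is taken as the first dotted number in `NAME --version`.
check_tool() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "$1: not found"
        return 1
    fi
    found=$("$1" --version 2>/dev/null | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1)
    if version_ge "$found" "$2"; then
        echo "$1: $found (ok)"
    else
        echo "$1: $found (need >= $2)"
        return 1
    fi
}

# Report on the two versioned requirements without aborting the shell.
check_tool cmake 3.28 || true
check_tool gcc 7.3.0 || true
```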
Assumes a sh compatible shell in a Linux environment.
wget -q https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh -O Miniforge.sh
export INSTALL_DIR=<location>
bash Miniforge.sh -b -p "$INSTALL_DIR/miniforge"
mamba init
source $INSTALL_DIR/miniforge/etc/profile.d/conda.sh
mamba create -n nuwest_pykokkos python=3.11
conda activate nuwest_pykokkos
In the following installations, we will assume that the nuwest_pykokkos environment is active.
We use pip rather than mamba for the Python dependencies because of CuPy (and potentially Numba).
pip more easily builds or stubs against the system CUDA libraries without pulling in a potentially conflicting cuda-toolkit package.
# Install Python dependencies
pip install -U pip setuptools wheel
pip install numpy scikit-build pytest pyyaml psutil jupyter ipython jupyterlab notebook
mamba install "pybind11>=2.11.1" "cmake>=3.28" "patchelf>=0.17.2"
pip install cupy-cuda11x #(optional, for GPU support)
# Clone and install PyKokkos Base (compiles Kokkos)
git clone https://github.com/kokkos/pykokkos-base
cd pykokkos-base
python setup.py install -- \
-DENABLE_LAYOUTS=ON \
-DENABLE_OPENMP=ON \
-DCMAKE_CXX_STANDARD=17 \
-DENABLE_MEMORY_TRAITS=OFF \
-DENABLE_VIEW_RANKS=3 \
-DENABLE_THREADS=OFF \
-DENABLE_CUDA=ON \
-DKokkos_ARCH_TURING75=ON
cd ..
# Clone and install PyKokkos Interface
git clone https://github.com/kokkos/pykokkos
cd pykokkos
python -m pip install .
cd ..
The architecture flag -DKokkos_ARCH_TURING75=ON should be changed to match your GPU architecture.
See Kokkos' documentation for the list of architecture flags.
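As a rough guide, the flag can be derived from the GPU's compute capability. The sketch below covers only a few common NVIDIA architectures; `kokkos_arch_flag` is a hypothetical helper, and the `compute_cap` query field of `nvidia-smi` is assumed to be supported by the installed driver. Kokkos' documentation remains the reference for the full list.

```shell
# Hypothetical helper: map a CUDA compute capability to a Kokkos architecture flag.
# Only a few common NVIDIA architectures are listed here.
kokkos_arch_flag() {
    case "$1" in
        7.0) echo "-DKokkos_ARCH_VOLTA70=ON" ;;
        7.5) echo "-DKokkos_ARCH_TURING75=ON" ;;
        8.0) echo "-DKokkos_ARCH_AMPERE80=ON" ;;
        8.6) echo "-DKokkos_ARCH_AMPERE86=ON" ;;
        *)   echo "unknown compute capability '$1'; see the Kokkos documentation" >&2
             return 1 ;;
    esac
}

# Query the first GPU if nvidia-smi is available. The compute_cap query field
# is assumed to be supported by the installed driver.
if command -v nvidia-smi >/dev/null 2>&1; then
    CAP=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)
    kokkos_arch_flag "$CAP" || true
fi
```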
Note that building Kokkos may take a very long time, as the python setup.py install command builds Kokkos from source and is currently single-threaded.
To speed up the process, you can manually build Kokkos with multiple threads using the following workaround.
python setup.py build -- \
-DENABLE_LAYOUTS=ON \
-DENABLE_OPENMP=ON \
-DCMAKE_CXX_STANDARD=17 \
...
# all other flags
# Kill the PyKokkos-Base build process after it has started and reached (1%)
# Ctrl + C
cd _skbuild/<arch>/cmake-build
make -j <num threads>
cd ../../../
python setup.py install -- \
-DENABLE_LAYOUTS=ON \
-DENABLE_OPENMP=ON \
-DCMAKE_CXX_STANDARD=17 \
...
# all other flags
# Install Python dependencies
pip install numpy pyyaml psutil jupyter ipython jupyterlab notebook cython pytest scikit-build-core
pip install cupy-cuda11x #(optional, for GPU support)
# Clone and install Parla
git clone https://github.com/ut-parla/parla-experimental
cd parla-experimental
git submodule update --init --recursive
python -m pip install . --verbose
cd ..
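After installation, a quick import check can confirm that the packages are visible to the active environment. This is a sketch under assumptions: `check_import` is a hypothetical helper, and the module names `pykokkos` and `parla` are assumed to match the installed packages. Run it in the environment where each package was installed.

```shell
# Use whichever Python interpreter is first on the PATH.
PYTHON=$(command -v python || command -v python3)

# check_import MODULE: succeeds when MODULE can be imported.
check_import() {
    if "$PYTHON" -c "import $1" >/dev/null 2>&1; then
        echo "$1: ok"
    else
        echo "$1: import failed"
        return 1
    fi
}

# Module names are assumptions; adjust them if the packages install under other names.
check_import pykokkos || true
check_import parla || true
```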
See NVIDIA Nsight Systems for more information. Additional details and configuration options are available in NVIDIA's documentation.
The tutorial scripts assume that Nsight Systems 2023.4 is available on the system path as nsys-ui and nsys.
Older versions of Nsight Systems do not support the python-gil trace option.
This option can be removed from the tutorial scripts if necessary.
Mixing versions of nsys and nsys-ui may work but is not recommended.