
[Manual] Devito on ARCHER2


Installing Devito on ARCHER2

Before you start, some important reading:
https://docs.archer2.ac.uk/quick-start/quickstart-users/
https://docs.archer2.ac.uk/user-guide/tuning/

Important: Parallel jobs on ARCHER2 should be run from the work file systems as the home file systems are not available on the compute nodes - you will see a chdir or file not found error if you try to access data on the home file system within a parallel job running on the compute nodes.

# After completing the registration
# `ssh` to a login node (your SSH private key is needed)
ssh username@login.archer2.ac.uk -vv

# Change directory to the work file system
cd /work/"project-number"/"project-number"/"username"/
module load cray-python
# Create a python3 virtual environment and activate it
# (the environment name `devito-env` is arbitrary)
python3 -m venv devito-env
source devito-env/bin/activate

# If Devito is not yet cloned:
git clone https://github.com/devitocodes/devito
cd devito
pip3 install -e .

# Load Cray MPI / https://docs.nersc.gov/development/programming-models/mpi/cray-mpich/
module load cray-mpich
# Build mpi4py using Cray's compiler wrapper
# (the craype version in the path below may differ on your system; check with `which cc`)
env MPICC=/opt/cray/pe/craype/2.7.6/bin/cc pip3 install -r requirements-mpi.txt
export OMP_PLACES=cores
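
A quick sanity check (a minimal sketch) that mpi4py picked up Cray MPICH and that Devito is importable from the virtual environment:

# Confirm mpi4py was built against Cray MPICH
python3 -c "from mpi4py import MPI; print(MPI.Get_library_version())"
# Confirm Devito imports and report its version
python3 -c "import devito; print(devito.__version__)"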

Example script:

#!/bin/bash

# Slurm job options (job-name, compute nodes, job time)
#SBATCH --job-name=Example_MPI_Job
#SBATCH --time=0:20:0
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=16

# Replace [budget code] below with your project code (e.g. t01)
#SBATCH --account=[budget code] 
#SBATCH --partition=standard
#SBATCH --qos=standard


# Set the number of threads to 16 and specify placement
#   There are 16 OpenMP threads per MPI process
#   We want one thread per physical core
export OMP_NUM_THREADS=16
export OMP_PLACES=cores

# Launch the parallel job
#   Using 32 MPI processes
#   8 MPI processes per node
#   16 OpenMP threads per MPI process
#   Additional srun options to pin one thread per physical core
srun --hint=nomultithread --distribution=block:block ./my_mixed_executable.x arg1 arg2
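
To adapt the script above for Devito, the tail of the job script would load the same modules, activate the virtual environment and pass the Devito environment variables to `srun`. This is a sketch: the /work paths and the `devito-env` name are the placeholders used earlier, and the acoustic example stands in for your own script.

# Sketch: Devito-specific tail of the batch script above
module load cray-python cray-mpich
source /work/project-number/project-number/username/devito-env/bin/activate
cd /work/project-number/project-number/username/devito

export OMP_NUM_THREADS=16
export OMP_PLACES=cores

DEVITO_MPI=1 DEVITO_ARCH=cray DEVITO_LANGUAGE=openmp DEVITO_LOGGING=DEBUG srun --distribution=block:block --hint=nomultithread python examples/seismic/acoustic/acoustic_example.py -d 1024 1024 1024 --tn 512 -so 12 -a aggressive

Submit the script from the work file system, e.g. `sbatch my_devito_job.slurm` (the file name is arbitrary).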

# Request an interactive allocation (replace d011 with your own budget code)
salloc --nodes=2 --ntasks-per-node=8 --cpus-per-task=16 --time=01:00:00 --partition=standard --qos=standard --account=d011

# Once the interactive job starts, run across all allocated nodes (2 nodes x 8 ranks = 16 MPI ranks)
OMP_NUM_THREADS=16 DEVITO_MPI=1 DEVITO_ARCH=cray DEVITO_LANGUAGE=openmp DEVITO_LOGGING=DEBUG srun --distribution=block:block --hint=nomultithread python examples/seismic/acoustic/acoustic_example.py -d 1024 1024 1024 --tn 512 -so 12 -a aggressive

# 1 node: 8 MPI ranks
OMP_NUM_THREADS=16 DEVITO_MPI=1 DEVITO_ARCH=cray DEVITO_LANGUAGE=openmp DEVITO_LOGGING=DEBUG srun -n 8 --distribution=block:block --hint=nomultithread python examples/seismic/acoustic/acoustic_example.py -d 512 512 512 --tn 100

# 2 nodes: 16 MPI ranks
OMP_NUM_THREADS=16 DEVITO_MPI=1 DEVITO_ARCH=cray DEVITO_LANGUAGE=openmp DEVITO_LOGGING=DEBUG srun -n 16 --distribution=block:block --hint=nomultithread python examples/seismic/acoustic/acoustic_example.py -d 512 512 512 --tn 512 -so 8

Remember to enable autotuning, it is very important for performance.

Note: autotuning may lead to performance variance from run to run, since the selected block shapes are not necessarily the standard ones and may differ between runs.
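
In the runs above autotuning is requested with the example's `-a aggressive` flag. A sketch of setting it through the environment instead, assuming the `DEVITO_AUTOTUNING` variable maps to Devito's `autotuning` configuration option (check the Devito documentation):

# Sketch: request aggressive autotuning via the environment instead of the `-a` flag
export DEVITO_AUTOTUNING=aggressive
OMP_NUM_THREADS=16 DEVITO_MPI=1 DEVITO_ARCH=cray DEVITO_LANGUAGE=openmp DEVITO_LOGGING=DEBUG srun -n 16 --distribution=block:block --hint=nomultithread python examples/seismic/acoustic/acoustic_example.py -d 512 512 512 --tn 512 -so 8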

For interactive jobs:

https://docs.archer2.ac.uk/user-guide/scheduler/#interactive-jobs

Notes: the following environment variables may help MPI and network performance on ARCHER2 (see the tuning guide linked above); export them in the job script before the srun line:

export FI_OFI_RXM_SAR_LIMIT=524288
export FI_OFI_RXM_BUFFER_SIZE=131072
export MPICH_SMP_SINGLE_COPY_SIZE=16384