Using NCAR's Derecho #3669

glwagner · 2024-07-30T16:34:47Z

glwagner
Jul 30, 2024
Maintainer

Overview

NCAR's Derecho supercomputer is housed at the NCAR-Wyoming Supercomputing Center.

Derecho has 82 nodes that each have 128 AMD Milan cores and 4 NVIDIA A100 GPUs. Derecho uses a PBS queuing system.

Note, this post is subject to change. Let's try to keep it up to date, please comment below if something does not work.

Scope

This discussion can cover anything to do with trying to get results from running Oceananigans on Derecho --- including installing Julia, setting up CUDA and MPI, configuring PBS scripts, and using other Julia packages in conjunction with Oceananigans.

Links

Getting started on Derecho with CUDA-Aware MPI

The first task is to download Julia. I opted to manually install a binary in ~/software. Note that I have not tested the following workflow with juliaup or using julia 1.11.1 --- this is a work in progress, so stay tuned. For julia 1.10.6 a binary for derecho can be downloaded by typing

mkdir -p ~/software
wget https://julialang-s3.julialang.org/bin/linux/x64/1.10/julia-1.10.6-linux-x86_64.tar.gz ~/software/
tar xvf ~/software/julia-1.10.6-linux-x86_64.tar.gz

julia can now be launched by typing ~/software/julia-1.10.6/bin/julia. I added the directory with julia to my path by putting the following in~/.bash_profile:

export JULIA_PATH=/glade/u/home/$USER/software/julia-1.10.6/bin
export PATH=$JULIA_PATH:$PATH

I also changed my depot path over to /glade/work/$USER/.julia (by default the depot would reside in $HOME/.julia), which doesn't have as much storage capacity):

export JULIA_DEPOT_PATH=/glade/work/$USER/.julia

Moving the depot into work helps when software downloads big data sets into the depot (like ClimaOcean does).

An example program run with PBS

Next let's test that things work by creating a test project:

mkdir -p ~/TestInterpolate
cd ~/TestInterpolate
touch Project.toml

Next we write some test code that will exercise CUDA-aware MPI communication (the hard thing to get set up):

using Oceananigans
using Oceananigans.Fields: interpolate!
using MPI

# Automatically distributes among available processors
arch = Distributed(GPU())

rank = arch.local_rank
Nranks = MPI.Comm_size(arch.communicator)
println("Hello from process $rank out of $Nranks")

x = y = z = (0, 1)
grid = RectilinearGrid(arch; size=(64, 64, 64), x, y, z, topology=(Periodic, Periodic, Bounded))

@info "The grid on rank $rank:"
@info "$grid"

c = CenterField(grid)
set!(c, (x, y, z) -> x * y^2 * z^3)

@info "c on rank $rank:"
@show c

u = XFaceField(grid)
set!(c, (x, y, z) -> x * y^2 * z^3)
interpolate!(u, c)

@info "u on rank $rank:"
@show u

I put this into a file called ~/TestInterpolate/test_interpolate.jl. Now we're ready to write a bash script that can be submitted to the queue using PBS's qsub. Here's one possible incarnation of such a script:

#!/bin/bash

#PBS -A <YOUR ACCOUNT ID>
#PBS -N test
#PBS -j oe
#PBS -q main
#PBS -l walltime=1:00:00
#PBS -l select=1:ncpus=64:mpiprocs=4:ngpus=4:mem=384GB
#PBS -l gpu_type=a100

# Use moar processes for precompilation to speed things up
export JULIA_NUM_PRECOMPILE_TASKS=64
export JULIA_NUM_THREADS=64

# Load critical modules
module --force purge
module load ncarenv nvhpc cuda cray-mpich

# Utter mystical incantations to perform various miracles
export MPICH_GPU_SUPPORT_ENABLED=1
export JULIA_MPI_HAS_CUDA=true
export PALS_TRANSFER=false
export JULIA_CUDA_MEMORY_POOL=none

# Write down a script that binds MPI processes to GPUs (taken from Derecho documentation)
cat > launch.sh << EoF_s
#! /bin/bash

export MPICH_GPU_SUPPORT_ENABLED=1
export LOCAL_RANK=\${PMI_LOCAL_RANK}
export GLOBAL_RANK=\${PMI_RANK}
export CUDA_VISIBLE_DEVICES=\$(expr \${LOCAL_RANK} % 4)

echo "Global Rank \${GLOBAL_RANK} / Local Rank \${LOCAL_RANK} / CUDA_VISIBLE_DEVICES=\${CUDA_VISIBLE_DEVICES} / \$(hostname)"

exec \$*
EoF_s

chmod +x launch.sh

# Now to make our julia environment work:
# 1. Instantiate (we only need to do this once, but this also may be the first time you are running this code)
julia --project -e 'using Pkg; Pkg.instantiate()'
# 2. Add some packages to the environment that we need to use
julia --project -e 'using Pkg; Pkg.add("MPI"); Pkg.add("MPIPreferences"); Pkg.add("CUDA"); Pkg.add("Oceananigans")'
# 3. Tell MPI that we would like to use the system binary we loaded with module load cray-mpich
julia --project -e 'using MPIPreferences; MPIPreferences.use_system_binary(vendor="cray")'
# 4. Build MPI and CUDA in advance for yucks
julia --project -e 'using MPI; using CUDA; CUDA.precompile_runtime()'

# Finally, let's run this thing
mpiexec -n 4 -ppn 4 ./launch.sh julia --project test_interpolate.jl

Note that you have to put the ID for YOUR Derecho allocation in the above script where it says <YOUR ACCOUNT ID>. (You probably have an email where this is given. I'm trying to figure out how to get it using a PBS command. If you put in a wrong id, you might get an error message that tells you your account ID prefixed by qsub: Invalid account for GPU usage, available accounts:.) I copied the above script into a file called run_derecho_job.sh. Then I submitted it:

qsub run_derecho_job.sh

# output should be something like
6486127.desched1

The current status of the job can be found by typing qstat -u $USER. You can also monitor its progress using the command

watch -n 0.1 qstat -u $USER

Press ctrl-c to exit watch. The output will be piped into a file called test.o*******, where the * are replaced by numbers that represent the job ID.

The first time you launch the job, test.o******* will contain a lot of information about precompilation. It may also contain CUDA errors regarding LD_LIBRARY_PATH (we are trying to figure out if these are an issue or not). The essential part of the output should be at the end of the file and should look something like

┌ Info: MPI implementation identified
│   libmpi = "libmpi_nvidia.so"
│   version_string = "MPI VERSION    : CRAY MPICH version 8.1.29.34 (ANL base 3.4a2)\nMPI BUILD INFO : Tue Feb 20 20:43 2024 (git hash d8ab47f)\n"
│   impl = "CrayMPICH"
│   version = v"8.1.29"
└   abi = "MPICH"
┌ Info: MPIPreferences unchanged
│   binary = "system"
│   libmpi = "libmpi_nvidia.so"
│   abi = "MPICH"
│   mpiexec = "mpiexec"
│   preloads = 1-element Vector{String}: …
└   preloads_env_switch = "MPICH_GPU_SUPPORT_ENABLED"
Global Rank 2 / Local Rank 2 / CUDA_VISIBLE_DEVICES=2 / deg0072
Global Rank 0 / Local Rank 0 / CUDA_VISIBLE_DEVICES=0 / deg0072
Global Rank 1 / Local Rank 1 / CUDA_VISIBLE_DEVICES=1 / deg0072
Global Rank 3 / Local Rank 3 / CUDA_VISIBLE_DEVICES=3 / deg0072
[ Info: Oceananigans will use 64 threads
[ Info: MPI has not been initialized, so we are calling MPI.Init().
[ Info: Oceananigans will use 64 threads
[ Info: MPI has not been initialized, so we are calling MPI.Init().
[ Info: Oceananigans will use 64 threads
[ Info: MPI has not been initialized, so we are calling MPI.Init().
[ Info: Oceananigans will use 64 threads
[ Info: MPI has not been initialized, so we are calling MPI.Init().
Hello from process 0 out of 4Hello from process 2 out of 4

Hello from process 1 out of 4
Hello from process 3 out of 4
[ Info: The grid on rank 2:
[ Info: The grid on rank 0:
[ Info: The grid on rank 1:
[ Info: The grid on rank 3:
┌ Info: 16×64×64 RectilinearGrid{Float64, FullyConnected, Periodic, Bounded} on Distributed{GPU} with 3×3×3 halo
│ ├── FullyConnected x ∈ [0.5, 0.75) regularly spaced with Δx=0.015625
│ ├── Periodic y ∈ [0.0, 1.0)        regularly spaced with Δy=0.015625
└ └── Bounded  z ∈ [0.0, 1.0]        regularly spaced with Δz=0.015625
┌ Info: 16×64×64 RectilinearGrid{Float64, FullyConnected, Periodic, Bounded} on Distributed{GPU} with 3×3×3 halo
│ ├── FullyConnected x ∈ [0.75, 1.0) regularly spaced with Δx=0.015625
│ ├── Periodic y ∈ [0.0, 1.0)        regularly spaced with Δy=0.015625
└ └── Bounded  z ∈ [0.0, 1.0]        regularly spaced with Δz=0.015625
┌ Info: 16×64×64 RectilinearGrid{Float64, FullyConnected, Periodic, Bounded} on Distributed{GPU} with 3×3×3 halo
│ ├── FullyConnected x ∈ [0.0, 0.25) regularly spaced with Δx=0.015625
│ ├── Periodic y ∈ [0.0, 1.0)        regularly spaced with Δy=0.015625
└ └── Bounded  z ∈ [0.0, 1.0]        regularly spaced with Δz=0.015625
┌ Info: 16×64×64 RectilinearGrid{Float64, FullyConnected, Periodic, Bounded} on Distributed{GPU} with 3×3×3 halo
│ ├── FullyConnected x ∈ [0.25, 0.5) regularly spaced with Δx=0.015625
│ ├── Periodic y ∈ [0.0, 1.0)        regularly spaced with Δy=0.015625
└ └── Bounded  z ∈ [0.0, 1.0]        regularly spaced with Δz=0.015625
[ Info: c on rank 2:
[ Info: c on rank 0:
[ Info: c on rank 3:
[ Info: c on rank 1:
c = 16×64×64 Field{Center, Center, Center} on RectilinearGrid on Distributed{GPU}
├── grid: 16×64×64 RectilinearGrid{Float64, FullyConnected, Periodic, Bounded} on Distributed{GPU} with 3×3×3 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: DistributedCommunication, east: DistributedCommunication, south: Periodic, north: Periodic, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 22×70×70 OffsetArray(::CUDA.CuArray{Float64, 3, CUDA.DeviceMemory}, -2:19, -2:67, -2:67) with eltype Float64 with indices -2:19×-2:67×-2:67
    └── max=0.713645, min=1.47793e-11, mean=0.0520738
c = 16×64×64 Field{Center, Center, Center} on RectilinearGrid on Distributed{GPU}
├── grid: 16×64×64 RectilinearGrid{Float64, FullyConnected, Periodic, Bounded} on Distributed{GPU} with 3×3×3 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: DistributedCommunication, east: DistributedCommunication, south: Periodic, north: Periodic, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 22×70×70 OffsetArray(::CUDA.CuArray{Float64, 3, CUDA.DeviceMemory}, -2:19, -2:67, -2:67) with eltype Float64 with indices -2:19×-2:67×-2:67
    └── max=0.954031, min=2.20552e-11, mean=0.0729033
c = 16×64×64 Field{Center, Center, Center} on RectilinearGrid on Distributed{GPU}
├── grid: 16×64×64 RectilinearGrid{Float64, FullyConnected, Periodic, Bounded} on Distributed{GPU} with 3×3×3 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: DistributedCommunication, east: DistributedCommunication, south: Periodic, north: Periodic, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 22×70×70 OffsetArray(::CUDA.CuArray{Float64, 3, CUDA.DeviceMemory}, -2:19, -2:67, -2:67) with eltype Float64 with indices -2:19×-2:67×-2:67
    └── max=0.232874, min=2.27374e-13, mean=0.0104148
c = 16×64×64 Field{Center, Center, Center} on RectilinearGrid on Distributed{GPU}
├── grid: 16×64×64 RectilinearGrid{Float64, FullyConnected, Periodic, Bounded} on Distributed{GPU} with 3×3×3 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: DistributedCommunication, east: DistributedCommunication, south: Periodic, north: Periodic, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 22×70×70 OffsetArray(::CUDA.CuArray{Float64, 3, CUDA.DeviceMemory}, -2:19, -2:67, -2:67) with eltype Float64 with indices -2:19×-2:67×-2:67
    └── max=0.473259, min=7.50333e-12, mean=0.0312443
[ Info: u on rank 0:
[ Info: u on rank 1:
[ Info: u on rank 2:
[ Info: u on rank 3:
u = 16×64×64 Field{Face, Center, Center} on RectilinearGrid on Distributed{GPU}
├── grid: 16×64×64 RectilinearGrid{Float64, FullyConnected, Periodic, Bounded} on Distributed{GPU} with 3×3×3 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: DistributedCommunication, east: DistributedCommunication, south: Periodic, north: Periodic, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 22×70×70 OffsetArray(::CUDA.CuArray{Float64, 3, CUDA.DeviceMemory}, -2:19, -2:67, -2:67) with eltype Float64 with indices -2:19×-2:67×-2:67
    └── max=0.225362, min=1.13687e-13, mean=0.00978418
u = 16×64×64 Field{Face, Center, Center} on RectilinearGrid on Distributed{GPU}
├── grid: 16×64×64 RectilinearGrid{Float64, FullyConnected, Periodic, Bounded} on Distributed{GPU} with 3×3×3 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: DistributedCommunication, east: DistributedCommunication, south: Periodic, north: Periodic, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 22×70×70 OffsetArray(::CUDA.CuArray{Float64, 3, CUDA.DeviceMemory}, -2:19, -2:67, -2:67) with eltype Float64 with indices -2:19×-2:67×-2:67
    └── max=0.465747, min=3.75167e-12, mean=0.0299628
u = 16×64×64 Field{Face, Center, Center} on RectilinearGrid on Distributed{GPU}
├── grid: 16×64×64 RectilinearGrid{Float64, FullyConnected, Periodic, Bounded} on Distributed{GPU} with 3×3×3 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: DistributedCommunication, east: DistributedCommunication, south: Periodic, north: Periodic, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 22×70×70 OffsetArray(::CUDA.CuArray{Float64, 3, CUDA.DeviceMemory}, -2:19, -2:67, -2:67) with eltype Float64 with indices -2:19×-2:67×-2:67
    └── max=0.706133, min=7.38964e-12, mean=0.0501414
u = 16×64×64 Field{Face, Center, Center} on RectilinearGrid on Distributed{GPU}
├── grid: 16×64×64 RectilinearGrid{Float64, FullyConnected, Periodic, Bounded} on Distributed{GPU} with 3×3×3 halo
├── boundary conditions: FieldBoundaryConditions
│   └── west: DistributedCommunication, east: DistributedCommunication, south: Periodic, north: Periodic, bottom: ZeroFlux, top: ZeroFlux, immersed: ZeroFlux
└── data: 22×70×70 OffsetArray(::CUDA.CuArray{Float64, 3, CUDA.DeviceMemory}, -2:19, -2:67, -2:67) with eltype Float64 with indices -2:19×-2:67×-2:67
    └── max=0.946519, min=1.10276e-11, mean=0.07032

glwagner · 2024-07-30T17:47:09Z

glwagner
Jul 30, 2024
Maintainer Author

@loganpknudsen has already compiled a bit of useful information over at #3655. I think that @tomchor and @simone-silvestri are also using Derecho.

0 replies

glwagner · 2024-10-30T15:25:31Z

glwagner
Oct 30, 2024
Maintainer Author

Want to put this here:

help?> MPI.has_cuda
  MPI.has_cuda()

  Check if the MPI implementation is known to have CUDA support. Currently only Open MPI provides a mechanism to check, so it will return false with other implementations (unless overridden). For "IBMSpectrumMPI" it will
  return true.

  This can be overridden by setting the JULIA_MPI_HAS_CUDA environment variable to true or false.

  │ Note
  │
  │  For OpenMPI or OpenMPI-based implementations you first need to call Init().

  See also MPI.has_rocm for ROCm support.

0 replies

glwagner · 2024-10-31T00:25:43Z

glwagner
Oct 31, 2024
Maintainer Author

It could make sense to build a package that codifies and even automates the process of setting up julia and Oceananigans on Derecho. What do others think about that?

0 replies

simone-silvestri · 2024-10-31T09:04:38Z

simone-silvestri
Oct 31, 2024
Maintainer

Thanks, @glwagner, this is super useful.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using NCAR's Derecho #3669

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 4 comments

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Using NCAR's Derecho #3669

glwagner Jul 30, 2024 Maintainer

Overview

Scope

Links

Getting started on Derecho with CUDA-Aware MPI

An example program run with PBS

Replies: 4 comments

glwagner Jul 30, 2024 Maintainer Author

glwagner Oct 30, 2024 Maintainer Author

glwagner Oct 31, 2024 Maintainer Author

simone-silvestri Oct 31, 2024 Maintainer

glwagner
Jul 30, 2024
Maintainer

glwagner
Jul 30, 2024
Maintainer Author

glwagner
Oct 30, 2024
Maintainer Author

glwagner
Oct 31, 2024
Maintainer Author

simone-silvestri
Oct 31, 2024
Maintainer