
Investigate causes of poor scaling in multi-GPU runs #2222

Open · simonbyrne opened this issue Oct 9, 2023 · 12 comments

@simonbyrne (Member)

It seems to be driven primarily by GC. We need to look at memory allocations, and at a mechanism to synchronize the garbage collector across ranks.

cf. the earlier discussion in #686.
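One possible shape for such a mechanism is sketched below; this is only a minimal sketch assuming an explicit timestep loop, and `do_timestep!`, `model`, `nsteps`, and `gc_interval` are hypothetical placeholders. The idea is to suppress automatic collections and trigger a collection at the same point on every rank, so no rank pauses for GC while its neighbours are waiting in MPI calls.

```julia
# Sketch of rank-synchronized GC (not the actual ClimaAtmos implementation).
# `do_timestep!`, `model`, `nsteps`, and `gc_interval` are hypothetical placeholders.
using MPI

MPI.Init()
comm = MPI.COMM_WORLD

GC.enable(false)                  # suppress automatic collections during the run
for step in 1:nsteps
    do_timestep!(model)
    if step % gc_interval == 0
        MPI.Barrier(comm)         # all ranks reach this point before anyone collects
        GC.enable(true)
        GC.gc(false)              # incremental collection, at the same time on every rank
        GC.enable(false)
    end
end
GC.enable(true)
```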

@simonbyrne
Copy link
Member Author

Looks like there were a couple of issues:

  1. I didn't request an extra CPU core for the profiler.
  2. I didn't request enough memory, so GC was getting triggered more often.

Fixing those, and specifying a higher GC frequency, fixes the GC pauses:
https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/111#018b3c1f-a277-4683-b446-13423f7ea108

However, we're still getting stuck in epoll_wait before the MPI communication starts:

[screenshot: profiler timeline showing the epoll_wait stall]

This appears to be JuliaGPU/CUDA.jl#1910. It is fixed in CUDA.jl 5 (JuliaGPU/CUDA.jl#2025), but unfortunately we can't upgrade yet (CliMA/ClimaCore.jl#1500).

@simonbyrne (Member Author) commented Jan 11, 2024

Update: we are still seeing some cases where CPU cores are idle, which causes 3–6 ms delays.

Current plan (discussed with @sriharshakandala and @bloops):

  • See if we have the same issue on the HPC cluster?
  • Try disabling GC altogether?
  • Investigate thread pinning with ThreadPinning.jl (see the sketch after this list)
  • Split hyperdiffusion across multiple threads/streams
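For the thread-pinning item, a starting point might look like the sketch below; the `:cores` strategy is just an assumption to benchmark against, and ThreadPinning.jl offers several others.

```julia
# Sketch: pin Julia threads to separate physical cores to avoid hyperthread sharing.
using ThreadPinning

pinthreads(:cores)   # one Julia thread per physical core; other strategies include :cputhreads and :numa
threadinfo()         # print the resulting thread-to-core mapping for verification
```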

@simonbyrne (Member Author)

I tried using JULIA_EXCLUSIVE=1, but it gave worse results. My suspicion is that this is due to hyperthreads; I will need to investigate further.

@charleskawczynski (Member)

Is there a reproducer for this? Or a job ID that we can add for reference?

@charleskawczynski (Member)

We use blocking by default now: https://github.com/search?q=repo%3ACliMA%2FClimaCore.jl%20blocking&type=code.

@charleskawczynski (Member)

The reproducer is on the GPU target pipeline.

@simonbyrne (Member Author)

Okay, so here is what I've learnt:

On the Slurm side

On the Julia side

@simonbyrne (Member Author)

One other opportunity for improvement:

Our current DSS operation looks something like this:

  1. launch fill send buffer kernels
  2. CUDA.synchronize()
  3. MPI.Startall(...)
  4. launch internal dss kernels
  5. MPI.Waitall(...)
  6. launch exterior kernels

The problem is that the GPU is completely idle during step 3, and during the launch latency of step 4:

[screenshot: profiler timeline showing the GPU idle during MPI.Startall and the kernel launch]
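In code, the current pattern is roughly the following. This is a structural sketch only: `fill_send_buffer!`, `dss_internal!`, `dss_exterior!`, the buffers, and the persistent requests `graph_reqs` are hypothetical stand-ins for the actual ClimaCore kernels and MPI requests.

```julia
# Sketch of the current (blocking) ordering; kernel and request names are placeholders.
using CUDA, MPI

fill_send_buffer!(send_buf)    # 1. launch fill send buffer kernels
CUDA.synchronize()             # 2. block until *everything* on the stream is done
MPI.Startall(graph_reqs)       # 3. start MPI communication (GPU idle here)
dss_internal!(data)            # 4. launch internal dss kernels (GPU idle during launch latency)
MPI.Waitall(graph_reqs)        # 5. wait for communication to finish
dss_exterior!(data, recv_buf)  # 6. launch exterior kernels
```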

Instead of synchronizing the whole stream, we could synchronize via events:

  1. launch fill send buffer kernels
  2. CUDA.record(send_event)
  3. launch internal dss kernels
  4. CUDA.synchronize(send_event)
  5. MPI.Startall(...)
  6. MPI.Waitall(...)
  7. launch exterior kernels

In this way, the internal dss kernels can run during MPI communication.
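Sketched in code, with the same hypothetical placeholders as above; the only CUDA.jl pieces relied on are `CuEvent`, `CUDA.record`, and `CUDA.synchronize(::CuEvent)`.

```julia
# Sketch of the event-based ordering; kernel and request names are placeholders.
using CUDA, MPI

send_event = CuEvent()

fill_send_buffer!(send_buf)    # 1. launch fill send buffer kernels
CUDA.record(send_event)        # 2. record an event after the fill kernels
dss_internal!(data)            # 3. launch internal dss kernels on the same stream
CUDA.synchronize(send_event)   # 4. CPU waits only for the fill kernels, not step 3
MPI.Startall(graph_reqs)       # 5. start MPI communication; step 3 keeps the GPU busy
MPI.Waitall(graph_reqs)        # 6. wait for communication to finish
dss_exterior!(data, recv_buf)  # 7. launch exterior kernels
```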

@simonbyrne (Member Author) commented Jan 26, 2024

Oh, and it also appears that thread pinning on clima is a net negative. It causes occasional ~10–20 ms pauses when the OS thread scheduler kicks in:
[screenshot: profiler timeline showing the ~10–20 ms scheduler pauses]

On the other hand, as long as we use Slurm thread binding (but not process-level thread pinning) with a sufficient number of threads (in this case, 4 hardware threads assigned to 3 Julia threads), we do see occasional very short (~20 µs) pauses, but the thread then switches to a new hardware thread, with very little net effect (notice the change in color):
[screenshot: profiler timeline showing a short pause followed by a switch to a new hardware thread]

I've updated our GPU pipeline in #2585.

@charleskawczynski (Member)

CliMA/ClimaTimeSteppers.jl#260 should help with scaling by reducing the number of DSS calls (we'll be eliminating 4 per timestep).

@charleskawczynski (Member)

Upgrading CUDA and Adapt, plus JuliaGPU/Adapt.jl#78, will reduce allocations for GPU runs by a factor of ~15, which may help reduce GC pressure. We should be able to reduce the frequency of GC calls after this update.
