
Investigate causes of poor scaling in multi-GPU runs #2222

Open · simonbyrne opened this issue Oct 9, 2023 · 12 comments

@simonbyrne (Member)

It seems to be driven primarily by GC. We need to look at memory allocations, and at a mechanism to synchronize the garbage collector across ranks.

cf. the earlier discussion in #686.
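One possible shape for such a mechanism is sketched below; this is only a minimal sketch assuming an explicit timestep loop, and `do_timestep!`, `model`, `nsteps`, and `gc_interval` are hypothetical placeholders. The idea is to suppress automatic collections and trigger a collection at the same point on every rank, so no rank pauses for GC while its neighbours are waiting in MPI calls.

```julia
# Sketch of rank-synchronized GC (not the actual ClimaAtmos implementation).
# `do_timestep!`, `model`, `nsteps`, and `gc_interval` are hypothetical placeholders.
using MPI

MPI.Init()
comm = MPI.COMM_WORLD

GC.enable(false)                  # suppress automatic collections during the run
for step in 1:nsteps
    do_timestep!(model)
    if step % gc_interval == 0
        MPI.Barrier(comm)         # all ranks reach this point before anyone collects
        GC.enable(true)
        GC.gc(false)              # incremental collection, at the same time on every rank
        GC.enable(false)
    end
end
GC.enable(true)
```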

@simonbyrne
Copy link
Member Author

Looks like there were a couple of issues:

  1. I didn't request an extra CPU core for the profiler.
  2. I didn't request enough memory, so GC was getting triggered more often.

Fixing those, and specifying a higher GC frequency, fixes the GC pauses:
https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/111#018b3c1f-a277-4683-b446-13423f7ea108

However, we're still getting stuck in epoll_wait before the MPI communication starts:

[screenshot: profiler timeline showing the epoll_wait stall]

This appears to be JuliaGPU/CUDA.jl#1910. It is fixed in CUDA.jl 5 (JuliaGPU/CUDA.jl#2025), but unfortunately we can't upgrade yet (CliMA/ClimaCore.jl#1500).

@simonbyrne (Member Author) commented Jan 11, 2024

Update: we are still seeing some cases where CPU cores are idle, which causes 3–6 ms delays.

Current plan (discussed with @sriharshakandala and @bloops):

  • See if we have the same issue on the HPC cluster?
  • Try disabling GC altogether?
  • Investigate thread pinning with ThreadPinning.jl (see the sketch after this list)
  • Split hyperdiffusion across multiple threads/streams
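For the thread-pinning item, a starting point might look like the sketch below; the `:cores` strategy is just an assumption to benchmark against, and ThreadPinning.jl offers several others.

```julia
# Sketch: pin Julia threads to separate physical cores to avoid hyperthread sharing.
using ThreadPinning

pinthreads(:cores)   # one Julia thread per physical core; other strategies include :cputhreads and :numa
threadinfo()         # print the resulting thread-to-core mapping for verification
```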

@simonbyrne (Member Author)

I tried using JULIA_EXCLUSIVE=1, but it gave worse results. My suspicion is that this is due to hyperthreads; I will need to investigate further.

@charleskawczynski (Member)

Is there a reproducer for this? Or a job ID that we can add for reference?

@charleskawczynski (Member)

We use blocking by default now: https://github.com/search?q=repo%3ACliMA%2FClimaCore.jl%20blocking&type=code.

@charleskawczynski (Member)

The reproducer is on the GPU target pipeline.

@simonbyrne (Member Author)

Okay, so here is what I've learnt:

On the Slurm side

On the Julia side

@simonbyrne (Member Author)

One other opportunity for improvement:

Our current DSS operation looks something like this:

  1. launch fill send buffer kernels
  2. CUDA.synchronize()
  3. MPI.Startall(...)
  4. launch internal dss kernels
  5. MPI.Waitall(...)
  6. launch exterior kernels

The problem is that the GPU is completely idle during step 3, and during the launch latency of step 4:

[screenshot: profiler timeline showing the GPU idle during MPI.Startall and the kernel launch]
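In code, the current pattern is roughly the following. This is a structural sketch only: `fill_send_buffer!`, `dss_internal!`, `dss_exterior!`, the buffers, and the persistent requests `graph_reqs` are hypothetical stand-ins for the actual ClimaCore kernels and MPI requests.

```julia
# Sketch of the current (blocking) ordering; kernel and request names are placeholders.
using CUDA, MPI

fill_send_buffer!(send_buf)    # 1. launch fill send buffer kernels
CUDA.synchronize()             # 2. block until *everything* on the stream is done
MPI.Startall(graph_reqs)       # 3. start MPI communication (GPU idle here)
dss_internal!(data)            # 4. launch internal dss kernels (GPU idle during launch latency)
MPI.Waitall(graph_reqs)        # 5. wait for communication to finish
dss_exterior!(data, recv_buf)  # 6. launch exterior kernels
```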

Instead of synchronizing the whole stream, we could synchronize via events:

  1. launch fill send buffer kernels
  2. CUDA.record(send_event)
  3. launch internal dss kernels
  4. CUDA.synchronize(send_event)
  5. MPI.Startall(...)
  6. MPI.Waitall(...)
  7. launch exterior kernels

In this way, the internal dss kernels can run during MPI communication.
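Sketched in code, with the same hypothetical placeholders as above; the only CUDA.jl pieces relied on are `CuEvent`, `CUDA.record`, and `CUDA.synchronize(::CuEvent)`.

```julia
# Sketch of the event-based ordering; kernel and request names are placeholders.
using CUDA, MPI

send_event = CuEvent()

fill_send_buffer!(send_buf)    # 1. launch fill send buffer kernels
CUDA.record(send_event)        # 2. record an event after the fill kernels
dss_internal!(data)            # 3. launch internal dss kernels on the same stream
CUDA.synchronize(send_event)   # 4. CPU waits only for the fill kernels, not step 3
MPI.Startall(graph_reqs)       # 5. start MPI communication; step 3 keeps the GPU busy
MPI.Waitall(graph_reqs)        # 6. wait for communication to finish
dss_exterior!(data, recv_buf)  # 7. launch exterior kernels
```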

@simonbyrne (Member Author) commented Jan 26, 2024

Oh, and it also appears that thread pinning on clima is a net negative. It causes occasional ~10–20 ms pauses when the OS thread scheduler kicks in:
[screenshot: profiler timeline showing the ~10–20 ms scheduler pauses]

On the other hand, as long as we use Slurm thread binding (but not process-level thread pinning) with a sufficient number of threads (in this case, 4 hardware threads assigned to 3 Julia threads), we do see occasional very short (~20 µs) pauses, but the thread then switches to a new hardware thread, with very little net effect (notice the change in color):
[screenshot: profiler timeline showing a short pause followed by a switch to a new hardware thread]

I've updated our GPU pipeline in #2585.

@charleskawczynski (Member)

CliMA/ClimaTimeSteppers.jl#260 should help with scaling by reducing the number of DSS calls (we'll be eliminating 4 per timestep).

@charleskawczynski (Member)

Upgrading CUDA and Adapt, plus JuliaGPU/Adapt.jl#78, will reduce allocations for GPU runs by a factor of ~15, which may help reduce GC pressure. We should be able to reduce the frequency of GC calls after this update.
