Investigate causes of poor scaling in multi-GPU runs #2222
Looks like there were a couple of issues:
Fixing those, and specifying a higher GC frequency, fixes the GC pauses. However, we're still getting stuck in one spot; this appears to be JuliaGPU/CUDA.jl#1910. That is fixed in CUDA.jl 5 (JuliaGPU/CUDA.jl#2025), but unfortunately we can't upgrade yet (CliMA/ClimaCore.jl#1500).
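For reference, a rough sketch of what "higher GC frequency" amounts to in practice: force small collections on a fixed step interval instead of letting Julia decide, so pauses are frequent but short. The interval and callback hook below are illustrative, not the actual ClimaAtmos option:

```julia
# Illustrative sketch: call the GC on a fixed schedule so collections stay
# small and predictable. The step hook is a placeholder for however the model
# invokes per-timestep callbacks.
const GC_EVERY_N_STEPS = 100

function maybe_gc!(step::Integer)
    if step % GC_EVERY_N_STEPS == 0
        GC.gc(false)   # incremental collection; GC.gc(true) would force a full sweep
    end
end
```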
Update: we are still seeing some cases where CPU cores are idle, which causes 3-6 ms delays. A plan was discussed with @sriharshakandala and @bloops.
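As an aside, one generic way to make such idle gaps visible on an Nsight Systems timeline is to wrap the suspect phases of a step in NVTX ranges via NVTX.jl. This is only a sketch; the phase functions and range names are placeholders, not the actual ClimaAtmos step:

```julia
using NVTX

# Generic sketch: annotate the phases of a timestep so CPU-side idle gaps
# between them show up as empty space on the Nsight Systems timeline.
# `compute_tendencies!`, `apply_dss!`, and `update_state!` are placeholders.
function timestep!(state)
    NVTX.@range "tendencies" compute_tendencies!(state)
    NVTX.@range "dss"        apply_dss!(state)
    NVTX.@range "update"     update_state!(state)
end
```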
I tried using
Is there a reproducer for this? Or a job ID that we can add for reference?
We use the blocking synchronization mode by default now: https://github.com/search?q=repo%3ACliMA%2FClimaCore.jl%20blocking&type=code.
The reproducer is on the GPU target pipeline.
Okay, so here is what I've learnt:

On the Slurm side:

On the Julia side:
Oh, and it also appears that thread pinning at the process level does not help here. On the other hand, as long as we use Slurm thread binding (but not process thread pinning) with a sufficient number of threads (in this case, 4 hardware threads assigned to 3 Julia threads), we do see occasional very short (20 µs) pauses, but the work then switches to a new hardware thread, with very little net effect (visible as a change in color in the profiler trace). I've updated our GPU pipeline in #2585.
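For checking which hardware thread each Julia thread actually ends up on (e.g., to see the kind of migration described above), here is a small Linux-only diagnostic sketch using `sched_getcpu`; it is just an aid for comparing against Slurm's CPU binding, not something the pipeline itself uses:

```julia
# Linux-only diagnostic: report the OS CPU each Julia thread is currently
# scheduled on, to compare against the CPUs Slurm bound the job to.
function report_thread_affinity()
    cpus = zeros(Int, Threads.nthreads())
    Threads.@threads :static for i in 1:Threads.nthreads()
        cpus[Threads.threadid()] = ccall(:sched_getcpu, Cint, ())
    end
    for (tid, cpu) in enumerate(cpus)
        println("julia thread $tid -> hardware thread $cpu")
    end
end

report_thread_affinity()
```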
CliMA/ClimaTimeSteppers.jl#260 should help with scaling by reducing the number of DSS calls (we'll be eliminating 4 per timestep).
Upgrading CUDA and Adapt, plus JuliaGPU/Adapt.jl#78, will reduce allocations for GPU runs by a factor of ~15, which may help reduce GC pressure. We should be able to reduce the frequency of GC calls after this update.
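To quantify that, per-step allocations can be tracked on both the host and device side. A rough sketch, where `step!` and `state` stand in for the model's actual step function and state:

```julia
using CUDA

# Rough sketch: measure host-side allocations (which drive Julia's GC) and
# device-side allocations for a single step. `step!` and `state` are placeholders.
host_bytes = @allocated step!(state)
println("host allocations per step: $(host_bytes) bytes")

# CUDA.@time additionally reports GPU allocations and time spent in GC.
CUDA.@time step!(state)
```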
The poor scaling seems to be driven primarily by GC. We need to look at memory allocations, and at a mechanism to synchronize the garbage collector across processes.
cf. earlier discussion in #686.
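A hedged sketch of what a synchronized GC mechanism could look like with MPI.jl: each rank votes based on its own allocation pressure, the votes are combined collectively, and all ranks then collect at the same point in the timestep so no rank pauses while the others wait on it. The threshold and function name are illustrative:

```julia
using MPI

# Illustrative sketch: trigger GC on all ranks together. Each rank votes based
# on its live heap size; the votes are combined with an Allreduce so every rank
# reaches the same decision and collects in the same step.
function maybe_collect!(comm::MPI.Comm; threshold_bytes = 4 * 2^30)
    want_gc = Int(Base.gc_live_bytes() > threshold_bytes)
    total = MPI.Allreduce(want_gc, +, comm)
    if total > 0
        GC.gc()        # all ranks collect together
    end
    return total > 0
end
```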