
The MPI we use in the distributed tests is not CUDA-aware #3897

Open · simone-silvestri opened this issue Nov 4, 2024 · 12 comments · May be fixed by #3880

simone-silvestri (Collaborator)

Somewhere between this build
https://buildkite.com/clima/oceananigans-distributed/builds/3113#01917ace-fe81-401d-ba21-467037e6aead
and main, we switched from using libmpitrampoline.so in the distributed tests to libmpi.so downloaded from the artifacts.

Previously, the MPI trampoline was loading a CUDA-aware Open MPI implementation, while the libmpi.so we use now is an MPICH build that is not CUDA-aware:
https://buildkite.com/clima/oceananigans-distributed/builds/4227#0192f70a-b947-4d38-bd1c-c2497a964de9

This makes our GPU distributed tests fail.
I am wondering where this switch happened, because I couldn't trace any changes to the code. @Sbozzolo, do you know if something changed in the LocalPreferences.toml on the Caltech cluster?
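
For reference, a quick way to check at runtime which library MPI.jl has actually loaded and whether it reports CUDA support is something like the sketch below (assuming MPI.jl and MPIPreferences are available in the test environment):

```julia
# Minimal sketch: report which MPI binary/ABI MPI.jl selected and whether the
# library claims CUDA-awareness. Assumes MPI.jl and MPIPreferences are installed.
using MPI, MPIPreferences

MPI.Init()

@info "MPI preferences" MPIPreferences.binary MPIPreferences.abi
@info "Library version" MPI.Get_library_version()
@info "Implementation" MPI.identify_implementation()
@info "CUDA-aware?" MPI.has_cuda()

MPI.Finalize()
```

On the failing builds this should print an MPICH library string and `has_cuda() = false`, if the above is right.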

Sbozzolo (Member) commented Nov 4, 2024

Preferences are loaded by the climacommon module, which had a new release over the past few months. That release moved to Julia 1.11; nothing changed with respect to the preferences.

In general, you shouldn't set any preferences when running on the Caltech clusters, because everything is set for you by the module system.
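
(For context, what the module effectively configures is MPIPreferences' system-binary selection. A minimal sketch of doing the same thing by hand, with illustrative values rather than the actual cluster configuration:)

```julia
# Minimal sketch (illustrative values, not the actual Caltech cluster setup):
# point MPI.jl at a system MPI, e.g. a CUDA-aware Open MPI provided by a module,
# by writing MPIPreferences into LocalPreferences.toml.
using MPIPreferences

MPIPreferences.use_system_binary(;
    library_names = ["libmpi"],  # hypothetical name; the module system knows the real library
    mpiexec = "mpiexec",
)
```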

glwagner (Member) commented Nov 4, 2024

This happened: #3838

glwagner (Member) commented Nov 4, 2024

And the CUDA runtime wasn't found in that PR: https://buildkite.com/clima/oceananigans-distributed/builds/4038#0192c76f-d6ea-4e48-a7fd-f1b22df9f89f/189-1063

so we just need to look at the PR before that...

PS @Sbozzolo: we realized there was a problem with the way we ran the tests that allowed the GPU tests to pass even when they didn't actually run on the GPU.
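
(A guard along these lines in the test setup would catch that failure mode. A minimal sketch assuming CUDA.jl; TEST_ARCHITECTURE is a hypothetical variable standing in for however the job selects the GPU group:)

```julia
# Minimal sketch: fail the job outright if a GPU test group is selected but the
# CUDA runtime is not functional, instead of silently falling back to the CPU.
# TEST_ARCHITECTURE is a hypothetical environment variable used for illustration.
using CUDA, Test

if get(ENV, "TEST_ARCHITECTURE", "CPU") == "GPU"
    @test CUDA.functional()
end
```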

glwagner (Member) commented Nov 4, 2024

Ok, I think this is the problematic PR: #3783

glwagner (Member) commented Nov 4, 2024

@Sbozzolo am I reading this right that ClimaAtmos does not (always) use climacommon?

https://github.com/CliMA/ClimaAtmos.jl/blob/a0e8612fd602ff33349e46ed34875ed8af45fd3a/.buildkite/gpu_pipeline/pipeline.yml#L4C55-L4C68

EDIT I suspect these are unused in favor of https://github.com/CliMA/ClimaAtmos.jl/blob/main/.buildkite/pipeline.yml

glwagner (Member) commented Nov 4, 2024

There is something a little odd in that we are using 2024_10_09:

modules: climacommon/2024_10_09

But ClimaAtmos is on 2024_10_08, if I am reading this right:

https://github.com/CliMA/ClimaAtmos.jl/blob/a0e8612fd602ff33349e46ed34875ed8af45fd3a/.buildkite/pipeline.yml#L4

simone-silvestri (Collaborator, Author)

I can try regenerating the Manifest with Julia 1.11 in #3880 to see if it makes a difference.
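
(Roughly, the idea would be to resolve a fresh Manifest under a 1.11 binary. A minimal sketch, assuming it is run from the package root with Julia 1.11:)

```julia
# Minimal sketch (run from the package root with a Julia 1.11 binary):
# drop the old Manifest and let Pkg resolve a new one, so artifact selection
# (including the MPI-related JLLs) is redone for this Julia version.
using Pkg

Pkg.activate(".")
rm("Manifest.toml"; force = true)
Pkg.instantiate()  # resolves dependencies and writes a fresh Manifest.toml
```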

glwagner (Member) commented Nov 5, 2024

We don't support 1.11 yet though, so it's not a long-term solution...

simone-silvestri (Collaborator, Author)

Hmmm, OK, I guess we probably have to revert to a previous version of climacommon.

Sbozzolo (Member) commented Nov 5, 2024

> @Sbozzolo am I reading this right that ClimaAtmos does not (always) use climacommon?
>
> https://github.com/CliMA/ClimaAtmos.jl/blob/a0e8612fd602ff33349e46ed34875ed8af45fd3a/.buildkite/gpu_pipeline/pipeline.yml#L4C55-L4C68
>
> EDIT I suspect these are unused in favor of https://github.com/CliMA/ClimaAtmos.jl/blob/main/.buildkite/pipeline.yml

That's a different machine.

2024_10_09 is Julia 1.11, 2024_10_08 is Julia 1.10. Some of our repos moved to 1.11; ClimaAtmos is still on 1.10 due to some issues in ClimaCore.

If you don't support 1.11, you should stay on _10_08.

simone-silvestri (Collaborator, Author)

I think we are also hitting this problem: JuliaParallel/MPI.jl#715, because it looks like the MPIPreferences are correctly loaded at the

julia -O0 --project -e 'using Pkg; Pkg.instantiate()'

step, but then a completely different MPI is loaded in the

julia -O0 --project -e 'using Pkg; Pkg.test()'

step.
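
(One workaround sometimes used for that MPI.jl issue is sketched below, under the assumption that the root cause is the sandboxed test environment created by Pkg.test() not seeing the project's LocalPreferences.toml. The copy step and paths are illustrative, not what Oceananigans currently does:)

```julia
# Minimal sketch of a possible workaround (assuming Pkg.test()'s sandboxed
# environment does not pick up the project's LocalPreferences.toml): copy the
# preferences next to test/Project.toml so the same MPI binary is selected there.
# MPIPreferences must also be a dependency of the test project for this to matter.
prefs      = joinpath(pwd(), "LocalPreferences.toml")        # assumes we run from the package root
test_prefs = joinpath(pwd(), "test", "LocalPreferences.toml")

isfile(prefs) && cp(prefs, test_prefs; force = true)
```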

glwagner (Member) commented Nov 6, 2024

> I think we are also hitting this problem: JuliaParallel/MPI.jl#715, because it looks like the MPIPreferences are correctly loaded at the
>
> julia -O0 --project -e 'using Pkg; Pkg.instantiate()'
>
> step, but then a completely different MPI is loaded in the
>
> julia -O0 --project -e 'using Pkg; Pkg.test()'
>
> step.

Nice observation
