Can we make Oceananigans + MPI less painful? #2345
Replies: 6 comments 2 replies
-
A note on …: I can make it work on …
-
Thanks for opening the discussion, Greg! A few thoughts:
My two cents: it would be great to have MPI parallelism 'operational'. I'm not saying anyone should make this a priority over what they are already working on, since there is clearly a lot of great work going on and many directions to pursue. Perhaps we can help contribute to the distributed code (although it will take us some time to learn the code before I/we can contribute in a meaningful way).
-
I'm still getting an error when I run the distributed nonhydrostatic benchmark and test scripts (but not the distributed shallow water ones). I'll create an issue and post the output there so that we can keep this discussion general.
-
Agreed! I don't think all that much more work is needed for reasonable CPU parallelism either (mostly because of the awesomeness of …). Worth noting too that the work we're doing to implement a buffered communication abstraction for #2253, and to "fuse" the halo-filling kernels / fuse communication via #2335, might also make distributed models more performant (we'll see).
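For readers unfamiliar with the idea, here is an illustrative sketch of buffered halo communication with `MPI.jl`; this is not Oceananigans' internal implementation, and the function name, halo width, ranks, and tags are all assumptions. The point is just that packing a halo region into a contiguous buffer lets each neighbor exchange happen as a single dense-array send/receive.

```julia
# Illustrative sketch only (not Oceananigans' actual implementation):
# pack a halo slab into a contiguous buffer so the neighbor exchange is
# one dense send/receive. Calls follow MPI.jl's positional v0.19-style API.
using MPI

function exchange_x_halos!(field::Array, west_rank, east_rank, comm; Hx = 1)
    # Pack the western interior slab into a contiguous send buffer
    # (slicing an Array copies, so `sendbuf` is dense).
    sendbuf = field[Hx+1:2Hx, :, :]
    recvbuf = similar(sendbuf)

    sreq = MPI.Isend(sendbuf, west_rank, 0, comm)    # send west
    rreq = MPI.Irecv!(recvbuf, east_rank, 0, comm)   # receive from east
    MPI.Waitall!([sreq, rreq])

    # Unpack the received slab into the eastern halo.
    # (A full exchange also sends east / receives from west; omitted here.)
    field[end-Hx+1:end, :, :] .= recvbuf
    return nothing
end
```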
-
@johnryantaylor, as @francispoulin and @glwagner note, there is some good scalability hiding in there! We don't have many good regression tests to check for things that might interfere, or examples to follow. If you are happy to share a couple of possibly useful setups, we could build reference cases on them: configurations we maintain tests against, plus end-to-end MPI setup examples.
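As a concrete starting point for such a reference case, here is a hedged sketch of what an end-to-end distributed run might look like. The constructor and module names follow recent Oceananigans releases (the distributed API has been renamed across versions), and the rank count, resolution, and domain are placeholders.

```julia
# Hypothetical end-to-end MPI reference setup of the kind proposed here.
# Launch with e.g.:  mpiexec -n 4 julia --project reference_setup.jl
# API names follow recent Oceananigans releases and may differ in older ones.
using MPI, Oceananigans
using Oceananigans.DistributedComputations: Distributed, Partition

MPI.Initialized() || MPI.Init()

arch = Distributed(CPU(); partition = Partition(x = 4))  # 4 ranks along x

grid = RectilinearGrid(arch;
                       size = (256, 256, 32),      # placeholder resolution
                       extent = (1e5, 1e5, 1e3))   # placeholder domain (m)

model = NonhydrostaticModel(; grid, advection = WENO())

simulation = Simulation(model; Δt = 60, stop_iteration = 100)
run!(simulation)
```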
-
Hi @johnryantaylor, @francispoulin, @glwagner. I share @johnryantaylor's situation: access to a large number of CPUs, but generally memory-limited on GPU for large-scale problems. I (naively) started running simulations using up to 12 threads on a local machine after seeing the weak-scaling results that @francispoulin mentioned above. This week I have been interested in a setup that requires more memory than fits on a Tesla V100, so I did some tests on the Stampede2 supercomputer's ICX nodes (Ice Lake nodes with 2 × 40 cores and 160 hardware threads total). If this works reasonably well, would it be interesting to get some basic weak/strong scaling tests on a "real life" application (including outputting large 3D data sets at regular intervals, for instance)?
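Continuing the hedged reference-setup sketch above, regular 3D output could be attached with one of the standard output writers. The field choice, filename, and interval here are illustrative, and the keyword names follow recent Oceananigans releases (older versions used `prefix` rather than `filename`).

```julia
# Sketch: write full 3D velocity snapshots once per model hour.
# Attaches to the `simulation`/`model` from the reference-setup sketch above.
simulation.output_writers[:velocities] =
    JLD2OutputWriter(model, model.velocities;
                     filename = "velocities.jld2",   # placeholder name
                     schedule = TimeInterval(3600))  # every 3600 s
```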
-
I'm hoping we can have a general conversation about using Oceananigans and Julia with MPI, and maybe glean some tips and tricks from people. Ease of use issues span from

- installing `MPI.jl` on various platforms (often not trivial!), to
- launching and debugging MPI jobs (can `tmpi` help? To use it we have to set up both `tmpi` and `MPI.jl` with the same MPI implementation).

Many of the developers use GPUs for research, which has allowed us to put this issue off for a bit. But distributing across CPUs is an important use case too (note that distributing across GPUs is not performant right now, because `Distributed` does not use the buffered communication needed for performant CUDA-aware communication between GPU devices; a solution to that is in progress at #2253).
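To make the setup pain points above concrete, here is a minimal sketch of one way to keep `MPI.jl` and the launcher (including `tmpi`) on the same MPI implementation. It assumes the `MPIPreferences` mechanism from recent `MPI.jl` versions; the rank count and script name are illustrative.

```julia
# One-time project setup: point MPI.jl at the system MPI library, so the
# same implementation backs both Julia and the external mpiexec/tmpi launcher.
# (MPIPreferences is part of the MPI.jl ecosystem in recent versions.)
using MPIPreferences
MPIPreferences.use_system_binary()   # locates libmpi via the system paths

# A job can then be launched with the matching launcher, e.g.
#   mpiexec -n 4 julia --project my_script.jl
# or interactively, with one tmux pane per rank:
#   tmpi 4 julia --project my_script.jl
```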