Replies: 27 comments
-
Would be good to do some profiling (probably with a system profiler like perf) to understand where time is spent. The kernels using KernelAbstractions are automatically multi-threaded.
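A minimal sketch of what "automatically multi-threaded" means here, using the event-based KernelAbstractions API from around this time (the kernel and names are illustrative): the CPU backend partitions the `ndrange` across Julia's threads, so the process must be started with `julia --threads N`.

```julia
using KernelAbstractions

# Trivial kernel: double every element. On the CPU backend the
# iteration space is split across Julia's threads automatically.
@kernel function mul2!(a)
    i = @index(Global)
    a[i] *= 2
end

a = ones(1024)
kernel! = mul2!(CPU(), 64)               # CPU backend, workgroup size 64
event = kernel!(a, ndrange = length(a))  # asynchronous launch; returns an event
wait(event)                              # block until the kernel completes
```

Profiling such a run with perf is then the usual `perf record -g julia --threads=N script.jl` followed by `perf report`.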
-
To fill in a few more details for @hennyg888 --- almost all multithreading in Oceananigans is achieved via KernelAbstractions. More specifically, all tendency evaluations, non-communicative / non-periodic halo fills (periodic halo filling uses Base broadcasting and thus is not parallelized), and integrals (like the hydrostatic pressure integral, or the vertical velocity computation) are launched through Oceananigans.jl/src/Utils/kernel_launching.jl (lines 71 to 90 at 6e39d3f). There, the line `event = loop!(args...; dependencies=dependencies, kwargs...)` launches a kernel using KernelAbstractions. So either we can improve multithreading by changing what happens when that kernel is launched, or the improvement has to come from KernelAbstractions itself.
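For context, the quoted launch line follows KernelAbstractions' event-based model; here is a hedged sketch of the same shape (the kernel body, `loop!`, and the chained launches are illustrative, not Oceananigans' actual code):

```julia
using KernelAbstractions

# Stand-in for a tendency kernel: G holds the tendency of u.
@kernel function compute_tendency!(G, u)
    i = @index(Global)
    G[i] = -u[i]
end

u = rand(256 * 256)
G = similar(u)

loop! = compute_tendency!(CPU(), 256)      # build the kernel for the CPU backend
event = loop!(G, u; ndrange = length(u))   # launch asynchronously; returns an event
event = loop!(u, G; ndrange = length(u),
              dependencies = event)        # a second launch ordered after the first
wait(event)                                # synchronize before reading the results
```

Chaining launches through `dependencies` is how successive kernels are ordered without a global barrier, so both the per-launch overhead and the scheduling of those dependencies are places where threaded performance can be won or lost.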
-
@hennyg888 do you have the same problems using MPI instead of multi-threading, and on the same CPU (Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz)?
-
For MPI I ran it on up to 128 Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz CPUs, with efficiencies at around 80%. I think I have some results for MPI weak and strong scaling benchmarks posted at the bottom of #1722.
-
Thanks everyone for your feedback. @vchuravy, great to know that multi-threading is built in! I agree that profiling would be a good way to determine why we are not getting great efficiency. I have not used perf but we can look into it. Also, do you know of benchmarking others have done using KernelAbstractions?
-
@hennyg888 and @francispoulin the results in #1722 look like they may be for an Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz (and a different Julia version, 1.6.0 vs 1.6.1)? Not sure how precise we want to be about what we compare with what, but it could be informative to have comparisons where only one thing is changed at a time, if that is possible --- i.e. all runs on the Intel(R) Xeon(R) Platinum 8260 CPU with the same problem, with only threading vs MPI differing? We could also compare across CPUs and across Julia versions, but not all at the same time?
-
Not sure why we have Julia v1.6.1, but I think we should be able to redo the results with Julia 1.6.0, since that's what we have on the servers. When we do runs over hundreds of CPUs, I don't know that we will get CPUs that are all the same. Unfortunately, I don't see an easy fix for that.
-
@francispoulin (and @hennyg888) no worries. We can use what we have too. I think both these tests (#1861 and #1722) are on a single CPU (just lots of cores)?
-
Sorry, I was thinking of the MPI tests (since that's what I'm looking at for the slides right now). I agree that for one CPU vs one GPU, it would be nice to use the same CPU and GPU in the different tests. I know we can specify the GPU type in the SLURM script. Maybe we can do the same for the CPU?
-
@francispoulin and @hennyg888 do you think a metric of "number of points per second" would be useful? In general that would be `Nx * Ny * Nz * Nt / t_bench`. That could be a way to compare 1 GPU with 128 CPU cores on the same model but with different problem sizes?
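As a concrete version of that metric (a sketch; the function and variable names are placeholders):

```julia
# Throughput: grid points advanced per wall-clock second.
# Nx, Ny, Nz: grid dimensions; Nt: number of time steps; t_bench: wall time (s).
points_per_second(Nx, Ny, Nz, Nt, t_bench) = Nx * Ny * Nz * Nt / t_bench

# Example: a 256³ grid stepped 100 times in 42 seconds
points_per_second(256, 256, 256, 100, 42.0)   # ≈ 4.0e7 points per second
```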
-
I did some benchmarks in the beginning, but mostly focused on strong scaling.
-
Interesting idea @christophernhill. For the last results that @hennyg888 posted in #1722, I did some calculations and found the following.
In an article that @ali-ramadhan referenced on the Slack channel recently --- a paper using a shallow water model in Python --- Roullet and Gaillard (2021) said they were getting 2 TFlops using a thousand cores. We are getting 3 GigaFlops on the GPU and 9 MegaFlops on the CPU. Certainly a very good speedup, since we have O(400) with the GPU. But to answer your question: when @hennyg888 has the data, we can certainly produce these plots easily enough (unless there is a problem that I'm missing).
-
Thanks for the information. Can you point me to where some of these results might be found?
-
@christophernhill @francispoulin I ran the threaded benchmarks up to 32 threads on 32 cores with Julia 1.6.0 and on the same CPUs as the MPI benchmarks used. That makes sense, since they're all benchmarking parallel computing efficiency.
Also, after reviewing the new benchmarks and comparing them to the old benchmarks currently displayed on
-
We have to do more work to compare with Roullet and Gaillard (2021). First of all, there are typos in the paper: sometimes the performance is listed as 2 GFlops, other times as 2 TFlops. Second --- if I understand the situation correctly --- I don't think we've ever measured floating point operations per second. The numbers you've calculated are grid points per second; however, we do many floating point operations per grid point. Roullet and Gaillard (2021) estimate their code performs something like 700-800 Flops per grid point.
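The conversion between the two metrics is simple; a sketch (the 750 Flops-per-point figure is Roullet and Gaillard's estimate for their code, not a measured number for Oceananigans):

```julia
# Estimated Flop rate = (grid points per second) × (Flops per grid point).
flop_rate(points_per_sec; flops_per_point = 750) = points_per_sec * flops_per_point

flop_rate(4.0e7)   # 3.0e10 Flops/s, i.e. 30 GFlops/s, under that assumption
```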
-
@glwagner we could look at using https://github.com/triscale-innov/GFlops.jl at some point. P.S. 84% of CPU peak seems abnormally high; dense matrix/matrix multiply typically maxes out at about 80%.
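If we do try GFlops.jl, usage is roughly as follows (based on its README; note that it counts operations by type interception in Julia code, so it does not see BLAS or other non-Julia calls):

```julia
using GFlops

x = rand(10^6);

@count_ops sum(x)   # tabulates Float64 adds, muls, etc. for one call
@gflops sum($x)     # times the call (BenchmarkTools-style $ interpolation)
                    # and reports an estimated GFlops rate
```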
-
Good points @glwagner. The numbers that I posted are probably best ignored for now. I imagine this should come up in another issue when we are concerned about the efficiency of the calculations in general. Focusing on the threading in this issue seems best.
-
Sounds like a letter to the editor. :-P
-
I put together some utilities for testing multithreading with KernelAbstractions versus Base.Threads for a simple kernel: https://github.com/glwagner/multithreaded-stencils I've used a new repo because it might be worthwhile to test threaded computations in other programming languages.
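For reference, the Base.Threads side of such a comparison looks roughly like this (a sketch, not the repo's actual code):

```julia
using Base.Threads

# 1D three-point Laplacian stencil: du[i] depends on u[i-1], u[i], u[i+1].
function stencil_threads!(du, u)
    @threads for i in 2:length(u)-1
        @inbounds du[i] = u[i-1] - 2u[i] + u[i+1]
    end
    return du
end

u  = rand(512^2)
du = similar(u)
stencil_threads!(du, u)   # start Julia with `julia --threads N` to use N threads
```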
-
Could it be they meant 84% of the memory-bandwidth-limited peak? It isn't crazy to get 84% of memory bandwidth, but that then gives a very low % of peak flops. I haven't read the article, I guess I should!
-
Very nice work @glwagner, and thanks for making this. Lots of good stuff here. In your calculations, you find that there is saturation at 16 threads. I might guess that you have 16 cores on one node? I would think that this should be node dependent. Also, in the table, might it be possible to compute the efficiency as well? I think that's more standard than speedup.
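Efficiency here is just speedup normalized by the thread count; concretely:

```julia
# Parallel efficiency E(n) = S(n) / n, with speedup S(n) = t(1) / t(n).
efficiency(t1, tn, n) = (t1 / tn) / n

efficiency(100.0, 8.0, 16)   # 12.5× speedup on 16 threads → 0.78 (78%)
```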
-
Ah, this machine has 48 cores. Since threading has an overhead cost, we expect saturation at some point. It's surprising that this happens at just 16 cores for such a large problem (512^3), though. We can calculate more metrics for sure. I think it would be worthwhile to investigate whether other threading paradigms scale differently for the same problem. Numba + parallel accelerator might be a good test case. @hennyg888 would you be interested in that? Here are some docs: https://numba.pydata.org/numba-doc/latest/user/parallel.html
-
I agree that I would expect it to saturate at more than 16 threads given 48 cores, but clearly I'm wrong. Getting another benchmark would be a good idea. I'm happy to consider the Numba + parallel idea since that would be a good test of the architecture. This mini-course did give some threaded examples for solving the diffusion equation in 3D. I wonder if we might want to ask Ludovic whether they have done any scalings for multi-threading? I'm happy to discuss this with @hennyg888 on Monday and see what we come up with. Others are welcome to join the discussion if they like.
-
Below is a link to a paper that compares the scalability of multi-threading in Python, Julia, and Chapel. Brief summary: they find that none of them do as well as OpenMP, but they give some reasons why. They do find some improvement going up to 64 threads, though the efficiency in some cases drops to 20%. It seems that Python might do better at low numbers of threads while Julia does better at higher counts. This was last year, so the study should probably be redone. Also, I should mention that I don't believe their problem is like ours, but it's an example and has some pictures, so that's nice to see.
-
You run out of memory bandwidth at some point - usually before you get to saturate all the cores for something like this. I guess we could get even more minimalist and check a multi-threaded STREAM benchmark to see that?
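("STREAM" being McCalpin's memory bandwidth benchmark.) A minimal multi-threaded triad in that spirit, as a sketch (array size and the GB/s arithmetic are illustrative, assuming Float64 elements):

```julia
using Base.Threads

# STREAM "triad": a[i] = b[i] + s * c[i], the classic bandwidth probe.
function triad!(a, b, c, s)
    @threads for i in eachindex(a, b, c)
        @inbounds a[i] = b[i] + s * c[i]
    end
    return a
end

N = 2^26                              # ~0.5 GiB per Float64 array
a, b, c = zeros(N), zeros(N), rand(N)
triad!(a, b, c, 2.0)                  # warm-up / compilation run
t = @elapsed triad!(a, b, c, 2.0)
bytes = 3 * N * sizeof(Float64)       # read b and c, write a
println("≈ ", round(bytes / t / 1e9, digits = 1), " GB/s")
```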
-
I am open to trying whatever simple example you suggest @christophernhill, but I'm not sure what you mean by a stream benchmark. Sorry.
-
Thanks @hennyg888 for the benchmarks!
-
I recently ran some benchmarks on threading for Oceananigans based on scripts added by @francispoulin in an older branch.
https://github.com/CliMA/Oceananigans.jl/blob/fjp/multithreaded-benchmarks/benchmark/weak_scaling_shallow_water_model_threaded.jl
https://github.com/CliMA/Oceananigans.jl/blob/fjp/multithreaded-benchmarks/benchmark/weak_scaling_shallow_water_model_serial.jl
Besides the benchmark scripts themselves, everything else was up to date with the latest version of master.
Here are the results:
They're not terrific, but they're decent. I am running these on 32 CPUs, so what I assume is 1 thread per CPU, up to 32 threads. The slight increase in efficiency going from 2 to 4 threads is likely some flat overhead being overcome by the actual efficiency gain of multithreading.
@christophernhill @glwagner is there anything we can do to improve multithreading efficiency for Oceananigans? It might not be as simple as adding `@threads` in front of the main for loops, but with just a little bit of improvement multithreading efficiency might come to match MPI efficiency. As it is, multithreading is already a worthwhile option for achieving speedups on systems with multiple CPUs but no MPI.
So far I've only run the scripts on one node, with up to 32 threads and CPUs. I'll update this issue with the results of running on multiple nodes, going up to 64 or maybe 128 CPUs, just to see whether efficiency is affected in going from one node to more.
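One caveat worth keeping in mind for that experiment: Julia threads live inside a single process, so runs that span nodes would have to go through MPI or Distributed rather than `--threads`. Either way, a quick sanity check that each run actually got the intended thread count is cheap:

```julia
using Base.Threads

println("Julia threads: ", nthreads())         # set via `julia --threads N` or JULIA_NUM_THREADS
println("Hardware threads: ", Sys.CPU_THREADS)
```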