How to improve neighborhood search #65
Currently, the NHS update is a bottleneck of the simulation (see #37).
On one thread it's reasonably fast, taking less than 10% of the runtime. However, the update process is not parallelized, so on 24 threads, up to 50% of the total simulation time is spent in the NHS update.
I've spent the last few days trying to make either the initialization or the update scale with the number of threads.
Unfortunately, I failed and couldn't even find an implementation that's faster on 24 threads than the current serial implementation.
I ultimately ended up with #64, a 7-line change that makes the serial update faster when only a few particles move between cells.
The problem is that I can't build the hash table data structure in parallel. This is a `Dict{NTuple{NDIMS, Int}, Vector{Int}}`, which for each cell (referenced by its cell coordinates, e.g. `(10, 13)`) contains a vector of all particle IDs in this cell.

The initialization process basically looks like this:
We can't make this loop threaded, since `append!` is not thread-safe. I tried the following things to parallelize this code:

1. Looping over the cells instead of the particles (with `@threaded`), so that each cell is filled by only one thread and we don't `append!` to the same vector from multiple threads. I couldn't even find a way to get a unique list of cells to loop over in the time in which the serial code above terminates.
2. Building one hash table per thread and merging them with `mergewith!` (sketched below). With this, I end up being only slightly faster than the serial code.
3. Looking for a library that does a parallel `reduce!`, but I couldn't find anything useful.
4. Manually writing a parallel reduce (sketched below). This ends up being about as slow as `mergewith!` above. Right now, the only fast way I found to iterate over only one hash table is with generators and `Iterators.flatten` to avoid allocations; the current code looks like the iteration sketch below.
5. Sorting the particles by cell index to avoid storing a list of particles altogether, and just storing the starting index for each cell (sketched below). This seems to work only for large numbers of particles (like O(1M)), since the sorting process alone takes more time than my serial implementation above for O(10k) particles. The only multithreaded sorting function I found for Julia is in the ThreadsX.jl package, and it only becomes faster for arrays of size O(100k), with a huge overhead for O(10k) particles.
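A minimal sketch of option 2, reusing `cell_coords` and `NDIMS` from the initialization sketch above, and using the stock `Threads.@threads` in place of Pixie's `@threaded`:

```julia
using Base.Threads

# Option 2 (sketch): fill one hash table per thread, then merge serially.
function build_per_thread(coordinates, n_particles)
    tables = [Dict{NTuple{NDIMS, Int}, Vector{Int}}() for _ in 1:nthreads()]

    # `:static` scheduling so that `threadid()` is stable within an iteration
    @threads :static for particle in 1:n_particles
        table = tables[threadid()]
        cell = cell_coords(coordinates, particle)
        append!(get!(table, cell, Int[]), particle)
    end

    # The serial merge concatenates the per-cell particle lists; it is this
    # step that eats up most of the speedup in practice.
    return reduce(mergewith!(vcat), tables)
end
```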
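A sketch of the manual parallel reduce from option 4: a pairwise (tree) merge of the per-thread tables, where each level of the tree merges disjoint pairs of tables concurrently.

```julia
# Option 4 (sketch): pairwise parallel reduce over the per-thread hash tables
function parallel_merge(tables)
    while length(tables) > 1
        n_pairs = length(tables) ÷ 2
        merged = Vector{eltype(tables)}(undef, n_pairs)

        # Merge disjoint pairs of tables in parallel
        @threads for i in 1:n_pairs
            merged[i] = mergewith!(vcat, tables[2i - 1], tables[2i])
        end

        # Carry over the leftover table if the count is odd
        isodd(length(tables)) && push!(merged, tables[end])

        tables = merged
    end

    return only(tables)
end
```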
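A sketch of the generator-based iteration: the particle lists of all cells around a given cell are chained lazily with `Iterators.flatten`, so no combined array is allocated. The function name `eachneighbor` is illustrative.

```julia
const EMPTY_CELL = Int[]

# Lazily iterate over all particles in the 3^NDIMS cells around `cell`
function eachneighbor(cell, hashtable)
    # Generator of all neighboring cell coordinates (including `cell` itself)
    neighboring_cells = (cell .+ offset for offset in
                         Iterators.product(ntuple(_ -> -1:1, length(cell))...))

    # Chain the per-cell particle lists without allocating a combined vector
    return Iterators.flatten(get(hashtable, neighbor, EMPTY_CELL)
                             for neighbor in neighboring_cells)
end

# Usage:
# for neighbor in eachneighbor(cell_coords(coordinates, particle), hashtable)
#     # compute distances, interactions, ...
# end
```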
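A sketch of option 5, in the style of a counting sort: particle IDs are sorted by cell, and for each cell only the start of its range in the sorted array is stored. `linear_cell_index`, mapping cell coordinates to `1:n_cells`, is an assumed helper.

```julia
# Option 5 (sketch): sort particle IDs by cell and store only range starts
function build_sorted_cell_list(coordinates, n_particles, n_cells)
    cell_of = [linear_cell_index(cell_coords(coordinates, particle))
               for particle in 1:n_particles]

    # This sort dominates the cost for O(10k) particles; a multithreaded sort
    # like ThreadsX.sort! only pays off for O(100k) particles and more.
    particle_ids = sortperm(cell_of)

    # Count particles per cell, then prefix-sum into range starts:
    # cell c holds particle_ids[cell_start[c]:(cell_start[c + 1] - 1)]
    counts = zeros(Int, n_cells)
    for c in cell_of
        counts[c] += 1
    end

    cell_start = Vector{Int}(undef, n_cells + 1)
    cell_start[1] = 1
    for c in 1:n_cells
        cell_start[c + 1] = cell_start[c] + counts[c]
    end

    return particle_ids, cell_start
end
```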
For O(1M) particles, it will probably make sense to go with option 5, but we will need MPI or a GPU implementation to be able to afford such simulations, so right now, we won't benefit from option 5 at all.
For our usual simulation sizes, I am completely out of ideas. Suggestions welcome.
Following up on point 4: I implemented the same with Pixie, and I got this (also only showing the first 10 steps):

Here's a quick benchmark for different numbers of particles and threads.

It looks like CompactNSearch is scaling pretty nicely up to 16 threads (using OpenMP), but is still a lot slower than my serial implementation, surprisingly even for O(1M) particles. So I guess our current NHS implementation is actually pretty good, even for larger simulations. It seems like there is absolutely no need to implement anything new until we get to the point where we want to run Pixie on clusters.

I'll close this issue for now, but we may want to reopen it when we start with a distributed-memory implementation.