
How to improve neighborhood search #65

Closed · efaulhaber opened this issue Feb 15, 2023 · 2 comments

@efaulhaber (Member) commented Feb 15, 2023

Currently, the NHS update is a bottleneck of the simulation (see #37).
On one thread, it's reasonably fast and takes less than 10% of the runtime. However, the update process is not parallelized, so on 24 threads, up to 50% of the total simulation time is spent in the NHS update.

I've spent the last few days trying to make either the initialization or the update scale with the number of threads.
Unfortunately, I failed and couldn't even find an implementation that's faster on 24 threads than the current serial implementation.
I ultimately ended up with #64, a 7-line change that makes the serial update faster when only a few particles move between cells.

The problem is that I can't build the hash table data structure in parallel. It is a Dict{NTuple{NDIMS, Int}, Vector{Int}}, which for each cell (referenced by its cell coordinates, e.g. (10, 13)) contains a vector of all particle IDs in this cell.
The initialization process basically looks like this:

empty!(hashtable)

for particle in eachparticle(container)
    cell_coords = get_cell_coords(particle)

    # Add particle to corresponding cell or create cell if it does not exist
    if haskey(hashtable, cell_coords)
        append!(hashtable[cell_coords], particle)
    else
        hashtable[cell_coords] = [particle]
    end
end
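
For reference, `get_cell_coords` is not shown above; a minimal sketch of what such a mapping could look like (the function name, arguments, and the use of the search radius as cell size are assumptions for illustration, not the actual Pixie code):

```julia
# Hypothetical sketch: map a particle position to integer cell coordinates
# on a uniform grid whose cell size equals the search radius.
get_cell_coords(position, search_radius) = Tuple(floor.(Int, position ./ search_radius))

# Example: a 2D particle at (0.25, 0.61) with search radius 0.1 lies in cell (2, 6).
get_cell_coords([0.25, 0.61], 0.1)  # -> (2, 6)
```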

We can't make this loop threaded, since append! is not thread-safe.
I tried the following things to parallelize this code:

  1. Loop over the cells (with @threaded), so that each cell is filled by only one thread and we never append! to the same vector from multiple threads. I couldn't even compute a unique list of cells to loop over within the time the serial code above needs to finish.
  2. Fill a separate hash table for each thread, and then merge them into one. This basically looks like this:
    @threaded for particle in eachparticle(container)
        threadlocal = hashtables[Threads.threadid()]
    
        # Get cell index of the particle's cell
        cell_coords = get_cell_coords(particle)  
    
        # Add particle to corresponding cell or create cell if it does not exist
        if haskey(threadlocal, cell_coords)
            append!(threadlocal[cell_coords], particle)
        else
            threadlocal[cell_coords] = [particle]
        end
    end
    
    mergewith!(append!, hashtable, hashtables...)
    Here, the loop is more than 10x faster than the serial code, but with the serial mergewith!, I end up being only slightly faster than the serial code.
    I tried to find a library that provides a parallel reduction, but I couldn't find anything useful, so I tried to write a parallel (tree) reduction by hand:
    # Tree reduction: merge pairs of per-thread hash tables in log2(nthreads) rounds
    # (this assumes Threads.nthreads() is a power of two)
    n = Int(log2(Threads.nthreads()))
    for k in n-1:-1:0
        @threaded for i in 1:2^k
            mergewith!(append!, hashtables[i], hashtables[2^k + i])
        end
    end
    But this is not any faster than the single mergewith! above.
  3. Since the parallel loop above scales well, I tried to follow a suggestion by @ranocha to keep thread-local hash tables throughout the simulation. Unfortunately, I could not find a way to efficiently loop over the neighbors afterwards.
    Right now, the only fast way I've found to iterate over a single hash table is with generators and Iterators.flatten, to avoid allocations. The code currently looks like this:
    # Generator of all neighboring cells to consider
    neighboring_cells = ((x + i, y + j) for i in -1:1, j in -1:1)
    
    # Merge all lists of particles in the neighboring cells into one iterator
    Iterators.flatten(particles_in_cell(cell, neighborhood_search)
                      for cell in neighboring_cells)
    I couldn't find a way to make that work reasonably fast with multiple hash tables.
  4. This is the algorithm that originally inspired my implementation: https://github.com/InteractiveComputerGraphics/CompactNSearch
    They sort the particles by cell index, which avoids storing a list of particles per cell altogether; only the starting index of each cell has to be stored (see the sketch after this list). This seems to pay off only for large numbers of particles (on the order of 1M), since the sorting alone already takes more time than my serial implementation above for O(10k) particles. The only multithreaded sorting function I found for Julia is from the ThreadsX.jl package, but it only becomes faster for arrays of size O(100k) and has a huge overhead for O(10k) particles.
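
For reference, here is a minimal sketch of the sort-based layout from point 4, as I understand the CompactNSearch idea (the function and argument names are assumptions for illustration, not their actual code):

```julia
# Hypothetical sketch of a sort-based cell list (CompactNSearch-style layout).
# `cell_index` maps each particle to a linear cell index in 1:n_cells.
function build_cell_list(cell_index::Vector{Int}, n_cells::Int)
    # Sort particle IDs by cell index; this is the expensive step for small systems
    order = sortperm(cell_index)

    # Count particles per cell and build prefix sums, so that the particles of
    # cell `c` are `order[cell_start[c]:(cell_start[c + 1] - 1)]`
    counts = zeros(Int, n_cells + 1)
    for c in cell_index
        counts[c + 1] += 1
    end
    cell_start = cumsum(counts) .+ 1

    return order, cell_start
end
```

The per-cell particle lists are then implicit in `order`, so an update is essentially one sort plus one counting pass.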

For O(1M) particles, it will probably make sense to go with option 4, but we will need MPI or a GPU implementation to be able to afford simulations of that size, so right now we won't benefit from option 4 at all.
For our usual simulation sizes, I am completely out of ideas. Suggestions welcome.

@efaulhaber (Member, Author) commented Feb 22, 2023

Following up on point 4:
I tried to benchmark CompactNSearch, but I don't really know C++, so I can't benchmark it properly. However, it comes with a demo example that initializes a spherical point cloud and then iteratively advects the coordinates, printing the time taken to update the NHS in each step.
For 360 particles, the output looks like this (hiding 40 of the 50 steps):

#Points                                = 360
Search radius                          = 0.0666667
Min x                                  = 0.233333
Max x                                  = 0.466667
Average number of neighbors            = 23.3778
Average index distance prior to z-sort = 118.817
Average index distance after z-sort    = 120.23
Moving points:
Enright step 0. Neighborhood search took 4ms
Enright step 1. Neighborhood search took 3ms
Enright step 2. Neighborhood search took 3ms
Enright step 3. Neighborhood search took 3ms
Enright step 4. Neighborhood search took 3ms
Enright step 5. Neighborhood search took 3ms
Enright step 6. Neighborhood search took 3ms
Enright step 7. Neighborhood search took 4ms
Enright step 8. Neighborhood search took 4ms
Enright step 9. Neighborhood search took 4ms

I implemented the same test with Pixie and got the following (also only showing the first 10 steps):

Step 1. Neighborhood search took 35.281 μs
Step 2. Neighborhood search took 6.930 μs
Step 3. Neighborhood search took 5.671 μs
Step 4. Neighborhood search took 7.500 μs
Step 5. Neighborhood search took 8.520 μs
Step 6. Neighborhood search took 9.480 μs
Step 7. Neighborhood search took 9.251 μs
Step 8. Neighborhood search took 11.981 μs
Step 9. Neighborhood search took 14.940 μs
Step 10. Neighborhood search took 15.150 μs
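
(For context, the Pixie timings above measure only the NHS update in each step; a hypothetical sketch of such a measurement loop, where `advect!` and `update_nhs!` are placeholders and not actual Pixie functions:)

```julia
# Hypothetical timing loop: advect the particles, then time only the NHS update.
for step in 1:10
    advect!(coordinates, dt)                    # move particles (Enright field)
    t = @elapsed update_nhs!(nhs, coordinates)  # re-bin particles into cells
    println("Step $step. Neighborhood search took ", round(t * 1e6, digits=3), " μs")
end
```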

Here's a quick benchmark for different numbers of particles and threads.

| #particles | CompactNSearch (single thread) | CompactNSearch (16 threads) | CompactNSearch (128 threads) | Pixie |
|-----------:|-------------------------------:|----------------------------:|-----------------------------:|------:|
| 360        | 3.5 ms | 3.5 ms | 3.5 ms | 15 μs |
| 61432      | 1.2 s  | 150 ms | 92 ms  | 6 ms  |
| 507376     | 13 s   | 1.6 s  | 1 s    | 130 ms |
| 4125288    | 143 s  | 22 s   | 15.5 s | 1.36 s |

It looks like CompactNSearch is scaling pretty nicely up to 16 threads (using OpenMP), but is still a lot slower than my serial implementation, surprisingly even for O(1M) particles.

So I guess our current NHS implementation is actually pretty good, even for larger simulations. It seems like there is absolutely no need to implement anything new until we get to the point where we want to run Pixie on clusters.

@efaulhaber (Member, Author)

I'll close this issue for now, but we may want to reopen it when we start with a distributed-memory implementation.
