
How to improve neighborhood search #65

Closed · efaulhaber opened this issue Feb 15, 2023 · 2 comments

@efaulhaber (Member) commented Feb 15, 2023

Currently, the NHS update is a bottleneck of the simulation (see #37).
On one thread, it's reasonably fast and takes less than 10% of the runtime. However, the update process is not parallelized, so on 24 threads, up to 50% of the total simulation time is spent in the NHS update.

I've spent the last few days trying to make either the initialization or the update scale with the number of threads.
Unfortunately, I failed and couldn't even find an implementation that's faster on 24 threads than the current serial implementation.
I ultimately ended up with #64, a 7-line change that makes the serial update faster when only a few particles move between cells.

The problem is that I can't build the hash table data structure in parallel. It is a Dict{NTuple{NDIMS, Int}, Vector{Int}}, which for each cell (referenced by its cell coordinates, e.g. (10, 13)) contains a vector of all particle IDs in this cell.
The initialization process basically looks like this:

empty!(hashtable)

for particle in eachparticle(container)
    cell_coords = get_cell_coords(particle)

    # Add particle to corresponding cell or create cell if it does not exist
    if haskey(hashtable, cell_coords)
        append!(hashtable[cell_coords], particle)
    else
        hashtable[cell_coords] = [particle]
    end
end
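
For reference, `get_cell_coords` is not shown above; a minimal sketch of what such a mapping could look like (the function name, arguments, and the use of the search radius as cell size are assumptions for illustration, not the actual Pixie code):

```julia
# Hypothetical sketch: map a particle position to integer cell coordinates
# on a uniform grid whose cell size equals the search radius.
get_cell_coords(position, search_radius) = Tuple(floor.(Int, position ./ search_radius))

# Example: a 2D particle at (0.25, 0.61) with search radius 0.1 lies in cell (2, 6).
get_cell_coords([0.25, 0.61], 0.1)  # -> (2, 6)
```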

We can't make this loop threaded, since append! is not thread-safe.
I tried the following things to parallelize this code:

  1. Loop over the cells (with @threaded), so that each cell is filled by only one thread and we never append! to the same vector from multiple threads. I couldn't even compute a unique list of cells to loop over within the time the serial code above needs to finish.
  2. Fill a separate hash table for each thread, and then merge them into one. This basically looks like this:
    @threaded for particle in eachparticle(container)
        threadlocal = hashtables[Threads.threadid()]
    
        # Get cell index of the particle's cell
        cell_coords = get_cell_coords(particle)  
    
        # Add particle to corresponding cell or create cell if it does not exist
        if haskey(threadlocal, cell_coords)
            append!(threadlocal[cell_coords], particle)
        else
            threadlocal[cell_coords] = [particle]
        end
    end
    
    mergewith!(append!, hashtable, hashtables...)
    Here, the loop is more than 10x faster than the serial code, but with the serial mergewith!, I end up being only slightly faster than the serial code.
    I tried to find a library that provides a parallel reduction, but I couldn't find anything useful, so I tried to write a parallel (tree) reduction by hand:
    # Tree reduction: merge pairs of per-thread hash tables in log2(nthreads) rounds
    # (this assumes Threads.nthreads() is a power of two)
    n = Int(log2(Threads.nthreads()))
    for k in n-1:-1:0
        @threaded for i in 1:2^k
            mergewith!(append!, hashtables[i], hashtables[2^k + i])
        end
    end
    But this is not any faster than the single mergewith! above.
  3. Since the parallel loop above scales well, I tried to follow a suggestion by @ranocha to keep thread-local hash tables throughout the simulation. Unfortunately, I could not find a way to efficiently loop over the neighbors afterwards.
    Right now, the only fast way I've found to iterate over a single hash table is with generators and Iterators.flatten, to avoid allocations. The code currently looks like this:
    # Generator of all neighboring cells to consider
    neighboring_cells = ((x + i, y + j) for i in -1:1, j in -1:1)
    
    # Merge all lists of particles in the neighboring cells into one iterator
    Iterators.flatten(particles_in_cell(cell, neighborhood_search)
                      for cell in neighboring_cells)
    I couldn't find a way to make that work reasonably fast with multiple hash tables.
  4. This is the algorithm that originally inspired my implementation: https://github.com/InteractiveComputerGraphics/CompactNSearch
    They sort the particles by cell index, which avoids storing a list of particles per cell altogether; only the starting index of each cell has to be stored (see the sketch after this list). This seems to pay off only for large numbers of particles (on the order of 1M), since the sorting alone already takes more time than my serial implementation above for O(10k) particles. The only multithreaded sorting function I found for Julia is from the ThreadsX.jl package, but it only becomes faster for arrays of size O(100k) and has a huge overhead for O(10k) particles.
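
For reference, here is a minimal sketch of the sort-based layout from point 4, as I understand the CompactNSearch idea (the function and argument names are assumptions for illustration, not their actual code):

```julia
# Hypothetical sketch of a sort-based cell list (CompactNSearch-style layout).
# `cell_index` maps each particle to a linear cell index in 1:n_cells.
function build_cell_list(cell_index::Vector{Int}, n_cells::Int)
    # Sort particle IDs by cell index; this is the expensive step for small systems
    order = sortperm(cell_index)

    # Count particles per cell and build prefix sums, so that the particles of
    # cell `c` are `order[cell_start[c]:(cell_start[c + 1] - 1)]`
    counts = zeros(Int, n_cells + 1)
    for c in cell_index
        counts[c + 1] += 1
    end
    cell_start = cumsum(counts) .+ 1

    return order, cell_start
end
```

The per-cell particle lists are then implicit in `order`, so an update is essentially one sort plus one counting pass.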

For O(1M) particles, it will probably make sense to go with option 4, but we will need MPI or a GPU implementation to be able to afford simulations of that size, so right now we won't benefit from option 4 at all.
For our usual simulation sizes, I am completely out of ideas. Suggestions welcome.

@efaulhaber (Member, Author) commented Feb 22, 2023

Following up on point 4:
I tried to benchmark CompactNSearch, but I don't really know C++, so I can't benchmark it properly. However, it comes with a demo example that initializes a spherical point cloud and then iteratively advects the coordinates, printing the time taken to update the NHS in each step.
For 360 particles, the output looks like this (hiding 40 of the 50 steps):

#Points                                = 360
Search radius                          = 0.0666667
Min x                                  = 0.233333
Max x                                  = 0.466667
Average number of neighbors            = 23.3778
Average index distance prior to z-sort = 118.817
Average index distance after z-sort    = 120.23
Moving points:
Enright step 0. Neighborhood search took 4ms
Enright step 1. Neighborhood search took 3ms
Enright step 2. Neighborhood search took 3ms
Enright step 3. Neighborhood search took 3ms
Enright step 4. Neighborhood search took 3ms
Enright step 5. Neighborhood search took 3ms
Enright step 6. Neighborhood search took 3ms
Enright step 7. Neighborhood search took 4ms
Enright step 8. Neighborhood search took 4ms
Enright step 9. Neighborhood search took 4ms

I implemented the same test with Pixie and got the following (also only showing the first 10 steps):

Step 1. Neighborhood search took 35.281 μs
Step 2. Neighborhood search took 6.930 μs
Step 3. Neighborhood search took 5.671 μs
Step 4. Neighborhood search took 7.500 μs
Step 5. Neighborhood search took 8.520 μs
Step 6. Neighborhood search took 9.480 μs
Step 7. Neighborhood search took 9.251 μs
Step 8. Neighborhood search took 11.981 μs
Step 9. Neighborhood search took 14.940 μs
Step 10. Neighborhood search took 15.150 μs
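
(For context, the Pixie timings above measure only the NHS update in each step; a hypothetical sketch of such a measurement loop, where `advect!` and `update_nhs!` are placeholders and not actual Pixie functions:)

```julia
# Hypothetical timing loop: advect the particles, then time only the NHS update.
for step in 1:10
    advect!(coordinates, dt)                    # move particles (Enright field)
    t = @elapsed update_nhs!(nhs, coordinates)  # re-bin particles into cells
    println("Step $step. Neighborhood search took ", round(t * 1e6, digits=3), " μs")
end
```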

Here's a quick benchmark for different numbers of particles and threads.

| #particles | CompactNSearch (single thread) | CompactNSearch (16 threads) | CompactNSearch (128 threads) | Pixie |
|-----------:|-------------------------------:|----------------------------:|-----------------------------:|------:|
| 360        | 3.5 ms | 3.5 ms | 3.5 ms | 15 μs |
| 61432      | 1.2 s  | 150 ms | 92 ms  | 6 ms  |
| 507376     | 13 s   | 1.6 s  | 1 s    | 130 ms |
| 4125288    | 143 s  | 22 s   | 15.5 s | 1.36 s |

It looks like CompactNSearch is scaling pretty nicely up to 16 threads (using OpenMP), but is still a lot slower than my serial implementation, surprisingly even for O(1M) particles.

So I guess our current NHS implementation is actually pretty good, even for larger simulations. It seems like there is absolutely no need to implement anything new until we get to the point where we want to run Pixie on clusters.

@efaulhaber (Member, Author)

I'll close this issue for now, but we may want to reopen it when we start with a distributed-memory implementation.
