Parallel prime generation #60
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
## master #60 +/- ##
=======================================
Coverage 99.37% 99.37%
=======================================
Files 9 9
Lines 1280 1280
=======================================
Hits 1272 1272
Misses 8 8
5x speedup for 8 threads is a pretty good result. I wonder how safe prime generation will behave. Did you try using several […]
I did, and you are correct that the issue there is having a good way of halting the search on other threads. In my impl I did a […]
Finally got around to benchmarking this yesterday:
The number of samples is much lower in this run because finding big safe primes takes a long time, and the code uses all available cores (as opposed to the results in the PR description, which use half the available cores). The median times are between 6 and 8 times faster using 16 cores, and there seems to be an even bigger difference for the slowest/fastest cases (but the sample size is perhaps too small to really tell). I think this is pretty much what I expected. Thoughts?
I think we can deal with #62 in a subsequent PR.
This adds a new function to the public API, named "par_generate_prime_with_rng", that parallelizes the search for primes.
Outstanding tasks:
- Pick the number of threads to use
- `CHANGELOG` entry and better docs

Results
During this work I tried out several different parallel implementations, using `rayon` and `chili`. The latter is a new take on low-overhead parallelism and is attractive for its small code size and advertised performance, but in my tests I have not seen good performance from it (see below).

When searching for random primes we do not know how many iterations it will take, only that we will eventually find one. This shows up in the benchmark results as high variability, as seen in the "fastest" and "slowest" columns below.
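As a rough back-of-the-envelope check on that variability (an assumption-laden estimate, not something measured in this PR): if candidates are drawn independently and uniformly from the odd $n$-bit integers, ignoring any sieving the library may do, the prime number theorem gives a per-candidate success probability $p$, and the number of candidates tested $T$ is geometrically distributed:

```latex
p \approx \frac{2}{n \ln 2},
\qquad
\mathbb{E}[T] = \frac{1}{p} \approx \frac{n \ln 2}{2},
\qquad
\sigma_T = \frac{\sqrt{1-p}}{p} \approx \mathbb{E}[T]
```

For $n = 1024$ this is about 355 candidates on average, and for $n = 4096$ about 1420. The standard deviation is roughly as large as the mean, so a wide gap between the fastest and slowest runs is to be expected.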
Unsurprisingly, the larger the prime we're looking for, the more we benefit from parallelizing the work. For `Uint<64>`s (aka `U4096`), using 8 CPUs is 6x faster; for `Uint<16>` the speedup is more like 3x.

The benchmarks use the `divan` benchmarking library and ran on a MacBook Pro M3 Max, using 8 threads. The table compares different implementations and sizes. The implementation in this PR is labeled "rayon_find_any2":

There are likely ways to do better than what is presented here, but this is a library and not a prime searching application, and there is value in keeping things simple and nimble. A more efficient parallel solution would probably need more code to synchronize threads (channels or shared memory); indeed, `chili` is slower than `rayon` here because the latter has more facilities in place to halt jobs when a prime is found.
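To make the point about halting jobs concrete, here is a minimal std-only sketch of a parallel search that stops all workers once any of them finds a prime. It is not the code in this PR, which builds on `rayon`. The names `par_find_prime`, `is_prime` and `xorshift64` are hypothetical, it works on `u64` rather than `Uint`, and the toy PRNG is not cryptographically secure.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;
use std::thread;

/// Modular multiplication without overflow, via a 128-bit intermediate.
fn mul_mod(a: u64, b: u64, m: u64) -> u64 {
    ((a as u128 * b as u128) % m as u128) as u64
}

/// Modular exponentiation by repeated squaring.
fn pow_mod(mut base: u64, mut exp: u64, m: u64) -> u64 {
    let mut acc = 1u64 % m;
    base %= m;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul_mod(acc, base, m);
        }
        base = mul_mod(base, base, m);
        exp >>= 1;
    }
    acc
}

/// Miller-Rabin with the first 12 primes as bases, which is deterministic
/// for every 64-bit input.
fn is_prime(n: u64) -> bool {
    const BASES: [u64; 12] = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37];
    if n < 2 {
        return false;
    }
    for &p in &BASES {
        if n % p == 0 {
            return n == p;
        }
    }
    // Here n is odd and larger than 37, so n - 1 = d * 2^s with s >= 1.
    let s = (n - 1).trailing_zeros();
    let d = (n - 1) >> s;
    'bases: for &a in &BASES {
        let mut x = pow_mod(a, d, n);
        if x == 1 || x == n - 1 {
            continue;
        }
        for _ in 1..s {
            x = mul_mod(x, x, n);
            if x == n - 1 {
                continue 'bases;
            }
        }
        return false;
    }
    true
}

/// Toy PRNG (xorshift64). NOT cryptographically secure; illustration only.
fn xorshift64(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

/// Searches for a random prime of exactly `bits` bits on `threads` threads.
/// Every worker polls a shared flag and exits as soon as any worker succeeds.
fn par_find_prime(bits: u32, threads: usize, seed: u64) -> u64 {
    assert!((2..=64).contains(&bits) && threads > 0);
    let found = AtomicBool::new(false);
    let result = Mutex::new(None);
    thread::scope(|s| {
        for i in 0..threads {
            let (found, result) = (&found, &result);
            s.spawn(move || {
                // Distinct non-zero PRNG state per thread.
                let mut state = (seed ^ (i as u64 + 1).wrapping_mul(0x9E37_79B9_7F4A_7C15)) | 1;
                while !found.load(Ordering::Relaxed) {
                    // Set the top bit (exact bit length) and the low bit (odd).
                    let c = (xorshift64(&mut state) >> (64 - bits)) | (1u64 << (bits - 1)) | 1;
                    if is_prime(c) {
                        *result.lock().unwrap() = Some(c);
                        found.store(true, Ordering::Relaxed);
                    }
                }
            });
        }
    });
    result.into_inner().unwrap().expect("a worker stores a prime before the flag is set")
}
```

Calling `par_find_prime(32, 8, 42)` would return some 32-bit prime; which one depends on thread scheduling. The stop flag is only checked between candidates, so a worker always finishes its current primality test before it notices that another worker has succeeded. With large `Uint`s each test is expensive, which is part of why a cheap way of interrupting in-flight work matters.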