
Parallel prime generation #60

Merged: 15 commits, Nov 8, 2024
Conversation

@dvdplm (Contributor) commented Oct 21, 2024

This adds a new function to the public API, named "par_generate_prime_with_rng", that parallelizes the search for primes.

Outstanding tasks:

  • Add the corresponding "par_generate_safe_prime_with_rng"
  • Pick the number of threads to use
  • Decide whether it is useful for end users to be able to set the number of threads (and if so, what the API for that should be: generics? another function arg? an env variable?) (7491a06 adds another function arg)
  • Consider a name change (deferred to Rename prime finding methods #62)
  • More and better tests
  • CHANGELOG entry and better docs
  • Write criterion benchmarks
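For the "pick the number of threads" task, the standard library already exposes a sensible runtime default that a caller-overridable argument could fall back to. This is only a sketch of that fallback pattern (the function name `effective_threads` is hypothetical, not part of this PR):

```rust
use std::num::NonZeroUsize;
use std::thread;

// Hypothetical helper: a caller-supplied thread count always wins;
// otherwise fall back to the parallelism the runtime reports, and
// finally to 1 if even that query fails.
fn effective_threads(requested: Option<NonZeroUsize>) -> NonZeroUsize {
    requested
        .or_else(|| thread::available_parallelism().ok())
        .unwrap_or(NonZeroUsize::new(1).unwrap())
}

fn main() {
    // Explicit request takes precedence over the detected default.
    assert_eq!(effective_threads(NonZeroUsize::new(4)).get(), 4);
    println!("default: {} threads", effective_threads(None));
}
```

This keeps the common call site simple (`None` means "use all cores") while still letting benchmarks pin a fixed count.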

Results

During this work I have tried out several different parallel implementations, using rayon and chili. The latter is a new take on low-overhead parallelism and is attractive for its small code size and advertised performance, but in my tests I have not seen good performance from it (see below).

When searching for random primes we do not know how many iterations it will take, only that we will eventually find one. This shows up in the benchmarks as large variability, as seen in the "fastest" and "slowest" columns below.

Unsurprisingly, the larger the prime we're looking for, the more we benefit from parallelizing the work. For Uint<64>s (aka U4096), using 8 CPUs is 6x faster; for Uint<16> the speedup is more like 3x.

The benchmarks use the divan benchmarking library and ran on a MacBook Pro M3 Max, using 8 threads. The table compares different implementations and sizes. The implementation in this PR is labeled "rayon_find_any2":

benches                    fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ benches                               │               │               │               │         │
   ├─ pargen                             │               │               │               │         │
   │  ├─ chili_join2                     │               │               │               │         │
   │  │  ├─ Uint<16>       3.248 ms      │ 54.54 ms      │ 11.3 ms       │ 13.97 ms      │ 250     │ 250
   │  │  ├─ Uint<32>       24.23 ms      │ 3.385 s       │ 124.6 ms      │ 189.2 ms      │ 250     │ 250
   │  │  ╰─ Uint<64>       209.8 ms      │ 33.87 s       │ 1.622 s       │ 2.737 s       │ 250     │ 250
   │  ├─ rayon_find_any                  │               │               │               │         │
   │  │  ├─ Uint<16>       2.885 ms      │ 17.25 ms      │ 5.252 ms      │ 5.946 ms      │ 250     │ 250
   │  │  ├─ Uint<32>       20.37 ms      │ 270.7 ms      │ 41.78 ms      │ 56.22 ms      │ 250     │ 250
   │  │  ╰─ Uint<64>       156.7 ms      │ 3.187 s       │ 558.3 ms      │ 683.7 ms      │ 250     │ 250
   │  ├─ rayon_find_any2 <–– THIS PR     │               │               │               │         │ 
   │  │  ├─ Uint<16>       2.825 ms      │ 14.61 ms      │ 5.367 ms      │ 5.742 ms      │ 250     │ 250
   │  │  ├─ Uint<32>       20.23 ms      │ 175.4 ms      │ 46.5 ms       │ 56.24 ms      │ 250     │ 250
   │  │  ╰─ Uint<64>       156.6 ms      │ 2.749 s       │ 523 ms        │ 703.5 ms      │ 250     │ 250
   │  ├─ rayon_join                      │               │               │               │         │
   │  │  ├─ Uint<16>       4.125 ms      │ 16.74 ms      │ 5.569 ms      │ 6.353 ms      │ 250     │ 250
   │  │  ├─ Uint<32>       30.1 ms       │ 256.4 ms      │ 41.14 ms      │ 55.48 ms      │ 250     │ 250
   │  │  ╰─ Uint<64>       233.8 ms      │ 3.044 s       │ 608.6 ms      │ 693.1 ms      │ 250     │ 250
   │  ╰─ single  <–– MASTER              │               │               │               │         │
   │     ├─ Uint<16>       2.276 ms      │ 91.08 ms      │ 14.47 ms      │ 18.83 ms      │ 250     │ 250
   │     ├─ Uint<32>       17.33 ms      │ 1.277 s       │ 180.3 ms      │ 256.4 ms      │ 250     │ 250
   │     ╰─ Uint<64>       156.2 ms      │ 18.13 s       │ 3.005 s       │ 4.071 s       │ 250     │ 250

There are likely ways to do better than what is presented here, but this is a library, not a prime-searching application, and there is value in keeping things simple and nimble. A more efficient parallel solution would probably need more code to synchronize threads (channels or shared memory); indeed, the reason chili is slower than rayon here is that the latter has more facilities in place to halt jobs once a prime is found.
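The halt-on-find coordination described above can be sketched with std-only primitives (no rayon), using a shared AtomicBool that each worker polls between candidates. The trial-division `is_prime` is a toy stand-in for the library's real primality tests, and `par_find_prime` is a hypothetical name, not this PR's API:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;

// Toy primality check standing in for the real Miller-Rabin/Lucas tests.
fn is_prime(n: u64) -> bool {
    if n < 2 {
        return false;
    }
    let mut d = 2;
    while d * d <= n {
        if n % d == 0 {
            return false;
        }
        d += 1;
    }
    true
}

// Each thread scans its own arithmetic progression of candidates
// (start + t, stepping by `threads` to avoid overlap) and checks the
// shared flag between candidates so it can stop as soon as any thread wins.
fn par_find_prime(start: u64, threads: u64) -> u64 {
    let found = Arc::new(AtomicBool::new(false));
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();
    for t in 0..threads {
        let found = Arc::clone(&found);
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            let mut n = start + t;
            while !found.load(Ordering::Relaxed) {
                if is_prime(n) {
                    found.store(true, Ordering::Relaxed);
                    let _ = tx.send(n);
                    return;
                }
                n += threads;
            }
        }));
    }
    drop(tx); // so recv() can't hang on our own sender
    let prime = rx.recv().expect("some thread finds a prime");
    for h in handles {
        let _ = h.join();
    }
    prime
}

fn main() {
    let p = par_find_prime(1_000_000, 4);
    assert!(is_prime(p) && p >= 1_000_000);
    println!("found prime {p}");
}
```

Note the result is nondeterministic: whichever thread finds a prime first wins, which mirrors the variability visible in the fastest/slowest benchmark columns.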


codecov bot commented Oct 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.37%. Comparing base (056781d) to head (b273519).
Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master      #60   +/-   ##
=======================================
  Coverage   99.37%   99.37%           
=======================================
  Files           9        9           
  Lines        1280     1280           
=======================================
  Hits         1272     1272           
  Misses          8        8           


@fjarri (Member) commented Oct 22, 2024

A 5x speedup for 8 threads is a pretty good result. I wonder how safe prime generation will behave.

Did you try using several Sieve objects in any of the other implementations? That would require synchronization between threads though, so that you could cancel them as soon as you find something.

@dvdplm (Contributor, author) commented Oct 22, 2024

Did you try using several Sieve objects in any of the other implementations? That would require synchronization between threads though, so that you could cancel them as soon as you find something.

I did, and you are correct that the issue there is having a good way of halting the search on the other threads. In my implementation I did a take(32) on the Sieve and handed out chunks of candidates to check. That way each task would waste only a bounded amount of time after a different task had already found a prime. The mean/median time is actually not too bad with such a heavy-handed approach, but the fastest/slowest times suffer.
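The chunked approach described here might look roughly like the following std-only sketch. A plain numeric range stands in for the library's Sieve iterator, the chunk size of 32 mirrors the take(32) mentioned above, and `find_prime_chunked` with its toy `is_prime` are illustrative names, not code from this PR:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// Toy trial-division check standing in for the real primality tests.
fn is_prime(n: u64) -> bool {
    n >= 2 && (2u64..).take_while(|d| d * d <= n).all(|d| n % d != 0)
}

// Hand each thread a fixed chunk of candidates drawn up front from a
// shared iterator (standing in for `Sieve`). A thread polls the stop
// flag between candidates, so at most one partial chunk of work is
// wasted once another thread succeeds. Returns the smallest prime found
// in this single round of chunks, or None if no chunk contained one.
fn find_prime_chunked(
    mut candidates: impl Iterator<Item = u64>,
    chunk_size: usize,
    threads: usize,
) -> Option<u64> {
    let stop = Arc::new(AtomicBool::new(false));
    let mut handles = Vec::new();
    for _ in 0..threads {
        let chunk: Vec<u64> = candidates.by_ref().take(chunk_size).collect();
        let stop = Arc::clone(&stop);
        handles.push(thread::spawn(move || {
            for n in chunk {
                if stop.load(Ordering::Relaxed) {
                    return None; // another task already found a prime
                }
                if is_prime(n) {
                    stop.store(true, Ordering::Relaxed);
                    return Some(n);
                }
            }
            None
        }));
    }
    handles
        .into_iter()
        .filter_map(|h| h.join().ok().flatten())
        .min()
}

fn main() {
    let p = find_prime_chunked(1_000u64.., 32, 4).expect("a chunk held a prime");
    assert!(is_prime(p));
    println!("found prime {p}");
}
```

The bounded-waste property comes from the fixed chunk size: a thread never commits to more than 32 candidates, which is why the mean/median hold up while the tail latencies suffer.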

@dvdplm (Contributor, author) commented Nov 5, 2024

I wonder how safe prime generation will behave.

Finally got around to benchmarking this yesterday:

benches                  fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ benches                             │               │               │               │         │
   ╰─ pargen                           │               │               │               │         │
      ╰─ safe_primes                   │               │               │               │         │
         ├─ rayon                      │               │               │               │         │
         │  ├─ Uint<16>  12.21 ms      │ 162 ms        │ 74.58 ms      │ 74.78 ms      │ 13      │ 13
         │  ├─ Uint<32>  319 ms        │ 5.105 s       │ 1.97 s        │ 2.297 s       │ 10      │ 10
         │  ╰─ Uint<64>  11.53 s       │ 3.058 m       │ 1.371 m       │ 1.533 m       │ 10      │ 10
         ╰─ single                     │               │               │               │         │
            ├─ Uint<16>  35.51 ms      │ 2.04 s        │ 530.9 ms      │ 699.2 ms      │ 10      │ 10
            ├─ Uint<32>  762 ms        │ 2.54 m        │ 15.73 s       │ 29.9 s        │ 10      │ 10
            ╰─ Uint<64>  10.26 s       │ 45.91 m       │ 8.388 m       │ 14.11 m       │ 10      │ 10

The number of samples is much lower in this run because of how long it takes to find big safe primes, and the code uses all available cores (as opposed to the results in the PR description, which used half the available cores). The median times are 6–8x faster using 16 cores, and there seems to be an even bigger difference for the slowest/fastest cases (but the sample size is perhaps too small to really tell).

I think this is pretty much what I expected. Thoughts?

@dvdplm dvdplm marked this pull request as ready for review November 5, 2024 14:33
@fjarri (Member) commented Nov 5, 2024

I think we can deal with #62 in a subsequent PR

@fjarri added the "enhancement" (New feature or request) label Nov 5, 2024
@fjarri merged commit 3918fc4 into master Nov 8, 2024
13 checks passed
@fjarri deleted the dp-rayon-prime-gen branch Nov 8, 2024 01:28
@dvdplm mentioned this pull request Dec 5, 2024