
Parallel prime generation #60

Merged: 15 commits, Nov 8, 2024
Conversation

@dvdplm (Contributor) commented Oct 21, 2024

This adds a new function to the public API, named "par_generate_prime_with_rng", that parallelizes the search for primes.

Outstanding tasks:

  • Add the corresponding "par_generate_safe_prime_with_rng"
  • Pick the number of threads to use
  • Decide whether it is useful for end users to be able to set the number of threads (and if so, what the API for that should be: generics? another function arg? an env variable?) (7491a06 adds another function arg)
  • Consider a name change (deferred to Rename prime finding methods #62)
  • More and better tests
  • CHANGELOG entry and better docs
  • Write criterion benchmarks
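For the "pick the number of threads" task, the standard library already exposes a sensible runtime default that a caller-overridable argument could fall back to. This is only a sketch of that fallback pattern (the function name `effective_threads` is hypothetical, not part of this PR):

```rust
use std::num::NonZeroUsize;
use std::thread;

// Hypothetical helper: a caller-supplied thread count always wins;
// otherwise fall back to the parallelism the runtime reports, and
// finally to 1 if even that query fails.
fn effective_threads(requested: Option<NonZeroUsize>) -> NonZeroUsize {
    requested
        .or_else(|| thread::available_parallelism().ok())
        .unwrap_or(NonZeroUsize::new(1).unwrap())
}

fn main() {
    // Explicit request takes precedence over the detected default.
    assert_eq!(effective_threads(NonZeroUsize::new(4)).get(), 4);
    println!("default: {} threads", effective_threads(None));
}
```

This keeps the common call site simple (`None` means "use all cores") while still letting benchmarks pin a fixed count.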

Results

During this work I have tried out several different parallel implementations, using rayon and chili. The latter is a new take on low-overhead parallelism and is attractive for its small code size and advertised performance, but in my tests I have not seen good performance from it (see below).

When searching for random primes we do not know how many iterations it will take, only that we will eventually find one. This shows up in the benchmarks as large variability, as seen in the "fastest" and "slowest" columns below.

Unsurprisingly, the larger the prime we're looking for, the more we benefit from parallelizing the work. For Uint<64>s (aka U4096), using 8 CPUs is 6x faster; for Uint<16> the speedup is more like 3x.

The benchmarks use the divan benchmarking library and ran on a MacBook Pro M3 Max, using 8 threads. The table compares different implementations and sizes. The implementation in this PR is labeled "rayon_find_any2":

benches                    fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ benches                               │               │               │               │         │
   ├─ pargen                             │               │               │               │         │
   │  ├─ chili_join2                     │               │               │               │         │
   │  │  ├─ Uint<16>       3.248 ms      │ 54.54 ms      │ 11.3 ms       │ 13.97 ms      │ 250     │ 250
   │  │  ├─ Uint<32>       24.23 ms      │ 3.385 s       │ 124.6 ms      │ 189.2 ms      │ 250     │ 250
   │  │  ╰─ Uint<64>       209.8 ms      │ 33.87 s       │ 1.622 s       │ 2.737 s       │ 250     │ 250
   │  ├─ rayon_find_any                  │               │               │               │         │
   │  │  ├─ Uint<16>       2.885 ms      │ 17.25 ms      │ 5.252 ms      │ 5.946 ms      │ 250     │ 250
   │  │  ├─ Uint<32>       20.37 ms      │ 270.7 ms      │ 41.78 ms      │ 56.22 ms      │ 250     │ 250
   │  │  ╰─ Uint<64>       156.7 ms      │ 3.187 s       │ 558.3 ms      │ 683.7 ms      │ 250     │ 250
   │  ├─ rayon_find_any2 <–– THIS PR     │               │               │               │         │ 
   │  │  ├─ Uint<16>       2.825 ms      │ 14.61 ms      │ 5.367 ms      │ 5.742 ms      │ 250     │ 250
   │  │  ├─ Uint<32>       20.23 ms      │ 175.4 ms      │ 46.5 ms       │ 56.24 ms      │ 250     │ 250
   │  │  ╰─ Uint<64>       156.6 ms      │ 2.749 s       │ 523 ms        │ 703.5 ms      │ 250     │ 250
   │  ├─ rayon_join                      │               │               │               │         │
   │  │  ├─ Uint<16>       4.125 ms      │ 16.74 ms      │ 5.569 ms      │ 6.353 ms      │ 250     │ 250
   │  │  ├─ Uint<32>       30.1 ms       │ 256.4 ms      │ 41.14 ms      │ 55.48 ms      │ 250     │ 250
   │  │  ╰─ Uint<64>       233.8 ms      │ 3.044 s       │ 608.6 ms      │ 693.1 ms      │ 250     │ 250
   │  ╰─ single  <–– MASTER              │               │               │               │         │
   │     ├─ Uint<16>       2.276 ms      │ 91.08 ms      │ 14.47 ms      │ 18.83 ms      │ 250     │ 250
   │     ├─ Uint<32>       17.33 ms      │ 1.277 s       │ 180.3 ms      │ 256.4 ms      │ 250     │ 250
   │     ╰─ Uint<64>       156.2 ms      │ 18.13 s       │ 3.005 s       │ 4.071 s       │ 250     │ 250

There are likely ways to do better than what is presented here, but this is a library, not a prime-searching application, and there is value in keeping things simple and nimble. A more efficient parallel solution would probably need more code to synchronize threads (channels or shared memory); indeed, the reason chili is slower than rayon here is that the latter has more facilities in place to halt jobs once a prime is found.
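The halt-on-find coordination described above can be sketched with std-only primitives (no rayon), using a shared AtomicBool that each worker polls between candidates. The trial-division `is_prime` is a toy stand-in for the library's real primality tests, and `par_find_prime` is a hypothetical name, not this PR's API:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;

// Toy primality check standing in for the real Miller-Rabin/Lucas tests.
fn is_prime(n: u64) -> bool {
    if n < 2 {
        return false;
    }
    let mut d = 2;
    while d * d <= n {
        if n % d == 0 {
            return false;
        }
        d += 1;
    }
    true
}

// Each thread scans its own arithmetic progression of candidates
// (start + t, stepping by `threads` to avoid overlap) and checks the
// shared flag between candidates so it can stop as soon as any thread wins.
fn par_find_prime(start: u64, threads: u64) -> u64 {
    let found = Arc::new(AtomicBool::new(false));
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();
    for t in 0..threads {
        let found = Arc::clone(&found);
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            let mut n = start + t;
            while !found.load(Ordering::Relaxed) {
                if is_prime(n) {
                    found.store(true, Ordering::Relaxed);
                    let _ = tx.send(n);
                    return;
                }
                n += threads;
            }
        }));
    }
    drop(tx); // so recv() can't hang on our own sender
    let prime = rx.recv().expect("some thread finds a prime");
    for h in handles {
        let _ = h.join();
    }
    prime
}

fn main() {
    let p = par_find_prime(1_000_000, 4);
    assert!(is_prime(p) && p >= 1_000_000);
    println!("found prime {p}");
}
```

Note the result is nondeterministic: whichever thread finds a prime first wins, which mirrors the variability visible in the fastest/slowest benchmark columns.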


codecov bot commented Oct 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.37%. Comparing base (056781d) to head (b273519).
Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master      #60   +/-   ##
=======================================
  Coverage   99.37%   99.37%           
=======================================
  Files           9        9           
  Lines        1280     1280           
=======================================
  Hits         1272     1272           
  Misses          8        8           


@fjarri (Member) commented Oct 22, 2024

A 5x speedup for 8 threads is a pretty good result. I wonder how safe prime generation will behave.

Did you try using several Sieve objects in any of the other implementations? That would require synchronization between threads though, so that you could cancel them as soon as you find something.

@dvdplm (Contributor, author) commented Oct 22, 2024

Did you try using several Sieve objects in any of the other implementations? That would require synchronization between threads though, so that you could cancel them as soon as you find something.

I did, and you are correct that the issue there is having a good way of halting the search on the other threads. In my implementation I did a take(32) on the Sieve and handed out chunks of candidates to check. That way each task would waste only a bounded amount of time after a different task had already found a prime. The mean/median time is actually not too bad with such a heavy-handed approach, but the fastest/slowest times suffer.
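The chunked approach described here might look roughly like the following std-only sketch. A plain numeric range stands in for the library's Sieve iterator, the chunk size of 32 mirrors the take(32) mentioned above, and `find_prime_chunked` with its toy `is_prime` are illustrative names, not code from this PR:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// Toy trial-division check standing in for the real primality tests.
fn is_prime(n: u64) -> bool {
    n >= 2 && (2u64..).take_while(|d| d * d <= n).all(|d| n % d != 0)
}

// Hand each thread a fixed chunk of candidates drawn up front from a
// shared iterator (standing in for `Sieve`). A thread polls the stop
// flag between candidates, so at most one partial chunk of work is
// wasted once another thread succeeds. Returns the smallest prime found
// in this single round of chunks, or None if no chunk contained one.
fn find_prime_chunked(
    mut candidates: impl Iterator<Item = u64>,
    chunk_size: usize,
    threads: usize,
) -> Option<u64> {
    let stop = Arc::new(AtomicBool::new(false));
    let mut handles = Vec::new();
    for _ in 0..threads {
        let chunk: Vec<u64> = candidates.by_ref().take(chunk_size).collect();
        let stop = Arc::clone(&stop);
        handles.push(thread::spawn(move || {
            for n in chunk {
                if stop.load(Ordering::Relaxed) {
                    return None; // another task already found a prime
                }
                if is_prime(n) {
                    stop.store(true, Ordering::Relaxed);
                    return Some(n);
                }
            }
            None
        }));
    }
    handles
        .into_iter()
        .filter_map(|h| h.join().ok().flatten())
        .min()
}

fn main() {
    let p = find_prime_chunked(1_000u64.., 32, 4).expect("a chunk held a prime");
    assert!(is_prime(p));
    println!("found prime {p}");
}
```

The bounded-waste property comes from the fixed chunk size: a thread never commits to more than 32 candidates, which is why the mean/median hold up while the tail latencies suffer.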

@dvdplm (Contributor, author) commented Nov 5, 2024

I wonder how safe prime generation will behave.

Finally got around to benchmarking this yesterday:

benches                  fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ benches                             │               │               │               │         │
   ╰─ pargen                           │               │               │               │         │
      ╰─ safe_primes                   │               │               │               │         │
         ├─ rayon                      │               │               │               │         │
         │  ├─ Uint<16>  12.21 ms      │ 162 ms        │ 74.58 ms      │ 74.78 ms      │ 13      │ 13
         │  ├─ Uint<32>  319 ms        │ 5.105 s       │ 1.97 s        │ 2.297 s       │ 10      │ 10
         │  ╰─ Uint<64>  11.53 s       │ 3.058 m       │ 1.371 m       │ 1.533 m       │ 10      │ 10
         ╰─ single                     │               │               │               │         │
            ├─ Uint<16>  35.51 ms      │ 2.04 s        │ 530.9 ms      │ 699.2 ms      │ 10      │ 10
            ├─ Uint<32>  762 ms        │ 2.54 m        │ 15.73 s       │ 29.9 s        │ 10      │ 10
            ╰─ Uint<64>  10.26 s       │ 45.91 m       │ 8.388 m       │ 14.11 m       │ 10      │ 10

The number of samples is much lower in this run because of how long it takes to find big safe primes, and the code uses all available cores (as opposed to the results in the PR description, which used half the available cores). The median times are 6–8x faster using 16 cores, and there seems to be an even bigger difference for the slowest/fastest cases (but the sample size is perhaps too small to really tell).

I think this is pretty much what I expected. Thoughts?

@dvdplm dvdplm marked this pull request as ready for review November 5, 2024 14:33
@fjarri (Member) commented Nov 5, 2024

I think we can deal with #62 in a subsequent PR

@fjarri added the "enhancement" (New feature or request) label Nov 5, 2024
@fjarri merged commit 3918fc4 into master Nov 8, 2024
13 checks passed
@fjarri deleted the dp-rayon-prime-gen branch Nov 8, 2024 01:28
@dvdplm mentioned this pull request Dec 5, 2024