Very slow and memory inefficient algorithm used to sieve primes... #1
Hi @GordonBGood, I agree that the implementation is not really good (this is why the test for primality uses deterministic Miller-Rabin for bigger ints, and probabilistic Miller-Rabin for even bigger numbers). But I am confused about your point 1 (and consequently 3 and 4 as well). The documentation for intsets says, literally:
Now, I admit I never looked at the implementation, but if I understand correctly this means that a set of integers is represented by some pattern of bits like 00010110101001...001001, where a bit in position k means that k is in the set. For a billion that would be 10^9 bits ≈ 125 MB, not multiple Gigabytes, and it certainly doesn't use multiple bytes per entry. I agree this is quite a lot: primes are sparse, so most of those bits will be 0. Since the ratio of populated entries is about 1 / log n, for bigger n there are certainly more efficient data structures, but for smaller n one bit per entry is reasonable.

Also, I don't understand why in 3 you speak about hashes: if I understand correctly, there should be no hashing going on in this implementation. Unless, of course, I have totally misunderstood what intsets are. Can you explain a little better why you speak of multiple bytes per entry?

BTW, I completely agree that one can skip even numbers, or for that matter various residue classes modulo k for as big a k as you care to implement.
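To make that mental model concrete, here is a sketch of the flat representation being described (my illustration, not the actual intsets internals): membership of k is one bit in a packed word array, so the size depends only on the range, not on how many members are present.

```nim
# One flat bit per integer in 0 ..< 10^9: 10^9 / 8 = 125_000_000 bytes,
# i.e. about 125 MB, independent of how many members the set holds.
var bits = newSeq[uint32](1_000_000_000 div 32 + 1)

proc incl(s: var seq[uint32]; k: int) =
  s[k shr 5] = s[k shr 5] or (1u32 shl (k and 31))

proc contains(s: seq[uint32]; k: int): bool =
  (s[k shr 5] and (1u32 shl (k and 31))) != 0

bits.incl(17)
echo 17 in bits   # true: bit (17 and 31) of word (17 shr 5) is set
```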
Yes, I appreciate that, but with the sieve being so slow, you have to switch to the more complex math sooner than you should have to.
I did have a quick look at the implementation, and while it's true that it uses one bit per number in a packed 32-bit representation, those bits live inside a nested (so that it can be sparse) hashed tree structure used to quickly find the populated words. That is as far as I looked, as it already explained why it averages much more than one bit per represented composite number, and also why it is slow in having to do hashing. As promoted, it is faster per operation than a full hash table, but no, it isn't a simple bit-packed array using the amount of memory you calculated, although it does use some bit-packed words within the hash tables to save some space.

To save much more space and time, it is quite trivial to implement a "monolithic" bit-packed array version, which would use only 62.5 Megabytes for a range of a billion "odds-only" and would reduce the culling time by about a third (not more, because of the remaining cache associativity problem). This type of "naive" SoE (other than moving up one step to bit-packing and odds-only) can be found anywhere, including RosettaCode. I highly recommend the page segmented version as it avoids the cache associativity problem, at the cost of 150 LOC or so.
I've done quite a lot of work with maximal wheel factorization as you describe, and while one can implement it efficiently with wheels eliminating the multiples of all primes up to 7 or even 11, anything above that doesn't offer much gain for the cost. The efficient ways of doing this cost another few hundred LOC. One of these days I'll be posting a way of doing this on the forum that uses Nim to be faster than Kim Walisch's "primesieve", with the main hold-up ATM being the instability of how Nim does threading.
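For a concrete picture of what wheel factorization buys (my own illustration, not the code referred to above): a 2-3-5 wheel keeps only the 8 residues modulo 30 that are coprime to 30, cutting the candidates scanned to under 27% of the range.

```nim
# Residues modulo 30 that are coprime to 30: only these positions can hold
# primes greater than 5, so the wheel scans 8 of every 30 numbers (~27%).
const wheel30 = block:
  var res: seq[int]
  for r in 1 .. 29:
    if r mod 2 != 0 and r mod 3 != 0 and r mod 5 != 0:
      res.add(r)
  res                                  # @[1, 7, 11, 13, 17, 19, 23, 29]

iterator wheelCandidates(limit: int): int =
  ## yields 2, 3, 5 and then every number on the 2-3-5 wheel up to `limit`
  for p in [2, 3, 5]:
    if p <= limit: yield p
  var base = 0
  while base < limit:
    for r in wheel30:
      let n = base + r
      if n > 5 and n <= limit:         # skip the residue 1 on the first turn
        yield n                        # 7, 11, 13, ..., 29, 31, 37, ...
    base += 30
```

A 2-3-5-7 wheel (mod 210) cuts the candidates further, to 48/210 ≈ 23%, but as noted above the returns diminish quickly while the bookkeeping grows.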
Thank you, I did not know the internals of intsets. I can easily replace that with custom intsets and make it odds-only, or switch to your version; I am still not sure. But in any case, I expect that testing a single prime will be much faster with deterministic Miller-Rabin.
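For what it's worth, here is a sketch of such a deterministic test (my code, relying on the known result that the witnesses 2, 7 and 61 suffice for every n < 2^32):

```nim
proc isPrime32(n: uint32): bool =
  ## Deterministic Miller-Rabin for 32-bit inputs: the witness set
  ## {2, 7, 61} is known to be sufficient for every n < 2^32.
  if n < 2: return false
  for p in [2u32, 7u32, 61u32]:       # small-prime shortcut
    if n == p: return true
    if n mod p == 0: return false
  let m = uint64(n)
  var d = m - 1
  var r = 0
  while (d and 1) == 0:               # write n - 1 as d * 2^r with d odd
    d = d shr 1
    inc r
  for a in [2u64, 7u64, 61u64]:
    var x = 1u64                      # x = a^d mod n by binary exponentiation
    var b = a
    var e = d
    while e > 0:
      if (e and 1) == 1: x = x * b mod m
      b = b * b mod m                 # operands < 2^32, so no 64-bit overflow
      e = e shr 1
    if x == 1 or x == m - 1: continue
    var composite = true
    for _ in 1 ..< r:                 # repeated squaring, looking for -1 mod n
      x = x * x mod m
      if x == m - 1:
        composite = false
        break
    if composite: return false        # a witnesses that n is composite
  return true

echo isPrime32(1_000_000_007u32)      # true: a well-known prime
```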
@andreaferretti:
I think you'll find that (my RosettaCode) page segmented version is immensely faster than anything you can do with intsets if you don't mind the number of LOC, and even it is limited by the time it takes to iterate over the found primes. Further refinements using multi-threading can likely reduce the time by another factor of three or so in spite of that iteration cost, down to a fraction of a second to iterate over the primes to a billion on a reasonable CPU. If you don't need to iterate over the primes it can be made even faster, down to "primesieve"-type speeds, but I assume that you do. If you want to stay simple, an odds-only bit-packed "monolithic" array will do it, takes only one bit per odd number over the sieving range, and is about six times faster than your intset version in spite of the huge array for a range of a billion:

```nim
import std/math

iterator primesUpTo(top: uint): uint =
  if top < 3:                                # guard tiny ranges; avoids uint underflow below
    if top >= 2: yield 2
    return
  let topndx = int((top - 3) div 2)          # index of the last odd candidate
  let sqrtndx = (int(sqrt(float64(top))) - 3) div 2
  var cmpsts = newSeq[uint32](topndx div 32 + 1)
  for i in 0 .. sqrtndx: # cull composites for primes
    if (cmpsts[i shr 5] and (1u32 shl (i and 31))) == 0:
      let p = i + i + 3
      for j in countup((p * p - 3) div 2, topndx, p):
        cmpsts[j shr 5] = cmpsts[j shr 5] or (1u32 shl (j and 31))
  yield 2 # separate culling above and iteration here
  for i in 0 .. topndx:
    if (cmpsts[i shr 5] and (1u32 shl (i and 31))) == 0:
      yield uint(i + i + 3)
```
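As a quick check of that iterator (my example, using the known prime count π(10^9) = 50,847,534):

```nim
var count = 0
for p in primesUpTo(1_000_000_000u):
  inc count
echo count   # 50847534 primes up to one billion
```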
This repo's implementation of `sieveUpTo` (called by `primesUpTo*`) has got to be one of the slowest and most wasteful of memory implementations of the Sieve of Eratosthenes of any I have seen. The problems are as follows:

Using `IntSet` as its base storage, it spends multiple bytes per stored composite entry; the memory use is so high that it can't sieve even up to a billion, as it consumes Gigabytes over that range. It has the single advantage of being short. However, it can't sieve to a billion using a reasonable amount of RAM, and even if it could, it would take about a minute or so.
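For illustration, the general pattern being criticized might look like the following sketch (my reconstruction, not the repo's exact code): every composite over the whole range is inserted into the `IntSet`, so each insertion and lookup pays the hashing and tree-node overhead described above.

```nim
import std/[intsets, math]

proc sievedPrimes(top: int): seq[int] =
  ## naive Sieve of Eratosthenes over an IntSet of composites
  var composites = initIntSet()
  for p in 2 .. int(sqrt(float64(top))):
    if p notin composites:
      for c in countup(p * p, top, p):
        composites.incl(c)      # one hashed-trie insertion per composite
  for n in 2 .. top:
    if n notin composites:
      result.add(n)
```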
A much better short implementation, which greatly reduces the memory consumption so that very large ranges can be sieved (although still somewhat slowly due to the hashing), is as follows:
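(A minimal sketch of that approach: my reconstruction in Nim of the classic incremental sieve, not necessarily the exact code that was posted here. It assumes a `Table[int, int]` mapping each upcoming odd composite to its culling step.)

```nim
import std/tables

# Odds-only incremental Sieve of Eratosthenes with a recursive secondary
# feed of base primes: the table only ever holds one entry per base prime
# below the square root of the current candidate.
proc oddPrimes(): iterator (): int =
  result = iterator (): int =
    yield 3; yield 5; yield 7          # seed primes so the recursion bottoms out
    var composites = initTable[int, int]()
    let basePrimes = oddPrimes()       # secondary feed, itself sieved the same way
    var p = basePrimes()               # first base prime: 3
    var q = p * p                      # its square: the first candidate to cull
    var n = 9
    while true:
      if composites.hasKey(n):         # n is a known composite: slide its entry
        let step = composites[n]
        composites.del(n)
        var nxt = n + step
        while composites.hasKey(nxt): nxt += step
        composites[nxt] = step
      elif n < q:
        yield n                        # below the next square and unmarked: prime
      else:                            # n == q: start culling for base prime p
        let step = 2 * p
        var nxt = n + step
        while composites.hasKey(nxt): nxt += step
        composites[nxt] = step
        p = basePrimes()
        q = p * p
      n += 2
```

Successive calls on an instance (`let gen = oddPrimes(); gen()`) yield 3, 5, 7, 11, ..., so a caller only needs to prepend 2.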
This is an "incremental" Odds-Only Sieve of Eratosthenes and only stores a hash entry for each of the base primes up to the square root of the current iteration value (determined automatically by the recursive secondary base primes feed). It still takes about a minute to count the number of primes to a billion, but at least it doesn't take an excessive amount of memory to do so. Its output also doesn't consume memory as it is just an iterator over the "incrementally" found primes.
If being so short isn't a consideration, the Page Segmented Bit-Packed Odds-Only Unbounded Sieve of Eratosthenes from RosettaCode is fast, taking only about a second to sieve to a billion; it consumes even less memory and can relatively easily have multi-threading added, but it is over a hundred lines of code.
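The full RosettaCode version is too long to quote, but a compact sketch of the page-segmentation idea (my code, with an assumed 256 KB segment size) shows the core trick: composites are culled one cache-sized window at a time, so the working set stays in cache instead of streaming over a gigantic array.

```nim
import std/math

const segBits = 256 * 1024 * 8        # one 256 KB segment, one bit per odd number

iterator segmentedPrimes(top: int): int =
  if top >= 2: yield 2
  if top < 3: return
  # first, a small monolithic sieve collects the base primes up to sqrt(top)
  let sqrtTop = int(sqrt(float64(top)))
  var small = newSeq[bool]((sqrtTop - 1) div 2)  # index i -> odd 2*i + 3
  var bases: seq[int]
  for i in 0 ..< small.len:
    if not small[i]:
      let p = 2 * i + 3
      bases.add(p)
      var j = (p * p - 3) div 2
      while j < small.len:
        small[j] = true
        j += p
  # then the odd numbers are culled one cache-sized segment at a time
  var seg = newSeq[uint32](segBits div 32)
  var low = 3                          # first odd number covered by this segment
  while low <= top:
    for w in seg.mitems: w = 0         # clear the segment's bits
    for p in bases:
      if p * p > low + 2 * segBits: break
      var start = max(p * p, ((low + p - 1) div p) * p)  # first multiple >= low
      if start mod 2 == 0: start += p  # make it an odd multiple
      var j = (start - low) div 2
      while j < segBits:
        seg[j shr 5] = seg[j shr 5] or (1u32 shl (j and 31))
        j += p                         # consecutive odd multiples are 2*p apart
    for i in 0 ..< segBits:            # sweep the segment for survivors
      let n = low + 2 * i
      if n > top: break
      if (seg[i shr 5] and (1u32 shl (i and 31))) == 0:
        yield n
    low += 2 * segBits
```

The segment size is the tuning knob here: it should fit in L1 or L2 cache, which is exactly the cache associativity point made in the comments above.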