This is an implementation of Bloom Filters for Elixir. It uses NIFs written in Rust for better performance, since Bloom Filters rely highly on mutability.
TL;DR: Huge amount of data ➜ small Bloom Filter : Was X not in Huge amount of data?
For more about this topic consider, checking out:
- Bloom Filters by Jason Davis
- Bloom Filters, Mining of Massive Datasets, Stanford University on Youtube
The package can be installed by adding flower
to your list of dependencies in mix.exs
:
def deps do
[
{:flower, "~> 0.1.4"},
]
end
Also, you need to have Rust installed for development, since this uses NIFs.
Docs can be found at https://hexdocs.pm/flower.
Documentation can be generated with ExDoc.
There is also a small walk through in the Example Section
alias Flower.Bloom, as: Bloom
expected_number_of_unique_elements = 1000
# With a size atom
iex> filter = Bloom.new(:"8 KB", expected_number_of_unique_elements)
...
# Or by a maximum byte size:
iex> filter = Bloom.new_by_byte_size(8 * 1024, expected_number_of_unique_elements)
...
# Or by a bit address length:
iex> filter = Bloom.new(16, expected_number_of_unique_elements)
...
Bit Address Length | Size Atom | Bit Address Length | Size Atom | Bit Address Length | Size Atom |
---|---|---|---|---|---|
13 | 1 KB | 23 | 1 MB | ||
14 | 2 KB | 24 | 2 MB | ||
15 | 4 KB | 25 | 4 MB | ||
6 | 8 Byte | 16 | 8 KB | 26 | 8 MB |
7 | 16 Byte | 17 | 16 KB | 27 | 16 MB |
8 | 32 Byte | 18 | 32 KB | 28 | 32 MB |
9 | 64 Byte | 19 | 64 KB | 29 | 64 MB |
10 | 128 Byte | 20 | 128 KB | 30 | 128 MB |
11 | 256 Byte | 21 | 256 KB | 31 | 256 MB |
12 | 512 Byte | 22 | 512 KB | 32 | 512 MB |
iex> Bloom.insert(filter, 42)
:ok
iex> Bloom.insert(filter, [1,2,{:atom, <<1,2,3,4>>}])
:ok
iex> Bloom.insert(filter, "Hello!")
:ok
iex> Bloom.has_not?(filter, "Hello!")
false
iex> Bloom.has?(filter, [1,2,{:atom, <<1,2,3,4>>}])
true
iex> Bloom.has?(filter, :atom)
false
# `has?` is always the opposite of `has_not?`.
iex> Bloom.has?(filter, 42) != Bloom.has_not?(filter, 42)
true
Was actually inserted? | has? | has_not? |
---|---|---|
yes | yes | no |
no | most of the times: no | most of the times: yes |
filename = "prime_filter.bloom"
file = File.stream!(filename, [:delayed_write, :binary], 8096)
(Bloom.stream(filter) |> Stream.into(file) |> Stream.run)
filename = "prime_filter.bloom"
file = File.stream!(filename, [:read_ahead, :binary], 8096)
{:ok, new_filter} = Bloom.from_stream(file)
iex> Bloom.estimate_count(filter)
3
iex> Bloom.false_positive_probability(filter)
3.2348227494719115e-28
# This number is only that small, because we have just 3 elements in 8 KBs.
Checking if a number is not a prime and below 100_000
.
# helper module
defmodule Step do
def throught(from, to, step, func) when from < to do
func.(from)
throught(from+step, to, step, func)
end
def throught(_,_,_,_), do: :ok
end && :ok
# alias so we need to write less
alias Flower.Bloom, as: Bloom
# Select appropriate size.
# We will put about 100_000 elements in.
# 64 KB = 512 KBit
# 512 KBit / 100_000 ~= 5.25 Bits per element.
# Meeh. Let's see...
non_primes = Bloom.new(:"64 KB", 100_000)
# We can also write
non_primes = Bloom.new(19, 100_000)
# or (it will choose the next smaller size)
non_primes = Bloom.new_by_byte_size(64 * 1024, 100_000)
# Let's put some non primes in:
Bloom.insert(non_primes, 42)
Bloom.insert(non_primes, "one hundred")
Bloom.insert(non_primes, [:ok, {Anything, 1.0e9}])
# Let's double check:
Bloom.has?(non_primes, 42)
Bloom.has?(non_primes, "one hundred")
Bloom.has?(non_primes, 7)
Bloom.has?(non_primes, [:ok, {Anything, 1.0e9}])
Bloom.has?(non_primes, 52)
# Works!
# Now we can get a estimate of
# the number of items we
# put in.
Bloom.estimate_count(non_primes)
# Apply Sieve of Eratosthenes.
# This may take a few seconds.
# At least we can do it with
# constant memory.
for x <- 2..50_000 do
# Skip multiples of previous numbers.
# This is actually not safe to do
# with a bloom filter since they are
# approximations. But let's don't care for now.
if(Bloom.has_not?(non_primes, x)) do
Step.throught(x*2, 100_000, x, fn non_prime ->
# Much of the time is used to calculate
# hashes.
Bloom.insert(non_primes, non_prime)
end)
end
end && :ok
non_primes |> Bloom.has_not?(12) # not a prime
non_primes |> Bloom.has_not?(11) # prime
non_primes |> Bloom.has_not?(6719) # no prime
non_primes |> Bloom.has_not?(4245) # prime
non_primes |> Bloom.has_not?(9973) # no prime
non_primes |> Bloom.has_not?(3549) # prime
non_primes |> Bloom.has_not?(89591) # prime
non_primes |> Bloom.has_not?(84949) # no prime
# Let's write it to disk:
filename = "prime_filter.bloom"
file = File.stream!(filename, [:delayed_write, :read_ahead, :binary], 8096)
(Bloom.stream(non_primes)
|> Stream.into(file)
|> Stream.run)
# Let's inspect the size of the filter:
File.lstat!(filename).size
# We can also read from disk:
{:ok, new_non_primes} = Bloom.from_stream(file)
new_non_primes |> Bloom.has_not?(12) # not a prime
new_non_primes |> Bloom.has_not?(11) # prime
# Works!
# Maybe a bit high, 64 KB for 100_000 Numbers
# is not that much.
Bloom.false_positive_probability(non_primes)
# Let's reestimate .
# There are 9_592 primes below 100_000, so this
# should yield about 100_000 - 9_592 = 90_408.
Bloom.estimate_count(non_primes)
Side note: If you want to check primes, google
search for Miller–Rabin primality test.