Skip to content
This repository has been archived by the owner on Dec 17, 2024. It is now read-only.

Latest commit

 

History

History
170 lines (143 loc) · 8.37 KB

README.md

File metadata and controls

170 lines (143 loc) · 8.37 KB

HyperLogLog for Erlang

This is an implementation of the HyperLogLog algorithm in Erlang. Using HyperLogLog you can estimate the cardinality of very large data sets using constant memory. The relative error is 1.04 * sqrt(2^P). When creating a new HyperLogLog filter, you provide the precision P, allowing you to trade memory for accuracy. The union of two filters is lossless.

In practice this allows you to build efficient analytics systems. For example, you can create a new filter in each mapper and feed it a portion of your dataset while the reducers simply union together all filters they receive. The filter you end up with is exactly the same filter as if you would sequentially insert all data into a single filter.

In addition to the base algorithm, we have implemented the bias correction from HLL++ as the described in the excellent paper by Google. Bias correction greatly improves the estimates for lower cardinalities.

Usage

1> hyper:insert(<<"foobar">>, hyper:insert(<<"quux">>, hyper:new(4))).
{hyper,4,
       {hyper_binary,{dense,<<0,0,0,0,0,0,0,0,64,0,0,0>>,
                            [{8,1}],
                            1,16}}}

2> hyper:card(v(-1)).
2.136502281992361

The errors introduced by estimations can be seen in this example:

3> random:seed(1,2,3).
undefined
4> Run = fun (P, Card) -> hyper:card(lists:foldl(fun (_, H) -> Int = random:uniform(10000000000000), hyper:insert(<<Int:64/integer>>, H) end, hyper:new(P), lists:seq(1, Card))) end.
#Fun<erl_eval.12.80484245>
5> Run(12, 10000).
9992.846462080579
6> Run(14, 10000).
10055.568563614219
7> Run(16, 10000).
10007.654167606248

A filter can be persisted and read later. The serialized struct is formatted for usage with jiffy:

8> Filter = hyper:insert(<<"foo">>, hyper:new(4)).
{hyper,4,
       {hyper_binary,{dense,<<4,0,0,0,0,0,0,0,0,0,0,0>>,[],0,16}}}
9> Filter =:= hyper:from_json(hyper:to_json(Filter)).
true

You can select a different backend. See below for a description of why you might want to do so. They serialize in exactly the same way, but can't be mixed in memory.

1> Gb = hyper:insert(<<"foo">>, hyper:new(4, hyper_gb)).
{hyper,4,{hyper_gb,{{1,{0,1,nil,nil}},16}}}
2> B = hyper:insert(<<"foo">>, hyper:new(4, hyper_binary)).
{hyper,4,
       {hyper_binary,{dense,<<4,0,0,0,0,0,0,0,0,0,0,0>>,[],0,16}}}
3> hyper:to_json(Gb) =:= hyper:to_json(B).
true
4> hyper:union(Gb, B).
** exception error: no case clause matching [{4,hyper_binary},{4,hyper_gb}]
     in function  hyper:union/1 (src/hyper.erl, line 65)

Is it any good?

Yes. At Game Analytics we use it extensively.

Backends

Effort has been spent on implementing different backends in the pursuit of finding the right performance trade-off. The estimate will always be the same, regardless of backend. A simple performance comparison can be seen by running make perf_report, see below for the results from an i7-3770 at 3.4 GHz. Fill rate refers to how many registers has a value other than 0.

  • hyper_binary: Fixed memory usage (6 bits * 2^P), fastest on insert, union, cardinality and serialization. Best default choice.

  • hyper_bisect: Lower memory usage at lower fill rates (3 bytes per used entry), slightly slower than hyper_binary for everything. Switches to a structure similar to hyper_binary when it would save memory. Room for further optimization.

  • hyper_gb: Fast inserts, very fast unions and reasonable memory usage at low fill rates. Unreasonable memory usage at high fill rates.

  • hyper_array: Cardinality estimation is constant, but slower than hyper_gb for low fill rates. Uses much more memory at lower fill rates, but stays constant from 25% and upwards.

  • hyper_binary_rle: Dud

You can also implement your own backend. In hyper_test theres a bunch of tests run for all backends, including some PropEr tests. The test suite will ensure your backend gives correct estimates and correctly encodes/decodes the serialized filters.

$ make perf_report
...

module       P        card   fill      bytes  insert us   union ms    card ms    json ms
hyper_gb     15          1   0.00         64     301.90       0.00       0.10       2.69
hyper_gb     15        100   0.00       3984       1.34       0.05       0.05       6.13
hyper_gb     15        500   0.02      19784       1.53       0.27       0.12       8.67
hyper_gb     15       1000   0.03      39384       1.72       0.53       0.19       8.67
hyper_gb     15       2500   0.07      96384       1.84       1.49       0.40      10.67
hyper_gb     15       5000   0.14     185224       1.99       3.24       0.71      12.22
hyper_gb     15      10000   0.26     344464       2.10       6.80       1.31      15.09
hyper_gb     15      15000   0.37     481664       2.02      10.07       1.91      17.99
hyper_gb     15      25000   0.53     698344       2.08      16.42       2.67      18.71
hyper_gb     15      50000   0.78    1027504       1.96      31.49       4.29      18.87
hyper_gb     15     100000   0.95    1248144       1.78      49.90       5.20      17.92
hyper_gb     15    1000000   1.00    1310744       1.10     108.38       5.72      18.96
hyper_array  15          1   0.00        520       4.10       0.00       4.74       3.83
hyper_array  15        100   0.00      19536       1.56       0.10       4.71       3.99
hyper_array  15        500   0.02      69328       1.44       0.43       4.63       4.46
hyper_array  15       1000   0.03     107760       1.51       0.79       4.68       4.29
hyper_array  15       2500   0.07     188384       1.35       1.79       4.70       5.16
hyper_array  15       5000   0.14     261520       1.31       3.27       4.72       5.33
hyper_array  15      10000   0.26     308072       1.21       5.45       4.99       6.90
hyper_array  15      15000   0.37     320128       1.22       7.34       4.96       7.72
hyper_array  15      25000   0.53     323384       1.21      11.07       5.18       8.42
hyper_array  15      50000   0.78     323560       1.04      17.90       5.51       8.08
hyper_array  15     100000   0.95     323560       1.00      26.93       5.60       7.70
hyper_array  15    1000000   1.00     323560       0.72      51.65       5.77       7.77
hyper_bisect 15          1   0.00          3       6.10       0.00       0.05       7.26
hyper_bisect 15        100   0.00        297       1.62       0.08       0.12      17.42
hyper_bisect 15        500   0.02       1482       1.72       0.42       0.59      21.30
hyper_bisect 15       1000   0.03       2952       1.83       0.95       1.10      23.21
hyper_bisect 15       2500   0.07       7227       1.97       2.63       2.53      25.91
hyper_bisect 15       5000   0.14      13890       2.23       6.50       5.06      28.80
hyper_bisect 15      10000   0.26      25833       2.74      18.21       9.18      31.61
hyper_bisect 15      15000   0.37      32768       3.67      33.29       4.40       4.46
hyper_bisect 15      25000   0.53      32768       3.11      78.77       4.60       5.17
hyper_bisect 15      50000   0.78      32768       2.53     190.66       4.90       4.89
hyper_bisect 15     100000   0.95      32768       2.05     381.11       5.07       4.43
hyper_bisect 15    1000000   1.00      32768       0.85      27.17       5.65       4.59
hyper_binary 15          1   0.00         88       3.40       0.00       6.06       2.23
hyper_binary 15        100   0.00       4048       0.63       0.01       5.91       2.38
hyper_binary 15        500   0.02      20048       0.50       0.01       5.67       2.58
hyper_binary 15       1000   0.02      24576       2.13       0.00       2.72       1.33
hyper_binary 15       2500   0.07      24576       1.54       1.97       2.72       1.95
hyper_binary 15       5000   0.12      24576       1.23       2.40       2.71       2.77
hyper_binary 15      10000   0.26      24576       1.10      11.16       2.95       4.46
hyper_binary 15      15000   0.34      24576       1.11      12.30       2.75       5.48
hyper_binary 15      25000   0.50      24576       0.95      11.97       2.79       5.83
hyper_binary 15      50000   0.76      24576       0.92      13.55       2.81       5.65
hyper_binary 15     100000   0.95      24576       0.79      11.74       2.59       5.16
hyper_binary 15    1000000   1.00      24576       0.55      13.88       2.64       5.11