Replies: 3 comments
-
I think we can first focus on one of the storage format (e.g. bitmap-like) and make it extensible. |
Beta Was this translation helpful? Give feedback.
0 replies
-
Some other differents: Redis uses "cached" cardinality during get the counting. Maybe we can leave it as a choice when meet the problem of scaling the getting procedure. Redis sparse representation uses a different way from Presto and Google paper.
|
Beta Was this translation helpful? Give feedback.
0 replies
-
I've update the HLL with only redis-style dense support, cc @git-hulk @PragmaTwice @tutububug |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
Basics
Redis HyperLogLog is a probabilistic data structure that estimates the cardinality of a set. The idea comes from original paper [1] and paper from Google. It is particularly useful for applications that require the estimation of unique elements in massive datasets, such as network traffic analysis, data warehousing, and large-scale databases.
The principle behind HyperLogLog is based on the idea that the number of leading zeros in the binary representation of a hash value can be used to infer the size of the set. It includes:
For the remaining bits of the hash value (after the first p bits), the algorithm counts the number of leading zeros. Each entry in the array keeps track of the maximum observed for all elements that hash to the same index. If the newly encountered value is greater than the current value in the register, it updates the register with this new value.
Sparse Representation
The Sparse Representation is an optimization introduced to improve the memory efficiency of the HyperLogLog algorithm, especially when dealing with small cardinalities. Concept of Sparse Representation: Instead of maintaining a dense array of registers for the entire hash space, the Sparse Representation only stores the non-zero counts, significantly reducing memory usage when the actual number of unique elements is small compared to the size of the hash space.
It uses a list to store pairs of
(index, rho(w))
, whereindex
corresponds to the hash value's position in the substream, andrho(w)
is the count of leading zeros plus one for that substream's hash value.Redis Required syntax
The redis supports
pfadd
,pfcount
,pfmerge
here, so generally, we need support:Implement detail
Hash Function
In the HyperLogLog algorithm, the hash function is a critical component that influences the accuracy and efficiency of the cardinality estimation. The function should be Uniform Distribution.
Redis chooses a modified MurmurHash2 function, since MurmurHash2 says [3] "It will not produce the same results on little-endian and big-endian machines."
We can:
Storage format
In Redis, bitmap and HyperLogLog are all regard as "string". The "dense" HyperLogLog in redis
But in kvrocks, we should take them as different thing:
Generally, in HyperLogLog, we need:
For sparse, I think string like is better. For sparse repr, a "size" for number of slots should be used.
Tuning
Redis uses a fixed-size estimation: 6bit per register and 14bits index length(2^14 slots). Presto will tune and allowing self-defined slot number(but only uses 4bit for per-slot). Should we allow user to define more detail about HLL, or just follow the redis way?
[1] http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
[2] http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf
[3] https://github.com/aappleby/smhasher/blob/master/src/MurmurHash2.cpp#L13
Beta Was this translation helpful? Give feedback.
All reactions