A modern C++ implementation of a merging t-digest
data structure.
t-digests
are data structures that provide highly accurate statistics for a large number of inputs in a small memory footprint. Things like minimum, maximum, average, quantile, and continuous distribution are readily calculated. Think improved histogram: compact way of accurately summarizing data where the bucket sizes need not be set in advance.
- No runtime memory allocation.
- Strict, user-defined upper bound on structure size.
- Written in modern C++ (c++17).
- Templated
mean
andweight
data types - save memory for smaller input ranges with a smaller data type. - Speed (average 61ns insertion time per element with a c. 2019 Intel i7)
- A C++17 compatible compiler (developed with
clang 7
using--std=c++17
). CMake
(for building utility and running tests).
#include "digestible/digestible.h"
using namespace std;
using namespace digestible;
// Structure will have at most 20 elements.
// Also default to values of type float and weights of type unsigned.
tdigest digest(20);
// Add some data into the structure.
digest.insert(5);
digest.insert(10);
digest.insert(4);
// digestible buffers input data; explicitly merge it before querying statistics.
digest.merge();
printf("50th quantile is: %f\n", digest.quantile(50));
Additional examples can be found in the tests
directory.
To assist in integrating digestible
in your implementation, please see utility
directory.
As the lone parameter to the digestible
constructor, compression factor requires some discussion. Internally t-digests
use the compression factor to distribute values throughout the input range. When combined with an internal distribution function a t-digest
will distribute data such that the head and tail have fewer points and the middle has significantly more. Higher compression factors will yield better accuracy at the expense of increased memory usage. The digestible utility
can be used to help strike this balance.
For a given compression factor f
, value type size v
, and weight type size w
, the approximate t-digest
size in bytes is:
f * max(v, w) * 8 + 64
This work would not be possible without the original algorithm and reference implementation by tdunning. A local copy of the original t-digest
paper can be found here.