Skip to content

An implementation of Chakraborty, Vinodchandran, and Meel's probabilistic solver for the Distinct Elements problem

Notifications You must be signed in to change notification settings

jparr721/distinction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Distinction

Distinction is a simple rust package which exports a single function which solves the Distinct Elements problem in optimal* space and time complexity from the work of Chakraborty, Vinodchandran, and Meel. The F0-Estimator algorithm efficiently estimates the number of distinct elements in a data stream using probabilistic techniques. It initializes a probability p and an empty set 𝒳, then iterates over the stream, probabilistically adding elements to 𝒳 with probability p. If the size of 𝒳 reaches a predefined threshold, elements are discarded with a 50% probability, and p is halved, this ensures that 𝒳 remains small, and we use the calculated value thresh as our upper bound on the size of the set. This process continues until the end of the stream, at which point the estimated number of distinct elements is calculated as the size of 𝒳 divided by p. This approach ensures space efficiency by keeping 𝒳 small, leveraging probabilistic sampling to handle large data streams effectively.

*My implementation is not the most optimal implementation of the algorithm due to performing the operations according to the paper semi-naively, so you could probably make something faster if you wanted to.

Example

use distinction::{Gen, find_n_distinct};
let mut gen = Gen::new(None);
let stream = vec![1, 10, 20, 10, 10, 30, 20, 10, 20, 20, 1, 1, 1];
let eps = 0.1;
let delta = 0.005;
let n_distinct = find_n_distinct(&stream, eps, delta, Some(gen));
assert_eq!(n_distinct, 4);

About

An implementation of Chakraborty, Vinodchandran, and Meel's probabilistic solver for the Distinct Elements problem

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published