Refactored the quantiles documents
leerho committed Sep 24, 2024
1 parent 4ce7909 commit e7199f8
Showing 4 changed files with 126 additions and 65 deletions.
98 changes: 73 additions & 25 deletions docs/KLL/KLLSketch.md
@@ -19,42 +19,75 @@ layout: doc_page
specific language governing permissions and limitations
under the License.
-->
## Contents
<!-- TOC -->
* [Introduction to the Quantile Sketches](https://datasketches.apache.org/docs/QuantilesAll/QuantilesOverview.html)
* [Kll Sketch](#kll-sketch)
* [Comparing the KllSketches with the original classic Quantiles Sketches](#comparing)
* [Plots for KllDoublesSketch vs. classic Quantiles DoublesSketch](#plots)
* [Simple Java KLL Example](#simple-example)
* [KLL Accuracy And Size](https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html)
* [Understanding KLL Bounds](https://datasketches.apache.org/docs/KLL/UnderstandingKLLBounds.html)
* Examples
* [KLL Sketch C++ Example](https://datasketches.apache.org/docs/KLL/KLLCppExample.html)
* Tutorials
* [Sketching Quantiles and Ranks Tutorial](https://datasketches.apache.org/docs/QuantilesAll/SketchingQuantilesAndRanksTutorial.html)
* Theory
* [Optimal Quantile Approximation in Streams](https://github.com/apache/datasketches-website/tree/master/docs/pdf/Quantiles_KLL.pdf)
* [References](https://datasketches.apache.org/docs/QuantilesAll/QuantilesReferences.html)
<!-- TOC -->

<a id="kll-sketch"></a>
## KLL Sketch

Implementation of a very compact quantiles sketch with lazy compaction scheme and nearly optimal accuracy per bit.
See <a href="https://arxiv.org/abs/1603.05346v2">Optimal Quantile Approximation in Streams, by Zohar Karnin, Kevin Lang, Edo Liberty</a>.
The name KLL is composed of the initial letters of the last names of the authors.
This is an implementation of a very compact quantiles sketch with a lazy compaction scheme and nearly optimal accuracy per bit of storage. The underlying theoretical work is the paper
<a href="https://arxiv.org/abs/1603.05346v2">Optimal Quantile Approximation in Streams, by Zohar Karnin, Kevin Lang, Edo Liberty</a>. The name KLL is composed of the initial letters of the authors' last names.

The usage of KllSketch is very similar to the classic quantiles DoublesSketch.
This implementation includes 16 variations of the KLL Sketch, including a base KllSketch for methods common to all sketches. The implementation variations are across 3 different dimensions:

* The key feature of this sketch is its compactness for a given accuracy.
* It is separately implemented for both float and double values and can be configured for use on-heap or off-heap (Direct mode).
* The parameter K that affects the accuracy and the size of the sketch is not restricted to powers of 2.
* The default of 200 was chosen to yield approximately the same normalized rank error (1.65%) as the classic quantiles DoublesSketch (K=128, error 1.73%).
* Input type: double, float, long, item (generic)
* Memory type: heap, direct (off-heap)
* Stored Size: compact (read-only), updatable

### Java example
The one exception is that the KllItemsSketch is not available in direct, updatable form.
The classes are organized in an inheritance hierarchy as follows:

```
import org.apache.datasketches.kll.KllFloatsSketch;
* Public KllSketch
* Public KllDoublesSketch
* KllHeapDoublesSketch
* KllDirectDoublesSketch
* KllDirectCompactDoublesSketch

KllFloatsSketch sketch = KllFloatsSketch.newHeapInstance();
int n = 1000000;
for (int i = 0; i < n; i++) {
sketch.update(i);
}
float median = sketch.getQuantile(0.5);
double rankOf1000 = sketch.getRank(1000);
```
* Public KllFloatsSketch
* KllHeapFloatsSketch
* KllDirectFloatsSketch
* KllDirectCompactFloatsSketch

* Public KllItemsSketch<T>
* KllHeapItemsSketch<T>
* KllDirectCompactItemsSketch<T>

* Public KllLongsSketch
* KllHeapLongsSketch
* KllDirectLongsSketch
* KllDirectCompactLongsSketch

The internal package-private variations are constructed using static factory methods from the 4 outer public classes for doubles, floats, items, and longs, respectively.

<a id="comparing"></a>
### Comparing the KLL Sketches with the original classic Quantiles Sketches

### Differences of KllSketch from the original quantiles DoublesSketch
The usage of KllDoublesSketch is very similar to the classic quantiles DoublesSketch.

* KLL has a smaller size for the same accuracy.
* KLL is slightly faster to update.
* The KLL parameter K doesn't have to be a power of 2.
* KLL operates with either float values or double values.
* The key feature of the KLL sketch is its compactness for a given accuracy. KLL has a much smaller size for the same accuracy (see the plots below).
* The KLL parameter K, which affects accuracy and size, doesn't have to be a power of 2. The default K of 200 was chosen to yield approximately the same normalized rank error (1.65%) as the classic quantiles DoublesSketch (K=128, error 1.73%).
* The classic quantiles sketch only has double and item (generic) input types, while KLL (as mentioned above) is implemented with double, float, long, and item (generic) types.
* KLL uses a merge method rather than a union object.

The starting point for the comparison is setting K in such a way that rank error would be approximately the same. As pointed out above, the default K for both sketches should achieve this. Here is the comparison of the single-sided normalized rank error (getRank() method) for the default K:
<a id="plots"></a>
### Plot Comparisons of KllDoublesSketch vs. classic Quantiles DoublesSketch

The starting point for the plot comparisons is choosing K for each sketch so that the rank error is approximately the same. As pointed out above, the default K of each sketch (200 for KLL, 128 for the classic DoublesSketch) should achieve this. Here is the comparison of the single-sided normalized rank error (getRank() method) for the default K:

<img class="doc-img-full" src="{{site.docs_img_dir}}/kll/kll200-vs-ds128-rank-error.png" alt="RankError" />

@@ -75,3 +108,18 @@ Below is the accuracy per byte measure (the higher the better). Suppose rank err
Below is the update() method speed:

<img class="doc-img-full" src="{{site.docs_img_dir}}/kll/kll200-vs-ds128-update.png" alt="UpdateTime" />

<a id="simple-example"></a>
### Simple Java KLL Floats example

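```
import org.apache.datasketches.kll.KllFloatsSketch;

public class KllFloatsExample {
  public static void main(String[] args) {
    // Create an on-heap sketch with the default K of 200.
    KllFloatsSketch sketch = KllFloatsSketch.newHeapInstance();
    int n = 1000000;
    for (int i = 0; i < n; i++) {
      sketch.update(i); // each int is widened to float
    }
    float median = sketch.getQuantile(0.5);   // approximate median of the stream
    double rankOf1000 = sketch.getRank(1000); // approximate normalized rank of the value 1000
  }
}
```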
37 changes: 35 additions & 2 deletions docs/Quantiles/ClassicQuantilesSketch.md
@@ -19,8 +19,36 @@ layout: doc_page
specific language governing permissions and limitations
under the License.
-->
## Contents
<!-- TOC -->
* [Introduction to the Quantile Sketches](https://datasketches.apache.org/docs/QuantilesAll/QuantilesOverview.html)
* [Classic Quantiles Sketch](#classic-quantiles-sketch)
* [Classic Quantiles Sketch Overview](https://datasketches.apache.org/docs/Quantiles/QuantilesSketchOverview.html)
* [Accuracy and Size](#accuracy-and-size)
* [Absolute vs Relative Error](#absolute-vs-relative-error)
* [Quantiles Sketches and Data Independence](#data-independence)
* [Accuracy Table](#accuracy-table)
* [Accuracy Plots](#accuracy-plots)
* Examples
* [Classic Quantiles Java Example](https://datasketches.apache.org/docs/Quantiles/QuantilesJavaExample.html)
* [Classic Quantiles Pig UDFs](https://datasketches.apache.org/docs/Quantiles/QuantilesPigUDFs.html)
* [Classic Quantiles Hive UDFs](https://datasketches.apache.org/docs/Quantiles/QuantilesHiveUDFs.html)
* Classic Quantiles Studies
* [Druid Approximate Histogram](https://datasketches.apache.org/docs/QuantilesStudies/DruidApproxHistogramStudy.html)
* [Moments Sketch Study](https://datasketches.apache.org/docs/QuantilesStudies/MomentsSketchStudy.html)
* [Quantiles StreamA Study](https://datasketches.apache.org/docs/QuantilesStudies/QuantilesStreamAStudy.html)
* [Exact Quantiles for Studies](https://datasketches.apache.org/docs/QuantilesStudies/ExactQuantiles.html)
* Tutorials
* [Sketching Quantiles and Ranks Tutorial](https://datasketches.apache.org/docs/QuantilesAll/SketchingQuantilesAndRanksTutorial.html)
* Theory
* [Relative Error Streaming Quantiles](https://arxiv.org/abs/2004.01668)
* [More References](https://datasketches.apache.org/docs/QuantilesAll/QuantilesReferences.html)
<!-- TOC -->

<a id="classic-quantiles-sketch"></a>
# Classic Quantiles Sketch

<a id="accuracy-and-size"></a>
## Quantiles Sketches Accuracy and Size
Please review the Quantiles [Tutorial]({{site.docs_dir}}/QuantilesAll/SketchingQuantilesAndRanksTutorial.html).

@@ -29,21 +57,23 @@ the overall size of the sketch.

Accuracy for quantiles sketches is specified and measured with respect to the *rank* only, not the values.
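
To make "measured with respect to the rank" concrete, the following is a minimal sketch that computes the exact normalized rank of a value in a sorted array and the absolute difference from an estimated rank. The class and method names (`NormalizedRankDemo`, `exactRank`, `rankError`) are hypothetical and not part of the library, and the exclusive rank criterion (fraction of values strictly less than the query) is an assumption made for illustration.

```java
final class NormalizedRankDemo {

  // Exact normalized rank of v: the fraction of values strictly less than v.
  // Assumes 'sorted' is non-empty and sorted in ascending order.
  static double exactRank(double[] sorted, double v) {
    int lo = 0;
    int hi = sorted.length;
    while (lo < hi) {                 // binary search for the first index with sorted[i] >= v
      int mid = (lo + hi) >>> 1;
      if (sorted[mid] < v) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return (double) lo / sorted.length;
  }

  // Rank error is the absolute difference between two normalized ranks.
  static double rankError(double estimatedRank, double trueRank) {
    return Math.abs(estimatedRank - trueRank);
  }

  public static void main(String[] args) {
    double[] sorted = {1.0, 2.0, 3.0, 4.0};
    double trueRank = exactRank(sorted, 3.0);
    System.out.println("true rank = " + trueRank);
    System.out.println("rank error = " + rankError(0.51, trueRank));
  }
}
```

Note that the error is computed between ranks, never between the values themselves, which is the point of the sentence above.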

<a id="absolute-vs-relative-error"></a>
### Absolute vs Relative Error
The Quantiles/DoublesSketch and the KLL Sketch have *absolute error*. For example, a specified accuracy of 1% at the median (rank = 0.50) means that the true value (if you could extract it from the set) should be
between *getQuantile(0.49)* and *getQuantile(0.51)*. This same 1% error applied at a rank of 0.95 means that the true value should be between *getQuantile(0.94)* and *getQuantile(0.96)*. In other words, the error is a fixed +/- epsilon for the entire range of rank values.
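
The fixed +/- epsilon interval described above can be illustrated with a small, library-independent sketch. The helper name `rankBounds` and the clamping to [0, 1] are assumptions made for illustration only; the sketches expose their own bound methods, which are not shown here.

```java
final class AbsoluteRankErrorDemo {

  // Returns {lower, upper} normalized rank bounds for a query rank with absolute error epsilon.
  // The bounds are clamped to the valid rank range [0, 1].
  static double[] rankBounds(double rank, double epsilon) {
    double lower = Math.max(0.0, rank - epsilon);
    double upper = Math.min(1.0, rank + epsilon);
    return new double[] {lower, upper};
  }

  public static void main(String[] args) {
    double epsilon = 0.01; // the 1% accuracy used in the text
    for (double rank : new double[] {0.50, 0.95}) {
      double[] bounds = rankBounds(rank, epsilon);
      System.out.println("rank " + rank + " -> [" + bounds[0] + ", " + bounds[1] + "]");
    }
  }
}
```

The width of the interval is the same at rank 0.50 and at rank 0.95, which is what distinguishes absolute error from relative error.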

The ReqSketch, however, has relative rank error and the user can choose which end of the rank domain should have high accuracy. Refer to the sketch documentation for more information.


<a id="data-independence"></a>
## Quantiles Sketches and Data Independence
A *sketch* is an implementation of a *streaming algorithm*. By definition, a sketch has only one chance to examine each item of the stream. It is this property that makes a sketch a *streaming* algorithm and useful for real-time analysis of very large streams that may be impractical to actually store.

We also assume that the sketch knows nothing about the input data stream: its length, the range of the values or how the values are distributed. If the authors of a particular algorithm require the user to know any of the above attributes of the input data stream in order to "tune" the algorithm, the algorithm is not data independent.

The only thing the user needs to know is how to extract the values from the stream so that they can be fed into the sketch. It is reasonable that the user knows the *type* of values in the stream: e.g., are they alphanumeric strings, numeric strings, or numeric primitives. These properties may determine the type of sketch to use as well as how to extract the appropriate quantities to feed into the sketch.

## Accuracy Information for the org.apache.datasketches.quantiles Sketch Package
<a id="accuracy-table"></a>
## Accuracy Table for the Classic Quantiles Sketch
A <i>k</i> of 256 produces a normalized rank error of less than 1%.
For example, the median value returned from getQuantile(0.5) will be between the actual values
from the hypothetically sorted array of input values at normalized ranks of 0.49 and 0.51, with
@@ -95,6 +125,9 @@ Table Guide for Quantiles DoublesSketch Size in Bytes and Approximate Error:
4,294,967,295 | 3,744 7,200 13,856 26,656 51,232 98,336 188,448 360,480 688,160 1,310,752 2,490,400 4,718,624
</pre>

<a id="accuracy-plots"></a>
## Accuracy Plots

The following graphs illustrate the ability of the Quantiles DoublesSketch to characterize value distributions.

* 1024 (n) values (trueUnsortedValues) were generated from Random's nextGaussian(). These values were then sorted (trueSortedValues) and assigned
26 changes: 16 additions & 10 deletions docs/Quantiles/QuantilesSketchOverview.md
@@ -19,7 +19,15 @@ layout: doc_page
specific language governing permissions and limitations
under the License.
-->
# Quantiles Sketch Overview
## Contents

* [Classic Quantiles Sketch Overview](#overview)
* [Numeric Quantiles](#numeric-quantiles)
* [Extending Generic Quantiles Classes](#extending)
* [Implementation Notes](#notes)

<a id="overview"></a>
# Classic Quantiles Sketch Overview

Package: org.apache.datasketches.quantiles

@@ -30,12 +38,10 @@ Probability Mass Function, getPMF(), and the Cumulative Distribution Function, g

* **NOTE:** See also the <a href="{{site.docs_dir}}/KLL/KLLSketch.html">KLL Sketch</a>.

### Section links:
* [Section 1](#Section 1) Numeric Quantiles
* [Section 2](#Section 2) Extending Generic Quantiles Classes
* [Section 3](#Section 3) Implementation Notes

## <a name="Section 1"></a>Numeric Quantiles

<a id="numeric-quantiles"></a>
## Numeric Quantiles

Consider this real data example of a stream of 230 million time-spent events collected from one of our systems for a period of just 30 minutes. Each event records the amount of time in milliseconds that a user spends on a web page before moving to a different web page by taking some action, such as a click.

@@ -175,8 +181,8 @@ Code examples are best gleaned from the test code that exercises all the various
### END SKETCH SUMMARY
*/


## <a name="Section 2"></a>Extending Generic Quantiles Classes
<a id="extending"></a>
## Extending Generic Quantiles Classes

Any item type that is comparable, or for which you can create a Comparator, can also be analyzed by extending the abstract generic classes for that particular item.

@@ -240,8 +246,8 @@ Then obtain the split point values that equally partition the data into 10 parti
Using a simple binary search you can now split your data into the 10 partitions.
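
A minimal sketch of that binary-search step follows, using only the JDK. The class and method names are hypothetical, and the split point values are made up for illustration; in practice they would come from the sketch. It assumes 9 distinct interior split points in ascending order, which define 10 partitions, and that a value equal to a split point belongs to the upper partition.

```java
import java.util.Arrays;

final class PartitionBySplitPoints {

  // Returns the partition index in [0, splitPoints.length] for the given value.
  // splitPoints must be sorted ascending and contain distinct values.
  static int partitionIndex(double[] splitPoints, double value) {
    int pos = Arrays.binarySearch(splitPoints, value);
    // Found: the value equals splitPoints[pos] and goes to the partition above it.
    // Not found: binarySearch returns -(insertionPoint) - 1, so recover the insertion point.
    return pos >= 0 ? pos + 1 : -(pos + 1);
  }

  public static void main(String[] args) {
    // Hypothetical split points, not taken from real data.
    double[] splitPoints = {10, 20, 30, 40, 50, 60, 70, 80, 90};
    for (double v : new double[] {5, 10, 15, 95}) {
      System.out.println(v + " -> partition " + partitionIndex(splitPoints, v));
    }
  }
}
```

Each lookup costs O(log m) for m split points, so assigning a large data set to partitions stays cheap.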



## <a name="Section 3"></a>Implementation Notes
<a id="notes"></a>
## Implementation Notes

The quantiles algorithm is an implementation of the Low Discrepancy Mergeable Quantiles Sketch, using double values, described in section 3.2 of the journal version of the paper "Mergeable Summaries" by Agarwal, Cormode, Huang, Phillips, Wei, and Yi.
<a href="http://dblp.org/rec/html/journals/tods/AgarwalCHPWY13"></a> <!-- does not work with https -->
30 changes: 2 additions & 28 deletions src/main/resources/docgen/toc.json
@@ -53,36 +53,10 @@

{ "class":"Dropdown", "desc" : "Quantiles And Histograms", "array":
[
{"class":"Doc", "desc" : "Quantiles and Ranks Tutorial", "dir" : "QuantilesAll", "file": "SketchingQuantilesAndRanksTutorial"},
{"class":"Doc", "desc" : "Quantiles Overview", "dir" : "QuantilesAll", "file": "QuantilesOverview" },
{"class":"Doc", "desc" : "KLL Sketch Family", "dir" : "KLL", "file": "KLLSketch" },
{"class":"Doc", "desc" : "KLL Sketch Accuracy and Size", "dir" : "KLL", "file": "KLLAccuracyAndSize" },
{"class":"Doc", "desc" : "Understanding KLL Bounds", "dir" : "KLL", "file": "UnderstandingKLLBounds" },
{"class":"Doc", "desc" : "REQ Floats sketch", "dir" : "REQ", "file": "ReqSketch" },
{"class":"Doc", "desc" : "KLL Sketches", "dir" : "KLL", "file": "KLLSketch" },
{"class":"Doc", "desc" : "Classic Quantiles Sketches", "dir" : "Quantiles", "file": "ClassicQuantilesSketch" },

{ "class":"Dropdown", "desc" : "Quantiles Examples", "array":
[
{"class":"Doc", "desc" : "Quantiles Sketch Java Example", "dir" : "Quantiles", "file": "QuantilesJavaExample" },
{"class":"Doc", "desc" : "KLL Quantiles Sketch C++ Example", "dir" : "KLL", "file": "KLLCppExample" },
{"class":"Doc", "desc" : "Quantiles Sketch Pig UDFs", "dir" : "Quantiles", "file": "QuantilesPigUDFs" },
{"class":"Doc", "desc" : "Quantiles Sketch Hive UDFs", "dir" : "Quantiles", "file": "QuantilesHiveUDFs" },
]
},
{ "class":"Dropdown", "desc" : "Quantiles Studies", "array":
[
{"class":"Doc", "desc" : "Druid Approximate Histogram", "dir" : "QuantilesStudies", "file": "DruidApproxHistogramStudy" },
{"class":"Doc", "desc" : "Moments Sketch Study", "dir" : "QuantilesStudies", "file": "MomentsSketchStudy" },
{"class":"Doc", "desc" : "Quantiles StreamA Study", "dir" : "QuantilesStudies", "file": "QuantilesStreamAStudy" },
{"class":"Doc", "desc" : "Exact Quantiles for Studies", "dir" : "QuantilesStudies", "file": "ExactQuantiles" },
]
},
{ "class":"Dropdown", "desc" : "Quantiles Sketch Theory", "array":
[
{"class":"Doc", "desc" : "Optimal Quantile Approximation in Streams", "dir" : "", "file": "Quantiles_KLL", "pdf":"true" },
{"class":"Doc", "desc" : "Quantiles References", "dir" : "QuantilesAll", "file": "QuantilesReferences" },
]
},
{"class":"Doc", "desc" : "REQ Floats Sketch", "dir" : "REQ", "file": "ReqSketch" },
]
},

