From e7199f886c0800da1399203a82f6f280784fc16a Mon Sep 17 00:00:00 2001
From: Lee Rhodes
Date: Mon, 23 Sep 2024 18:45:41 -0700
Subject: [PATCH] Refactored the quantiles documents

---
 docs/KLL/KLLSketch.md                     | 98 +++++++++++++++++------
 docs/Quantiles/ClassicQuantilesSketch.md  | 37 ++++++++-
 docs/Quantiles/QuantilesSketchOverview.md | 26 +++---
 src/main/resources/docgen/toc.json        | 30 +------
 4 files changed, 126 insertions(+), 65 deletions(-)

diff --git a/docs/KLL/KLLSketch.md b/docs/KLL/KLLSketch.md
index cded5c545..900c8f6e8 100644
--- a/docs/KLL/KLLSketch.md
+++ b/docs/KLL/KLLSketch.md
@@ -19,42 +19,75 @@ layout: doc_page
     specific language governing permissions and limitations
     under the License.
 -->
+## Contents
+
+* [Introduction to the Quantile Sketches](https://datasketches.apache.org/docs/QuantilesAll/QuantilesOverview.html)
+* [KLL Sketch](#kll-sketch)
+    * [Comparing the KllSketches with the original classic Quantiles Sketches](#comparing)
+    * [Plots for KllDoublesSketch vs. classic Quantiles DoublesSketch](#plots)
+    * [Simple Java KLL Example](#simple-example)
+* [KLL Accuracy And Size](https://datasketches.apache.org/docs/KLL/KLLAccuracyAndSize.html)
+* [Understanding KLL Bounds](https://datasketches.apache.org/docs/KLL/UnderstandingKLLBounds.html)
+* Examples
+    * [KLL Sketch C++ Example](https://datasketches.apache.org/docs/KLL/KLLCppExample.html)
+* Tutorials
+    * [Sketching Quantiles and Ranks Tutorial](https://datasketches.apache.org/docs/QuantilesAll/SketchingQuantilesAndRanksTutorial.html)
+* Theory
+    * [Optimal Quantile Approximation in Streams](https://github.com/apache/datasketches-website/tree/master/docs/pdf/Quantiles_KLL.pdf)
+    * [References](https://datasketches.apache.org/docs/QuantilesAll/QuantilesReferences.html)
+
+
+
 ## KLL Sketch
-Implementation of a very compact quantiles sketch with lazy compaction scheme and nearly optimal accuracy per bit.
-See Optimal Quantile Approximation in Streams, by Zohar Karnin, Kevin Lang, Edo Liberty.
-The name KLL is composed of the initial letters of the last names of the authors.
+This is an implementation of a very compact quantiles sketch with a lazy compaction scheme and nearly optimal accuracy per bit of storage. The underlying theoretical work is the paper
+Optimal Quantile Approximation in Streams, by Zohar Karnin, Kevin Lang, Edo Liberty. The name KLL is composed of the initial letters of the last names of the authors.

-The usage of KllSketch is very similar to the classic quantiles DoublesSketch.
+This implementation includes 16 variations of the KLL Sketch, including a base KllSketch for methods common to all sketches. The implementation variations are across 3 different dimensions:

-* The key feature of this sketch is its compactness for a given accuracy.
-* It is separately implemented for both float and double values and can be configured for use on-heap or off-heap (Direct mode).
-* The parameter K that affects the accuracy and the size of the sketch is not restricted to powers of 2.
-* The default of 200 was chosen to yield approximately the same normalized rank error (1.65%) as the classic quantiles DoublesSketch (K=128, error 1.73%).
+* Input type: double, float, long, item (generic)
+* Memory type: heap, direct (off-heap)
+* Stored Size: compact (read-only), updatable

-### Java example
+The one exception is that KllItemsSketch is not available in direct, updatable form.
+The classes are organized in an inheritance hierarchy as follows:

-```
-import org.apache.datasketches.kll.KllFloatsSketch;
+* Public KllSketch
+    * Public KllDoublesSketch
+        * KllHeapDoublesSketch
+        * KllDirectDoublesSketch
+        * KllDirectCompactDoublesSketch

-KllFloatsSketch sketch = KllFloatsSketch.newHeapInstance();
-int n = 1000000;
-for (int i = 0; i < n; i++) {
-  sketch.update(i);
-}
-float median = sketch.getQuantile(0.5);
-double rankOf1000 = sketch.getRank(1000);
-```
+    * Public KllFloatsSketch
+        * KllHeapFloatsSketch
+        * KllDirectFloatsSketch
+        * KllDirectCompactFloatsSketch
+
+    * Public KllItemsSketch
+        * KllHeapItemsSketch
+        * KllDirectCompactItemsSketch
+
+    * Public KllLongsSketch
+        * KllHeapLongsSketch
+        * KllDirectLongsSketch
+        * KllDirectCompactLongsSketch
+
+The internal package-private variations are constructed using static factory methods from the 4 outer public classes for doubles, floats, items, and longs, respectively.
+
+
+### Comparing the KLL Sketches with the original classic Quantiles Sketches

-### Differences of KllSketch from the original quantiles DoublesSketch
+The usage of KllDoublesSketch is very similar to the classic quantiles DoublesSketch.

-* KLL has a smaller size for the same accuracy.
-* KLL is slightly faster to update.
-* The KLL parameter K doesn't have to be power of 2.
-* KLL operates with either float values or double values.
+* The key feature of the KLL sketch is its compactness for a given accuracy. KLL has a much smaller size for the same accuracy (see the plots below).
+* The KLL parameter K, which affects accuracy and size, doesn't have to be a power of 2. The default K of 200 was chosen to yield approximately the same normalized rank error (1.65%) as the classic quantiles DoublesSketch (K=128, error 1.73%).
+* The classic quantiles sketch only has double and item (generic) input types, while KLL (as mentioned above) is implemented with double, float, long, and item (generic) types.
 * KLL uses a merge method rather than a union object.
-The starting point for the comparison is setting K in such a way that rank error would be approximately the same. As pointed out above, the default K for both sketches should achieve this. Here is the comparison of the single-sided normalized rank error (getRank() method) for the default K:
+
+### Plot Comparisons of KllDoublesSketch vs. classic Quantiles DoublesSketch
+
+The starting point for the plot comparisons is setting K in such a way that rank error would be approximately the same. As pointed out above, the default K for both sketches should achieve this. Here is the comparison of the single-sided normalized rank error (getRank() method) for the default K:
 RankError
@@ -75,3 +108,18 @@ Below is the accuracy per byte measure (the higher the better). Suppose rank err
 Below is the update() method speed:
 UpdateTime
+
+
+### Simple Java KLL Floats example
+
+```
+import org.apache.datasketches.kll.KllFloatsSketch;
+
+KllFloatsSketch sketch = KllFloatsSketch.newHeapInstance();
+int n = 1000000;
+for (int i = 0; i < n; i++) {
+  sketch.update(i);
+}
+float median = sketch.getQuantile(0.5);
+double rankOf1000 = sketch.getRank(1000);
+```
diff --git a/docs/Quantiles/ClassicQuantilesSketch.md b/docs/Quantiles/ClassicQuantilesSketch.md
index cc2931f88..be17afec6 100644
--- a/docs/Quantiles/ClassicQuantilesSketch.md
+++ b/docs/Quantiles/ClassicQuantilesSketch.md
@@ -19,8 +19,36 @@ layout: doc_page
     specific language governing permissions and limitations
     under the License.
 -->
+## Contents
+
+* [Introduction to the Quantile Sketches](https://datasketches.apache.org/docs/QuantilesAll/QuantilesOverview.html)
+* [Classic Quantiles Sketch](#classic-quantiles-sketch)
+    * [Classic Quantiles Sketch Overview](https://datasketches.apache.org/docs/Quantiles/QuantilesSketchOverview.html)
+    * [Accuracy and Size](#accuracy-and-size)
+    * [Absolute vs Relative Error](#absolute-vs-relative-error)
+    * [Quantiles Sketches and Data Independence](#data-independence)
+    * [Accuracy Table](#accuracy-table)
+    * [Accuracy Plots](#accuracy-plots)
+* Examples
+    * [Classic Quantiles Java Example](https://datasketches.apache.org/docs/Quantiles/QuantilesJavaExample.html)
+    * [Classic Quantiles Pig UDFs](https://datasketches.apache.org/docs/Quantiles/QuantilesPigUDFs.html)
+    * [Classic Quantiles Hive UDFs](https://datasketches.apache.org/docs/Quantiles/QuantilesHiveUDFs.html)
+* Classic Quantiles Studies
+    * [Druid Approximate Histogram](https://datasketches.apache.org/docs/QuantilesStudies/DruidApproxHistogramStudy.html)
+    * [Moments Sketch Study](https://datasketches.apache.org/docs/QuantilesStudies/MomentsSketchStudy.html)
+    * [Quantiles StreamA Study](https://datasketches.apache.org/docs/QuantilesStudies/QuantilesStreamAStudy.html)
+    * [Exact Quantiles for Studies](https://datasketches.apache.org/docs/QuantilesStudies/ExactQuantiles.html)
+* Tutorials
+    * [Sketching Quantiles and Ranks Tutorial](https://datasketches.apache.org/docs/QuantilesAll/SketchingQuantilesAndRanksTutorial.html)
+* Theory
+    * [Relative Error Streaming Quantiles](https://arxiv.org/abs/2004.01668)
+    * [More References](https://datasketches.apache.org/docs/QuantilesAll/QuantilesReferences.html)
+
+
+
 # Classic Quantiles Sketch
+
 ## Quantiles Sketches Accuracy and Size
 Please review the Quantiles [Tutorial]({{site.docs_dir}}/QuantilesAll/SketchingQuantilesAndRanksTutorial.html).
@@ -29,13 +57,14 @@ the overall size of the sketch.
 Accuracy for quantiles sketches is specified and measured with respect to the *rank* only, not the values.
+
 ### Absolute vs Relative Error
 The Quantiles/DoublesSketch and the KLL Sketch have *absolute error*. For example, a specified accuracy of 1% at the median (rank = 0.50) means that the true value (if you could extract it from the set) should be between *getQuantile(0.49)* and *getQuantile(0.51)*. This same 1% error applied at a rank of 0.95 means that the true value should be between *getQuantile(0.94)* and *getQuantile(0.96)*. In other words, the error is a fixed +/- epsilon for the entire range of rank values. The ReqSketch, however, has relative rank error and the user can choose which end of the rank domain should have high accuracy. Refer to the sketch documentation for more information.
-
+
 ## Quantiles Sketches and Data Independence
 A *sketch* is an implementation of a *streaming algorithm*. By definition, a sketch has only one chance to examine each item of the stream. It is this property that makes a sketch a *streaming* algorithm and useful for real-time analysis of very large streams that may be impractical to actually store.
@@ -43,7 +72,8 @@ We also assume that the sketch knows nothing about the input data stream: its le
 The only thing the user needs to know is how to extract the values from the stream so that they can be fed into the sketch. It is reasonable that the user knows the *type* of values in the stream: e.g., are they alphanumeric strings, numeric strings, or numeric primitives. These properties may determine the type of sketch to use as well as how to extract the appropriate quantities to feed into the sketch.
-## Accuracy Information for the org.apache.datasketches.quantiles Sketch Package
+
+## Accuracy Table for the Classic Quantiles Sketch
 A k of 256 produces a normalized rank error of less than 1%.
 For example, the median value returned from getQuantile(0.5) will be between the actual values from the hypothetically sorted array of input values at normalized ranks of 0.49 and 0.51, with
@@ -95,6 +125,9 @@ Table Guide for Quantiles DoublesSketch Size in Bytes and Approximate Error:
 4,294,967,295 | 3,744 7,200 13,856 26,656 51,232 98,336 188,448 360,480 688,160 1,310,752 2,490,400 4,718,624
+
+## Accuracy Plots
+
 The following graphs illustrate the ability of the Quantiles DoublesSketch to characterize value distributions.
 * 1024 (n) values (trueUnsortedValues) were generated from Random's nextGaussian(). These values were then sorted (trueSortedValues) and assigned
diff --git a/docs/Quantiles/QuantilesSketchOverview.md b/docs/Quantiles/QuantilesSketchOverview.md
index a06299b2f..85003d54b 100644
--- a/docs/Quantiles/QuantilesSketchOverview.md
+++ b/docs/Quantiles/QuantilesSketchOverview.md
@@ -19,7 +19,15 @@ layout: doc_page
     specific language governing permissions and limitations
     under the License.
 -->
-# Quantiles Sketch Overview
+## Contents
+
+* [Classic Quantiles Sketch Overview](#overview)
+* [Numeric Quantiles](#numeric-quantiles)
+* [Extending Generic Quantiles Classes](#extending)
+* [Implementation Notes](#notes)
+
+
+# Classic Quantiles Sketch Overview
 Package: org.apache.datasketches.quantiles
@@ -30,12 +38,10 @@ Probability Mass Function, getPMF(), and the Cumulative Distribution Function, g
 * **NOTE:** See also the KLL Sketch.
-### Section links:
-* [Section 1](#Section 1) Numeric Quantiles
-* [Section 2](#Section 2) Extending Generic Quantiles Classes
-* [Section 3](#Section 3) Implementation Notes
-## Numeric Quantiles
+
+
+## Numeric Quantiles
 Consider this real data example of a stream of 230 million time-spent events collected from one our systems for a period of just 30 minutes. Each event records the amount of time in milliseconds that a user spends on a web page before moving to a different web page by taking some action, such as a click.
@@ -175,8 +181,8 @@ Code examples are best gleaned from the test code that exercises all the various
 ### END SKETCH SUMMARY
 */
-
-## Extending Generic Quantiles Classes
+
+## Extending Generic Quantiles Classes
 Any item type that is comparable, or for which you can create a Comparator, can also be analyzed by extending the abstract generic classes for that particular item.
@@ -240,8 +246,8 @@ Then obtain the split point values that equally partition the data into 10 parti
 Using a simple binary search you can now split your data into the 10 partitions.
-
-## Implementation Notes
+
+## Implementation Notes
 The quantiles algorithm is an implementation of the Low Discrepancy Mergeable Quantiles Sketch, using double values, described in section 3.2 of the journal version of the paper "Mergeable Summaries" by Agarwal, Cormode, Huang, Phillips, Wei, and Yi.
diff --git a/src/main/resources/docgen/toc.json b/src/main/resources/docgen/toc.json
index 051dde2d8..349f66896 100644
--- a/src/main/resources/docgen/toc.json
+++ b/src/main/resources/docgen/toc.json
@@ -53,36 +53,10 @@
 { "class":"Dropdown", "desc" : "Quantiles And Histograms", "array":
   [
-    {"class":"Doc", "desc" : "Quantiles and Ranks Tutorial", "dir" : "QuantilesAll", "file": "SketchingQuantilesAndRanksTutorial"},
     {"class":"Doc", "desc" : "Quantiles Overview", "dir" : "QuantilesAll", "file": "QuantilesOverview" },
-    {"class":"Doc", "desc" : "KLL Sketch Family", "dir" : "KLL", "file": "KLLSketch" },
-    {"class":"Doc", "desc" : "KLL Sketch Accuracy and Size", "dir" : "KLL", "file": "KLLAccuracyAndSize" },
-    {"class":"Doc", "desc" : "Understanding KLL Bounds", "dir" : "KLL", "file": "UnderstandingKLLBounds" },
-    {"class":"Doc", "desc" : "REQ Floats sketch", "dir" : "REQ", "file": "ReqSketch" },
+    {"class":"Doc", "desc" : "KLL Sketches", "dir" : "KLL", "file": "KLLSketch" },
     {"class":"Doc", "desc" : "Classic Quantiles Sketches", "dir" : "Quantiles", "file": "ClassicQuantilesSketch" },
-
-    { "class":"Dropdown", "desc" : "Quantiles Examples", "array":
-      [
-        {"class":"Doc", "desc" : "Quantiles Sketch Java Example", "dir" : "Quantiles", "file": "QuantilesJavaExample" },
-        {"class":"Doc", "desc" : "KLL Quantiles Sketch C++ Example", "dir" : "KLL", "file": "KLLCppExample" },
-        {"class":"Doc", "desc" : "Quantiles Sketch Pig UDFs", "dir" : "Quantiles", "file": "QuantilesPigUDFs" },
-        {"class":"Doc", "desc" : "Quantiles Sketch Hive UDFs", "dir" : "Quantiles", "file": "QuantilesHiveUDFs" },
-      ]
-    },
-    { "class":"Dropdown", "desc" : "Quantiles Studies", "array":
-      [
-        {"class":"Doc", "desc" : "Druid Approximate Histogram", "dir" : "QuantilesStudies", "file": "DruidApproxHistogramStudy" },
-        {"class":"Doc", "desc" : "Moments Sketch Study", "dir" : "QuantilesStudies", "file": "MomentsSketchStudy" },
-        {"class":"Doc", "desc" : "Quantiles StreamA Study", "dir" : "QuantilesStudies", "file": "QuantilesStreamAStudy" },
-        {"class":"Doc", "desc" : "Exact Quantiles for Studies", "dir" : "QuantilesStudies", "file": "ExactQuantiles" },
-      ]
-    },
-    { "class":"Dropdown", "desc" : "Quantiles Sketch Theory", "array":
-      [
-        {"class":"Doc", "desc" : "Optimal Quantile Approximation in Streams", "dir" : "", "file": "Quantiles_KLL", "pdf":"true" },
-        {"class":"Doc", "desc" : "Quantiles References", "dir" : "QuantilesAll", "file": "QuantilesReferences" },
-      ]
-    },
+    {"class":"Doc", "desc" : "REQ Floats Sketch", "dir" : "REQ", "file": "ReqSketch" },
   ]
 },