Apache DataSketches library functions for Google BigQuery

User-Defined Aggregate Functions (UDAFs) and non-aggregate (scalar) functions (UDFs) for BigQuery SQL engine.

Please visit the main Apache DataSketches website for more information about DataSketches library.

If you are interested in making contributions to this project please see our Community page for how to contact us.

Requirements

Requires Emscripten (emcc compiler)
Requires a link to datasketches-cpp in this repository
Requires make utility
Requires Google Cloud CLI

Building, Installing, and Testing

Requires setting environment variables
- JS_BUCKET: to hold compiled artifacts
- BQ_PROJECT: location of stored SQL functions (routines)
- BQ_DATASET: location of stored SQL functions (routines)

make          # performs compilation
make install  # upload to $JS_BUCKET & create functions in $BQ_PROJECT.$BQ_DATASET
make test     # runs predefined tests in BQ

The above steps can be executed in the root directory to install everything, or can be run from an individual sketch directory to install only that particular sketch.

BigQuery Sketch Functions

DataSketches are probabilistic data structures that can process massive amounts of data and return very accurate results with a small memory footprint. Because of this, DataSketches are particularly useful for "big data" use cases such as streaming analytics and data warehousing.

This package includes BigQuery UD(A)Fs for the following Sketch types:

Sketch Type	Description
Frequent Items Sketch	Estimates the frequency of items in a stream to find the most frequent ones (e.g., top-selling products, most active users).
CPC Sketch	A very compact sketch for estimating the number of unique items, especially in distributed environments.
HLL Sketch	Memory-efficient sketch for estimating the number of unique items, optimized for high accuracy.
KLL Sketch	Estimates the distribution of values, allowing you to find quantiles (like median, percentiles) without storing all the data.
Theta Sketch	Estimates unique items and supports set operations (union, intersection, difference) on those items.
Tuple Sketch	Similar to Theta Sketch but allows associating values with each unique item, enabling operations like sum, min, max on those values.
TDigest	Another algorithm and very compact data structure for estimating quantiles and ranks of numeric values.

CPC Sketch Functions

Description: CPC sketches are a compact and efficient way to estimate the cardinality of a dataset, especially in distributed environments. They provide accurate estimates with low memory usage and are particularly useful for applications like counting unique users, analyzing website traffic, or tracking distinct events.

For more information: CPC Sketches

Function Name	Function Type	Signature	Description
cpc_sketch_agg_union	AGGREGATE	(sketch BYTES) -> BYTES	Creates a sketch that represents the union of the given column of sketches. Param sketch: the column of sketches. Each as BYTES. Defaults: lg_k = 12, seed = 9001. Returns: a Compact, Compressed CPC Sketch, as BYTES.
cpc_sketch_agg_string	AGGREGATE	(str STRING) -> BYTES	Creates a sketch that represents the cardinality of the given STRING column. Param str: the STRING column of identifiers. Defaults: lg_k = 12, seed = 9001. Returns: a Compact, Compressed CPC Sketch, as BYTES
cpc_sketch_agg_int64	AGGREGATE	(value INT64) -> BYTES	Creates a sketch that represents the cardinality of the given INT64 column. Param value: the INT64 column of identifiers. Defaults: lg_k = 12, seed = 9001. Returns: a Compact, Compressed CPC Sketch, as BYTES
cpc_sketch_agg_string_lgk_seed	AGGREGATE	(str STRING, params STRUCT<lg_k BYTEINT, seed INT64> NOT AGGREGATE) -> BYTES	Creates a sketch that represents the cardinality of the given STRING column. Param str: the STRING column of identifiers. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 26]. Param seed: the seed to be used by the underlying hash function. Returns: a Compact, Compressed CPC Sketch, as BYTES
cpc_sketch_agg_union_lgk_seed	AGGREGATE	(sketch BYTES, params STRUCT<lg_k BYTEINT, seed INT64> NOT AGGREGATE) -> BYTES	Creates a sketch that represents the union of the given column of sketches. Param sketch: the column of sketches. Each as BYTES. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 26]. Param seed: This is used to confirm that the given sketches were configured with the correct seed. Returns: a Compact, Compressed CPC Sketch, as BYTES.
cpc_sketch_agg_int64_lgk_seed	AGGREGATE	(value INT64, params STRUCT<lg_k BYTEINT, seed INT64> NOT AGGREGATE) -> BYTES	Creates a sketch that represents the cardinality of the given INT64 column. Param value: the INT64 column of identifiers. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 26]. Param seed: the seed to be used by the underlying hash function. Returns: a Compact, Compressed CPC Sketch, as BYTES
cpc_sketch_get_estimate	SCALAR	(sketch BYTES) -> FLOAT64	Gets cardinality estimate and bounds from given sketch. Param sketch: The given sketch to query as BYTES. Defaults: seed = 9001. Returns: a FLOAT64 value as the cardinality estimate.
cpc_sketch_to_string	SCALAR	(sketch BYTES) -> STRING	Returns a summary string that represents the state of the given sketch. Param sketch the given sketch as BYTES. Defaults: seed = 9001. Returns: a STRING that represents the state of the given sketch.
cpc_sketch_get_estimate_seed	SCALAR	(sketch BYTES, seed INT64) -> FLOAT64	Gets cardinality estimate and bounds from given sketch. Param sketch: The given sketch to query as BYTES. Param seed: This is used to confirm that the given sketch was configured with the correct seed. Returns: a FLOAT64 value as the cardinality estimate.
cpc_sketch_to_string_seed	SCALAR	(sketch BYTES, seed INT64) -> STRING	Returns a summary string that represents the state of the given sketch. Param sketch the given sketch as BYTES. Param seed: This is used to confirm that the given sketch was configured with the correct seed. Returns: a STRING that represents the state of the given sketch.
cpc_sketch_union	SCALAR	(sketchA BYTES, sketchB BYTES) -> BYTES	Computes a sketch that represents the scalar union of the two given sketches. Param sketchA: the first sketch as BYTES. Param sketchB: the second sketch as BYTES. Defaults: lg_k = 12, seed = 9001. Returns: a CPC Sketch, as BYTES.
cpc_sketch_get_estimate_and_bounds	SCALAR	(sketch BYTES, num_std_devs BYTEINT) -> STRUCT<estimate FLOAT64, lower_bound FLOAT64, upper_bound FLOAT64>	Gets cardinality estimate and bounds from given sketch. Param sketch: The given sketch to query as bytes. Param num_std_devs: The returned bounds will be based on the statistical confidence interval determined by the given number of standard deviations from the returned estimate. This number may be one of {1,2,3}, where 1 represents 68% confidence, 2 represents 95% confidence and 3 represents 99.7% confidence. For example, if the given num_std_devs = 2 and the returned values are {1000, 990, 1010} that means that with 95% confidence, the true value lies within the range [990, 1010]. Defaults: seed = 9001. Returns: a STRUCT with 3 FLOAT64 values as {estimate, lower_bound, upper_bound}.
cpc_sketch_union_lgk_seed	SCALAR	(sketchA BYTES, sketchB BYTES, lg_k BYTEINT, seed INT64) -> BYTES	Computes a sketch that represents the scalar union of the two given sketches. Param sketchA: the first sketch as BYTES. Param sketchB: the second sketch as BYTES. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 26]. Param seed: This is used to confirm that the given sketches were configured with the correct seed. Returns: a CPC Sketch, as BYTES.
cpc_sketch_get_estimate_and_bounds_seed	SCALAR	(sketch BYTES, num_std_devs BYTEINT, seed INT64) -> STRUCT<estimate FLOAT64, lower_bound FLOAT64, upper_bound FLOAT64>	Gets cardinality estimate and bounds from given sketch. Param sketch: The given sketch to query as bytes. Param num_std_devs: The returned bounds will be based on the statistical confidence interval determined by the given number of standard deviations from the returned estimate. This number may be one of {1,2,3}, where 1 represents 68% confidence, 2 represents 95% confidence and 3 represents 99.7% confidence. For example, if the given num_std_devs = 2 and the returned values are {1000, 990, 1010} that means that with 95% confidence, the true value lies within the range [990, 1010]. Param seed: This is used to confirm that the given sketch was configured with the correct seed. Returns: a STRUCT with 3 FLOAT64 values as {estimate, lower_bound, upper_bound}.

Examples:

# using defaults
# expected 5
select `$BQ_DATASET`.cpc_sketch_get_estimate(
  `$BQ_DATASET`.cpc_sketch_union(
    (select `$BQ_DATASET`.cpc_sketch_agg_string(str) from unnest(["a", "b", "c"]) as str),
    (select `$BQ_DATASET`.cpc_sketch_agg_string(str) from unnest(["c", "d", "e"]) as str)
  )
);

# using full signatures
# expected 5
select `$BQ_DATASET`.cpc_sketch_get_estimate_seed(
  `$BQ_DATASET`.cpc_sketch_union_lgk_seed(
    (select `$BQ_DATASET`.cpc_sketch_agg_string_lgk_seed(str, struct<byteint, int64>(10, 111)) from unnest(["a", "b", "c"]) as str),
    (select `$BQ_DATASET`.cpc_sketch_agg_string_lgk_seed(str, struct<byteint, int64>(10, 111)) from unnest(["c", "d", "e"]) as str),
    10,
    111
  ),
  111
);

# using defaults
create or replace table `$BQ_DATASET`.cpc_sketch(sketch bytes);

insert into `$BQ_DATASET`.cpc_sketch
(select `$BQ_DATASET`.cpc_sketch_agg_int64(value) from unnest(GENERATE_ARRAY(1, 10000, 1)) as value);
insert into `$BQ_DATASET`.cpc_sketch
(select `$BQ_DATASET`.cpc_sketch_agg_int64(value) from unnest(GENERATE_ARRAY(100000, 110000, 1)) as value);

select `$BQ_DATASET`.cpc_sketch_to_string(sketch) from `$BQ_DATASET`.cpc_sketch;

# expected about 20000
select `$BQ_DATASET`.cpc_sketch_get_estimate(
  `$BQ_DATASET`.cpc_sketch_agg_union(sketch)
) from `$BQ_DATASET`.cpc_sketch;

select `$BQ_DATASET`.cpc_sketch_get_estimate_and_bounds(
  `$BQ_DATASET`.cpc_sketch_agg_union(sketch),
  3
) from `$BQ_DATASET`.cpc_sketch;

drop table `$BQ_DATASET`.cpc_sketch;

# using full signatures
create or replace table `$BQ_DATASET`.cpc_sketch(sketch bytes);

insert into `$BQ_DATASET`.cpc_sketch
(select `$BQ_DATASET`.cpc_sketch_agg_int64_lgk_seed(value, struct<byteint, int64>(10, 111)) from unnest(GENERATE_ARRAY(1, 10000, 1)) as value);
insert into `$BQ_DATASET`.cpc_sketch
(select `$BQ_DATASET`.cpc_sketch_agg_int64_lgk_seed(value, struct<byteint, int64>(10, 111)) from unnest(GENERATE_ARRAY(100000, 110000, 1)) as value);

select `$BQ_DATASET`.cpc_sketch_to_string_seed(sketch, 111) from `$BQ_DATASET`.cpc_sketch;

# expected about 20000
select `$BQ_DATASET`.cpc_sketch_get_estimate_seed(
  `$BQ_DATASET`.cpc_sketch_agg_union_lgk_seed(sketch, struct<byteint, int64>(10, 111)),
  111
) from `$BQ_DATASET`.cpc_sketch;

select `$BQ_DATASET`.cpc_sketch_get_estimate_and_bounds_seed(
  `$BQ_DATASET`.cpc_sketch_agg_union_lgk_seed(sketch, struct<byteint, int64>(10, 111)),
  3,
  111
) from `$BQ_DATASET`.cpc_sketch;

drop table `$BQ_DATASET`.cpc_sketch;

FI Sketch Functions

Description: Frequent Items (FI) sketches are used to estimate the frequencies of items in a dataset. They are effective for identifying the most frequent items, such as the top products purchased or the most popular search queries.

For more information: Frequency Sketches

Function Name	Function Type	Signature	Description
frequent_strings_sketch_merge	AGGREGATE	(sketch BYTES, lg_max_map_size BYTEINT NOT AGGREGATE) -> BYTES	Merges sketches from the given column. Param sketch: the column of values. Param lg_max_map_size: the sketch accuracy/size parameter as an integer not less than 3. Returns: a serialized Frequent Strings sketch as BYTES.
frequent_strings_sketch_build	AGGREGATE	(item STRING, weight INT64, lg_max_map_size BYTEINT NOT AGGREGATE) -> BYTES	Creates a sketch that represents frequencies of the given column. Param item: the column of STRING values. Param weight: the amount by which the weight of the item should be increased. Param lg_max_map_size: the sketch accuracy/size parameter as a BYTEINT not less than 3. Returns: a Frequent Strings Sketch, as bytes.
frequent_strings_sketch_to_string	SCALAR	(sketch BYTES) -> STRING	Returns a summary string that represents the state of the given sketch. Param sketch: the given sketch as sketch encoded bytes. Returns: a string that represents the state of the given sketch.
frequent_strings_sketch_get_result	SCALAR	(sketch BYTES, error_type STRING, threshold INT64) -> ARRAY<STRUCT<item STRING, estimate INT64, lower_bound INT64, upper_bound INT64>>	Returns an array of rows that include frequent items, estimates, lower and upper bounds given an error_type and a threshold. Param sketch: the given sketch as sketch encoded bytes. Param error_type: determines whether no false positives or no false negatives are desired. Param threshold: a threshold to include items in the result list. If NULL, the maximum error of the sketch is used as a threshold. Returns: an array of frequent items with frequency estimates, lower and upper bounds.

Examples:

select `$BQ_DATASET`.frequent_strings_sketch_to_string(`$BQ_DATASET`.frequent_strings_sketch_build(str, 1, 5)) from unnest(["a", "b", "c"]) as str;

create or replace table `$BQ_DATASET`.fs_sketch(sketch bytes);

insert into `$BQ_DATASET`.fs_sketch
(select `$BQ_DATASET`.frequent_strings_sketch_build(str, 1, 5) from unnest(["a", "b", "c", "d"]) as str);

insert into `$BQ_DATASET`.fs_sketch
(select `$BQ_DATASET`.frequent_strings_sketch_build(str, 1, 5) from unnest(["a", "a", "c"]) as str);

select `$BQ_DATASET`.frequent_strings_sketch_get_result(`$BQ_DATASET`.frequent_strings_sketch_merge(sketch, 5), "NO_FALSE_NEGATIVES", null) from `$BQ_DATASET`.fs_sketch;

drop table `$BQ_DATASET`.fs_sketch;

HLL Sketch Functions

Description: HyperLogLog (HLL) sketches are another type of cardinality estimation sketch. They are known for their high accuracy and low memory consumption, making them suitable for large datasets and real-time analytics.

For more information: HLL Sketches

Function Name	Function Type	Signature	Description
hll_sketch_agg_string	AGGREGATE	(str STRING) -> BYTES	Creates a sketch that represents the cardinality of the given STRING column. Param str: the STRING column of identifiers. Defaults: lg_k = 12, tgt_type = HLL_4. Returns: an HLL Sketch, as BYTES.
hll_sketch_agg_union	AGGREGATE	(sketch BYTES) -> BYTES	Creates a sketch that represents the union of the given column of sketches. Param sketch: the column of sketches. Each as BYTES. Defaults: lg_k = 12, tgt_type = HLL_4. Returns: an HLL Sketch, as BYTES.
hll_sketch_agg_int64	AGGREGATE	(value INT64) -> BYTES	Creates a sketch that represents the cardinality of the given INT64 column. Param value: the INT64 column of identifiers. Defaults: lg_k = 12, tgt_type = HLL_4. Returns: an HLL Sketch, as BYTES.
hll_sketch_agg_string_lgk_type	AGGREGATE	(str STRING, params STRUCT<lg_k BYTEINT, tgt_type STRING> NOT AGGREGATE) -> BYTES	Creates a sketch that represents the cardinality of the given STRING column. Param str: the STRING column of identifiers. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 21]. Param tgt_type: The HLL type to use: one of {"HLL_4", "HLL_6", "HLL_8"}. Returns: an HLL Sketch, as BYTES.
hll_sketch_agg_union_lgk_type	AGGREGATE	(sketch BYTES, params STRUCT<lg_k BYTEINT, tgt_type STRING> NOT AGGREGATE) -> BYTES	Creates a sketch that represents the union of the given column of sketches. Param sketch: the column of sketches. Each as BYTES. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 21]. Param tgt_type: The HLL type to use: one of {"HLL_4", "HLL_6", "HLL_8"}. Returns: an HLL Sketch, as BYTES.
hll_sketch_agg_int64_lgk_type	AGGREGATE	(value INT64, params STRUCT<lg_k BYTEINT, tgt_type STRING> NOT AGGREGATE) -> BYTES	Creates a sketch that represents the cardinality of the given INT64 column. Param value: the INT64 column of identifiers. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 21]. Param tgt_type: The HLL type to use: one of {"HLL_4", "HLL_6", "HLL_8"}. Returns: an HLL Sketch, as BYTES.
hll_sketch_get_estimate	SCALAR	(sketch BYTES) -> FLOAT64	Returns a summary string that represents the state of the given sketch. Param sketch: the given sketch as BYTES. Returns: the cardinality estimate as FLOAT64 value.
hll_sketch_to_string	SCALAR	(sketch BYTES) -> STRING	Returns a summary string that represents the state of the given sketch. Param sketch: the given sketch as BYTES. Returns: a STRING that represents the state of the given sketch.
hll_sketch_union	SCALAR	(sketchA BYTES, sketchB BYTES) -> BYTES	Computes a sketch that represents the union of the two given sketches. Param sketchA: the first sketch as bytes. Param sketchB: the second sketch as bytes. Defaults: lg_k = 12, tgt_type = HLL_4. Returns: an HLL Sketch, as BYTES.
hll_sketch_union_lgk_type	SCALAR	(sketchA BYTES, sketchB BYTES, lg_k BYTEINT, tgt_type STRING) -> BYTES	Computes a sketch that represents the union of the two given sketches. Param sketchA: the first sketch as bytes. Param sketchB: the second sketch as bytes. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 21]. Param tgt_type: The HLL type to use: one of {"HLL_4", "HLL_6", "HLL_8"}. Returns: an HLL Sketch, as BYTES.
hll_sketch_get_estimate_and_bounds	SCALAR	(sketch BYTES, num_std_devs BYTEINT) -> STRUCT<estimate FLOAT64, lower_bound FLOAT64, upper_bound FLOAT64>	Gets cardinality estimate and bounds from given sketch. Param sketch: The given sketch to query as BYTES. Param num_std_devs: The returned bounds will be based on the statistical confidence interval determined by the given number of standard deviations from the returned estimate. This number may be one of {1,2,3}, where 1 represents 68% confidence, 2 represents 95% confidence and 3 represents 99.7% confidence. For example, if the given num_std_devs = 2 and the returned values are {1000, 990, 1010} that means that with 95% confidence, the true value lies within the range [990, 1010]. Returns: a struct with 3 FLOAT64 values as {estimate, lower_bound, upper_bound}.

Examples:

# expected 3
select `$BQ_DATASET`.hll_sketch_get_estimate(`$BQ_DATASET`.hll_sketch_agg_string(s)) from unnest(["a", "b", "c"]) as s;

select `$BQ_DATASET`.hll_sketch_to_string(`$BQ_DATASET`.hll_sketch_agg_string(s)) from unnest(["a", "b", "c"]) as s;

# expected 5
select `$BQ_DATASET`.hll_sketch_get_estimate_and_bounds(
  `$BQ_DATASET`.hll_sketch_union_lgk_type(
    (select `$BQ_DATASET`.hll_sketch_agg_string_lgk_type(str, struct<byteint, string>(10, "HLL_8")) from unnest(["a", "b", "c"]) as str),
    (select `$BQ_DATASET`.hll_sketch_agg_string_lgk_type(str, struct<byteint, string>(10, "HLL_8")) from unnest(["c", "d", "e"]) as str),
    10,
    "HLL_8"
  ),
  2
);

select `$BQ_DATASET`.hll_sketch_to_string(
  `$BQ_DATASET`.hll_sketch_union_lgk_type(
    (select `$BQ_DATASET`.hll_sketch_agg_string_lgk_type(str, struct<byteint, string>(10, "HLL_8")) from unnest(["a", "b", "c"]) as str),
    (select `$BQ_DATASET`.hll_sketch_agg_string_lgk_type(str, struct<byteint, string>(10, "HLL_8")) from unnest(["c", "d", "e"]) as str),
    10,
    "HLL_8"
  )
);

create or replace table `$BQ_DATASET`.hll_sketch(sketch bytes);

insert into `$BQ_DATASET`.hll_sketch
(select `$BQ_DATASET`.hll_sketch_agg_int64(value) from unnest(GENERATE_ARRAY(1, 10000, 1)) as value);
insert into `$BQ_DATASET`.hll_sketch
(select `$BQ_DATASET`.hll_sketch_agg_int64(value) from unnest(GENERATE_ARRAY(100000, 110000, 1)) as value);

# expected estimate about 20000
select `$BQ_DATASET`.hll_sketch_to_string(
  `$BQ_DATASET`.hll_sketch_agg_union(sketch)
) from `$BQ_DATASET`.hll_sketch;

select `$BQ_DATASET`.hll_sketch_to_string(
  `$BQ_DATASET`.hll_sketch_agg_union_lgk_type(sketch, struct<byteint, string>(10, "HLL_6"))
) from `$BQ_DATASET`.hll_sketch;

drop table `$BQ_DATASET`.hll_sketch;

create or replace table `$BQ_DATASET`.hll_sketch(sketch bytes);

insert into `$BQ_DATASET`.hll_sketch
(select `$BQ_DATASET`.hll_sketch_agg_int64_lgk_type(value, struct<int, string>(8, "HLL_6")) from unnest(GENERATE_ARRAY(1, 10000, 1)) as value);
insert into `$BQ_DATASET`.hll_sketch
(select `$BQ_DATASET`.hll_sketch_agg_int64_lgk_type(value, struct<int, string>(8, "HLL_6")) from unnest(GENERATE_ARRAY(100000, 110000, 1)) as value);

# expected estimate about 20000
select `$BQ_DATASET`.hll_sketch_to_string(
  `$BQ_DATASET`.hll_sketch_agg_union_lgk_type(sketch, struct<byteint, string>(8, "HLL_6"))
) from `$BQ_DATASET`.hll_sketch;

drop table `$BQ_DATASET`.hll_sketch;

KLL Sketch Functions

Description: KLL sketches are quantile sketches that provide approximate quantiles for a dataset. They are useful for understanding the distribution of data and calculating percentiles, such as the median or 95th percentile.

For more information: KLL Sketches

Function Name	Function Type	Signature	Description
kll_sketch_float_build	AGGREGATE	(value FLOAT64) -> BYTES	Creates a sketch that represents the distribution of the given column. Param value: the column of FLOAT64 values. Defaults: k = 200. Returns: a KLL Sketch, as bytes.
kll_sketch_float_merge	AGGREGATE	(sketch BYTES) -> BYTES	Merges sketches from the given column. Param sketch: the column of values. Defaluts: k = 200. Returns: a serialized KLL sketch as BYTES.
kll_sketch_float_merge_k	AGGREGATE	(sketch BYTES, k INT NOT AGGREGATE) -> BYTES	Merges sketches from the given column. Param sketch: the column of values. Param k: the sketch accuracy/size parameter as an integer in the range [8, 65535]. Returns: a serialized KLL sketch as BYTES.
kll_sketch_float_build_k	AGGREGATE	(value FLOAT64, k INT NOT AGGREGATE) -> BYTES	Creates a sketch that represents the distribution of the given column. Param value: the column of FLOAT64 values. Param k: the sketch accuracy/size parameter as an INT in the range [8, 65535]. Returns: a KLL Sketch, as bytes.
kll_sketch_float_get_n	SCALAR	(sketch BYTES) -> INT64	Returns the length of the input stream. Param sketch: the given sketch as BYTES. Returns: stream length as INT64
kll_sketch_float_get_min_value	SCALAR	(sketch BYTES) -> FLOAT64	Returns the minimum value of the input stream. Param sketch: the given sketch as BYTES. Returns: min value as FLOAT64
kll_sketch_float_to_string	SCALAR	(sketch BYTES) -> STRING	Returns a summary string that represents the state of the given sketch. Param sketch: the given sketch as sketch encoded bytes. Returns: a string that represents the state of the given sketch.
kll_sketch_float_get_num_retained	SCALAR	(sketch BYTES) -> INT64	Returns the number of retained items (samples) in the sketch. Param sketch: the given sketch as BYTES. Returns: number of retained items as INT64
kll_sketch_float_get_max_value	SCALAR	(sketch BYTES) -> FLOAT64	Returns the maximum value of the input stream. Param sketch: the given sketch as BYTES. Returns: max value as FLOAT64
kll_sketch_float_get_normalized_rank_error	SCALAR	(sketch BYTES, pmf BOOL) -> FLOAT64	Returns the approximate rank error of the given sketch normalized as a fraction between zero and one. Param sketch: the given sketch as BYTES. Param pmf: if true, returns the "double-sided" normalized rank error for the get_PMF() function. Otherwise, it is the "single-sided" normalized rank error for all the other queries. Returns: normalized rank error as FLOAT64
kll_sketch_float_get_rank	SCALAR	(sketch BYTES, value FLOAT64, inclusive BOOL) -> FLOAT64	Returns an approximation to the normalized rank, on the interval [0.0, 1.0], of the given value. Param sketch: the given sketch in serialized form. Param value: value to be ranked. Param inclusive: if true the weight of the given value is included into the rank. Returns: an approximate rank of the given value.
kll_sketch_float_get_pmf	SCALAR	(sketch BYTES, split_points ARRAY, inclusive BOOL) -> ARRAY	Returns an approximation to the Probability Mass Function (PMF) of the input stream as an array of probability masses defined by the given split_points. Param sketch: the given sketch as BYTES. Param split_points: an array of M unique, monotonically increasing values (of the same type as the input values) that divide the input value domain into M+1 non-overlapping intervals. Each interval except for the end intervals starts with a split-point and ends with the next split-point in sequence. The first interval starts below the minimum value of the stream (corresponding to a zero rank or zero probability), and ends with the first split-point The last (m+1)th interval starts with the last split-point and ends above the maximum value of the stream (corresponding to a rank or probability of 1.0). Param inclusive: if true and the upper boundary of an interval equals a value retained by the sketch, the interval will include that value. If the lower boundary of an interval equals a value retained by the sketch, the interval will exclude that value. If false and the upper boundary of an interval equals a value retained by the sketch, the interval will exclude that value. If the lower boundary of an interval equals a value retained by the sketch, the interval will include that value. Returns: the PMF as a FLOAT64 array of M+1 probability masses on the interval [0.0, 1.0]. The sum of the probability masses of all (m+1) intervals is 1.0.
kll_sketch_float_kolmogorov_smirnov	SCALAR	(sketchA BYTES, sketchB BYTES, pvalue FLOAT64) -> BOOL	Performs the Kolmogorov-Smirnov Test between two KLL sketches of type FLOAT64. If the given sketches have insufficient data or if the sketch sizes are too small, this will return false. Param sketchA: sketch A in serialized form. Param sketchB: sketch B in serialized form. Param pvalue: Target p-value. Typically 0.001 to 0.1, e.g. 0.05. Returns: boolean indicating whether we can reject the null hypothesis (that the sketches reflect the same underlying distribution) using the provided p-value.
kll_sketch_float_get_cdf	SCALAR	(sketch BYTES, split_points ARRAY, inclusive BOOL) -> ARRAY	Returns an approximation to the Cumulative Distribution Function (CDF) of the input stream as an array of cumulative probabilities defined by the given split_points. Param sketch: the given sketch as BYTES. Param split_points: an array of M unique, monotonically increasing values (of the same type as the input values to the sketch) that divide the input value domain into M+1 overlapping intervals. The start of each interval is below the lowest input value retained by the sketch (corresponding to a zero rank or zero probability). The end of each interval is the associated split-point except for the top interval where the end is the maximum input value of the stream. Param inclusive: if true and the upper boundary of an interval equals a value retained by the sketch, the interval will include that value. If the lower boundary of an interval equals a value retained by the sketch, the interval will exclude that value. If false and the upper boundary of an interval equals a value retained by the sketch, the interval will exclude that value. If the lower boundary of an interval equals a value retained by the sketch, the interval will include that value. Returns: the CDF as a monotonically increasing FLOAT64 array of M+1 cumulative probablities on the interval [0.0, 1.0]. The top-most probability of the returned array is always 1.0.
kll_sketch_float_get_quantile	SCALAR	(sketch BYTES, rank FLOAT64, inclusive BOOL) -> FLOAT64	Returns a value from the sketch that is the best approximation to a value from the original stream with the given rank. Param sketch: the given sketch in serialized form. Param rank: rank of a value in the hypothetical sorted stream. Param inclusive: if true, the given rank is considered inclusive (includes weight of a value) Returns: an approximate quantile associated with the given rank.

Examples:

create or replace table `$BQ_DATASET`.kll_sketch(sketch bytes);

# using default
insert into `$BQ_DATASET`.kll_sketch
(select `$BQ_DATASET`.kll_sketch_float_build(value) from unnest([1,2,3,4,5,6,7,8,9,10]) as value);

# using full signature
insert into `$BQ_DATASET`.kll_sketch
(select `$BQ_DATASET`.kll_sketch_float_build_k(value, 100) from unnest([11,12,13,14,15,16,17,18,19,20]) as value);

select `$BQ_DATASET`.kll_sketch_float_to_string(sketch) from `$BQ_DATASET`.kll_sketch;

# using default
select `$BQ_DATASET`.kll_sketch_float_to_string(`$BQ_DATASET`.kll_sketch_float_merge(sketch)) from `$BQ_DATASET`.kll_sketch;

# using full signature
select `$BQ_DATASET`.kll_sketch_float_to_string(`$BQ_DATASET`.kll_sketch_float_merge_k(sketch, 100)) from `$BQ_DATASET`.kll_sketch;

# expected 0.5
select `$BQ_DATASET`.kll_sketch_float_get_rank(`$BQ_DATASET`.kll_sketch_float_merge(sketch), 10, true) from `$BQ_DATASET`.kll_sketch;

# expected 10
select `$BQ_DATASET`.kll_sketch_float_get_quantile(`$BQ_DATASET`.kll_sketch_float_merge(sketch), 0.5, true) from `$BQ_DATASET`.kll_sketch;

# expected 20
select `$BQ_DATASET`.kll_sketch_float_get_n(`$BQ_DATASET`.kll_sketch_float_merge(sketch)) from `$BQ_DATASET`.kll_sketch;

# expected 0.5, 0.5
select `$BQ_DATASET`.kll_sketch_float_get_pmf(`$BQ_DATASET`.kll_sketch_float_merge(sketch), [10.0], true) from `$BQ_DATASET`.kll_sketch;

# expected 0.5, 1
select `$BQ_DATASET`.kll_sketch_float_get_cdf(`$BQ_DATASET`.kll_sketch_float_merge(sketch), [10.0], true) from `$BQ_DATASET`.kll_sketch;

# expected 1
select `$BQ_DATASET`.kll_sketch_float_get_min_value(`$BQ_DATASET`.kll_sketch_float_merge(sketch)) from `$BQ_DATASET`.kll_sketch;

# expected 20
select `$BQ_DATASET`.kll_sketch_float_get_max_value(`$BQ_DATASET`.kll_sketch_float_merge(sketch)) from `$BQ_DATASET`.kll_sketch;

drop table `$BQ_DATASET`.kll_sketch;

# expected about 1.3%
select `$BQ_DATASET`.kll_sketch_float_get_normalized_rank_error(`$BQ_DATASET`.kll_sketch_float_build(value), false) from unnest(generate_array(1, 10000)) as value;

select `$BQ_DATASET`.kll_sketch_float_get_num_retained(`$BQ_DATASET`.kll_sketch_float_build(value)) from unnest(generate_array(1, 10000)) as value;

# expected false
select `$BQ_DATASET`.kll_sketch_float_kolmogorov_smirnov(
  (select `$BQ_DATASET`.kll_sketch_float_build(value) from unnest([1,2,3,4,5,6,7,8,9,10]) as value),
  (select `$BQ_DATASET`.kll_sketch_float_build(value) from unnest([1,2,3,4,5,6,7,8,9,10]) as value),
  0.05
);

# expected true
select `$BQ_DATASET`.kll_sketch_float_kolmogorov_smirnov(
  (select `$BQ_DATASET`.kll_sketch_float_build(value) from unnest([1,2,3,4,5,6,7,8,9,10]) as value),
  (select `$BQ_DATASET`.kll_sketch_float_build(value) from unnest([11,12,13,14,15,16,17,18,19,20]) as value),
  0.05
);

Theta Sketch Functions

Description: Theta sketches are used for set operations like union, intersection, and difference. They are efficient for estimating the size of these operations on large datasets, enabling applications like analyzing user overlap or comparing different groups.

For more information: Theta sketches

Function Name	Function Type	Signature	Description
theta_sketch_agg_int64	AGGREGATE	(value INT64) -> BYTES	Creates a sketch that represents the cardinality of the given INT64 column. Param value: the INT64 column of identifiers. Defaults: lg_k = 12, seed = 9001, p = 1.0. Returns: a Compact, Compressed Theta Sketch, as BYTES.
theta_sketch_agg_union	AGGREGATE	(sketch BYTES) -> BYTES	Creates a sketch that represents the union of the given column of sketches. Param sketch: the column of sketches. Each as BYTES. Defaults: lg_k = 12, seed = 9001. Returns: a Compact, Compressed Theta Sketch, as BYTES.
theta_sketch_agg_string	AGGREGATE	(str STRING) -> BYTES	Creates a sketch that represents the cardinality of the given STRING column. Param str: the STRING column of identifiers. Defaults: lg_k = 12, seed = 9001, p = 1.0. Returns: a Compact, Compressed Theta Sketch, as BYTES.
theta_sketch_agg_union_lgk_seed	AGGREGATE	(sketch BYTES, params STRUCT<lg_k BYTEINT, seed INT64> NOT AGGREGATE) -> BYTES	Creates a sketch that represents the union of the given column of sketches. Param sketch: the column of sketches. Each as BYTES. Param lg_k: the sketch accuracy/size parameter as a BYTEINT in the range [4, 26]. Param seed: This is used to confirm that the given sketches were configured with the correct seed. Returns: a Compact, Compressed Theta Sketch, as BYTES.
theta_sketch_agg_int64_lgk_seed_p	AGGREGATE	(value INT64, params STRUCT<lg_k BYTEINT, seed INT64, p FLOAT64> NOT AGGREGATE) -> BYTES	Creates a sketch that represents the cardinality of the given INT64 column. Param value: the INT64 column of identifiers. Param lg_k: the sketch accuracy/size parameter as a BYTEINT in the range [4, 26]. A NULL specifies the default of 12. Param seed: the seed to be used by the underlying hash function. A NULL specifies the default of 9001. Param p: up-front sampling probability. A NULL specifies the default of 1.0. Returns: a Compact, Compressed Theta Sketch, as BYTES.
theta_sketch_agg_string_lgk_seed_p	AGGREGATE	(str STRING, params STRUCT<lg_k BYTEINT, seed INT64, p FLOAT64> NOT AGGREGATE) -> BYTES	Creates a sketch that represents the cardinality of the given STRING column. Param str: the STRING column of identifiers. Param lg_k: the sketch accuracy/size parameter as a BYTEINT in the range [4, 26]. A NULL specifies the default of 12. Param seed: the seed to be used by the underlying hash function. A NULL specifies the default of 9001. Param p: up-front sampling probability. A NULL specifies the default of 1.0. Returns: a Compact, Compressed Theta Sketch, as BYTES.
theta_sketch_get_estimate	SCALAR	(sketch BYTES) -> FLOAT64	Gets cardinality estimate and bounds from given sketch. Param sketch: The given sketch to query as BYTES. Defaults: seed = 9001. Returns: a FLOAT64 value as the cardinality estimate.
theta_sketch_to_string	SCALAR	(sketch BYTES) -> STRING	Returns a summary string that represents the state of the given sketch. Param sketch: the given sketch as BYTES. Defaults: seed = 9001. Returns: a STRING that represents the state of the given sketch.
theta_sketch_get_num_retained	SCALAR	(sketch BYTES) -> INT	Returns the number of retained entries in the given sketch. Param sketch: The given sketch to query as BYTES. Defaults: seed = 9001. Returns: number of retained entries as INT.
theta_sketch_get_theta	SCALAR	(sketch BYTES) -> FLOAT64	Returns theta (effective sampling rate) as a fraction from 0 to 1. Param sketch: The given sketch to query as BYTES. Defaults: seed = 9001. Returns: theta as FLOAT64.
theta_sketch_get_num_retained_seed	SCALAR	(sketch BYTES, seed INT64) -> INT	Returns the number of retained entries in the given sketch. Param sketch: The given sketch to query as BYTES. Param seed: This is used to confirm that the given sketch was configured with the correct seed. Returns: number of retained entries as INT.
theta_sketch_get_estimate_seed	SCALAR	(sketch BYTES, seed INT64) -> FLOAT64	Gets cardinality estimate and bounds from given sketch. Param sketch: The given sketch to query as BYTES. Param seed: This is used to confirm that the given sketch was configured with the correct seed. Returns: a FLOAT64 value as the cardinality estimate.
theta_sketch_to_string_seed	SCALAR	(sketch BYTES, seed INT64) -> STRING	Returns a summary string that represents the state of the given sketch. Param sketch: the given sketch as BYTES. Param seed: This is used to confirm that the given sketch was configured with the correct seed. Returns: a STRING that represents the state of the given sketch.
theta_sketch_get_theta_seed	SCALAR	(sketch BYTES, seed INT64) -> FLOAT64	Returns theta (effective sampling rate) as a fraction from 0 to 1. Param sketch: The given sketch to query as BYTES. Param seed: This is used to confirm that the given sketch was configured with the correct seed. Returns: theta as FLOAT64.
theta_sketch_intersection	SCALAR	(sketchA BYTES, sketchB BYTES) -> BYTES	Computes a sketch that represents the scalar intersection of the two given sketches. Param sketchA: the first sketch as BYTES. Param sketchB: the second sketch as BYTES. Defaults: seed = 9001. Returns: a Compact, Compressed Theta Sketch, as BYTES.
theta_sketch_union	SCALAR	(sketchA BYTES, sketchB BYTES) -> BYTES	Computes a sketch that represents the scalar union of the two given sketches. Param sketchA: the first sketch as BYTES. Param sketchB: the second sketch as BYTES. Defaults: lg_k = 12, seed = 9001. Returns: a Compact, Compressed Theta Sketch, as BYTES.
theta_sketch_a_not_b	SCALAR	(sketchA BYTES, sketchB BYTES) -> BYTES	Computes a sketch that represents the scalar set difference: sketchA and not sketchB. Param sketchA: the first sketch "A" as bytes. Param sketchB: the second sketch "B" as bytes. Defaults: seed = 9001. Returns: a Compact, Compressed Theta Sketch, as BYTES.
theta_sketch_intersection_seed	SCALAR	(sketchA BYTES, sketchB BYTES, seed INT64) -> BYTES	Computes a sketch that represents the scalar intersection of the two given sketches. Param sketchA: the first sketch as BYTES. Param sketchB: the second sketch as BYTES. Param seed: This is used to confirm that the given sketches were configured with the correct seed. Returns: a Compact, Compressed Theta Sketch, as BYTES.
theta_sketch_a_not_b_seed	SCALAR	(sketchA BYTES, sketchB BYTES, seed INT64) -> BYTES	Computes a sketch that represents the scalar set difference: sketchA and not sketchB. Param sketchA: the first sketch "A" as bytes. Param sketchB: the second sketch "B" as bytes. Param seed: This is used to confirm that the given sketches were configured with the correct seed. Returns: a Compact, Compressed Theta Sketch, as BYTES.
theta_sketch_union_lgk_seed	SCALAR	(sketchA BYTES, sketchB BYTES, lg_k BYTEINT, seed INT64) -> BYTES	Computes a sketch that represents the scalar union of the two given sketches. Param sketchA: the first sketch as BYTES. Param sketchB: the second sketch as BYTES. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 26]. Param seed: This is used to confirm that the given sketches were configured with the correct seed. Returns: a Compact, Compressed Theta Sketch, as BYTES.
theta_sketch_get_estimate_and_bounds	SCALAR	(sketch BYTES, num_std_devs BYTEINT) -> STRUCT<estimate FLOAT64, lower_bound FLOAT64, upper_bound FLOAT64>	Gets cardinality estimate and bounds from given sketch. Param sketch: The given sketch to query as BYTES. Param num_std_devs: The returned bounds will be based on the statistical confidence interval determined by the given number of standard deviations from the returned estimate. This number may be one of {1,2,3}, where 1 represents 68% confidence, 2 represents 95% confidence and 3 represents 99.7% confidence. For example, if the given num_std_devs = 2 and the returned values are {1000, 990, 1010} that means that with 95% confidence, the true value lies within the range [990, 1010]. Defaults: seed = 9001. Returns: a STRUCT with three FLOAT64 values as {estimate, lower_bound, upper_bound}.
theta_sketch_jaccard_similarity	SCALAR	(sketchA BYTES, sketchB BYTES) -> STRUCT<lower_bound FLOAT64, estimate FLOAT64, upper_bound FLOAT64>	Computes the Jaccard similarity index with upper and lower bounds. The Jaccard similarity index J(A,B) = (A ^ B)/(A U B) is used to measure how similar the two sketches are to each other. If J = 1.0, the sketches are considered equal. If J = 0, the two sketches are disjoint. A Jaccard of .95 means the overlap between the two sets is 95% of the union of the two sets. Param sketchA: the first sketch as bytes. Param sketchB: the second sketch as bytes. Defaults: seed = 9001. Returns: a STRUCT with three FLOAT64 values {lower_bound, estimate, upper_bound} of the Jaccard index.
theta_sketch_get_estimate_and_bounds_seed	SCALAR	(sketch BYTES, num_std_devs BYTEINT, seed INT64) -> STRUCT<estimate FLOAT64, lower_bound FLOAT64, upper_bound FLOAT64>	Gets cardinality estimate and bounds from given sketch. Param sketch: The given sketch to query as BYTES. Param num_std_devs: The returned bounds will be based on the statistical confidence interval determined by the given number of standard deviations from the returned estimate. This number may be one of {1,2,3}, where 1 represents 68% confidence, 2 represents 95% confidence and 3 represents 99.7% confidence. For example, if the given num_std_devs = 2 and the returned values are {1000, 990, 1010} that means that with 95% confidence, the true value lies within the range [990, 1010]. Param seed: This is used to confirm that the given sketch was configured with the correct seed. Returns: a STRUCT with three FLOAT64 values as {estimate, lower_bound, upper_bound}.
theta_sketch_jaccard_similarity_seed	SCALAR	(sketchA BYTES, sketchB BYTES, seed INT64) -> STRUCT<lower_bound FLOAT64, estimate FLOAT64, upper_bound FLOAT64>	Computes the Jaccard similarity index with upper and lower bounds. The Jaccard similarity index J(A,B) = (A ^ B)/(A U B) is used to measure how similar the two sketches are to each other. If J = 1.0, the sketches are considered equal. If J = 0, the two sketches are disjoint. A Jaccard of .95 means the overlap between the two sets is 95% of the union of the two sets. Param sketchA: the first sketch as bytes. Param sketchB: the second sketch as bytes. Param seed: This is used to confirm that the given sketches were configured with the correct seed. Returns: a STRUCT with three FLOAT64 values {lower_bound, estimate, upper_bound} of the Jaccard index.

Examples:

# using defaults
create or replace table `$BQ_DATASET`.theta_sketch(sketch bytes);

insert into `$BQ_DATASET`.theta_sketch
(select `$BQ_DATASET`.theta_sketch_agg_int64(value) from unnest(GENERATE_ARRAY(1, 10000, 1)) as value);
insert into `$BQ_DATASET`.theta_sketch
(select `$BQ_DATASET`.theta_sketch_agg_int64(value) from unnest(GENERATE_ARRAY(100000, 110000, 1)) as value);

# expected about 20000
select `$BQ_DATASET`.theta_sketch_get_estimate_and_bounds(
  `$BQ_DATASET`.theta_sketch_agg_union(sketch),
  2
) from `$BQ_DATASET`.theta_sketch;

# expected estimate about 20000
select `$BQ_DATASET`.theta_sketch_to_string(
  `$BQ_DATASET`.theta_sketch_agg_union(sketch)
) from `$BQ_DATASET`.theta_sketch;

select `$BQ_DATASET`.theta_sketch_get_theta(
  `$BQ_DATASET`.theta_sketch_agg_union(sketch)
) from `$BQ_DATASET`.theta_sketch;

select `$BQ_DATASET`.theta_sketch_get_num_retained(
  `$BQ_DATASET`.theta_sketch_agg_union(sketch)
) from `$BQ_DATASET`.theta_sketch;

drop table `$BQ_DATASET`.theta_sketch;

# using full signatures
create or replace table `$BQ_DATASET`.theta_sketch(sketch bytes);

insert into `$BQ_DATASET`.theta_sketch
(select `$BQ_DATASET`.theta_sketch_agg_int64_lgk_seed_p(value, struct<int, int, float64>(14, 111, 0.9)) from unnest(GENERATE_ARRAY(1, 10000, 1)) as value);
insert into `$BQ_DATASET`.theta_sketch
(select `$BQ_DATASET`.theta_sketch_agg_int64_lgk_seed_p(value, struct<int, int, float64>(14, 111, 0.9)) from unnest(GENERATE_ARRAY(100000, 110000, 1)) as value);

# expected about 20000
select `$BQ_DATASET`.theta_sketch_get_estimate_and_bounds_seed(
  `$BQ_DATASET`.theta_sketch_agg_union_lgk_seed(sketch, struct<int, int>(10, 111)),
  2,
  111
) from `$BQ_DATASET`.theta_sketch;

# expected estimate about 20000
select `$BQ_DATASET`.theta_sketch_to_string_seed(
  `$BQ_DATASET`.theta_sketch_agg_union_lgk_seed(sketch, struct<int, int>(10, 111)),
  111
) from `$BQ_DATASET`.theta_sketch;

select `$BQ_DATASET`.theta_sketch_get_theta_seed(
  `$BQ_DATASET`.theta_sketch_agg_union_lgk_seed(sketch, struct<int, int>(10, 111)),
  111
) from `$BQ_DATASET`.theta_sketch;

select `$BQ_DATASET`.theta_sketch_get_num_retained_seed(
  `$BQ_DATASET`.theta_sketch_agg_union_lgk_seed(sketch, struct<int, int>(10, 111)),
  111
) from `$BQ_DATASET`.theta_sketch;

drop table `$BQ_DATASET`.theta_sketch;

# using defaults
# expected 5
select `$BQ_DATASET`.theta_sketch_get_estimate(
  `$BQ_DATASET`.theta_sketch_union(
    (select `$BQ_DATASET`.theta_sketch_agg_string(str) from unnest(["a", "b", "c"]) as str),
    (select `$BQ_DATASET`.theta_sketch_agg_string(str) from unnest(["c", "d", "e"]) as str)
  )
);

# full signatures
# expected 5
select `$BQ_DATASET`.theta_sketch_get_estimate_seed(
  `$BQ_DATASET`.theta_sketch_union_lgk_seed(
    (select `$BQ_DATASET`.theta_sketch_agg_string_lgk_seed_p(str, struct<int, int, float64>(10, 111, 0.999)) from unnest(["a", "b", "c"]) as str),
    (select `$BQ_DATASET`.theta_sketch_agg_string_lgk_seed_p(str, struct<int, int, float64>(10, 111, 0.999)) from unnest(["c", "d", "e"]) as str),
    10,
    111
  ),
  111
);

# using defaults
# expected 1
select `$BQ_DATASET`.theta_sketch_get_estimate(
  `$BQ_DATASET`.theta_sketch_intersection(
    (select `$BQ_DATASET`.theta_sketch_agg_string(str) from unnest(["a", "b", "c"]) as str),
    (select `$BQ_DATASET`.theta_sketch_agg_string(str) from unnest(["c", "d", "e"]) as str)
  )
);

# full signatures
# expected 1
select `$BQ_DATASET`.theta_sketch_get_estimate_seed(
  `$BQ_DATASET`.theta_sketch_intersection_seed(
    (select `$BQ_DATASET`.theta_sketch_agg_string_lgk_seed_p(str, struct<int, int, float64>(10, 111, 0.999)) from unnest(["a", "b", "c"]) as str),
    (select `$BQ_DATASET`.theta_sketch_agg_string_lgk_seed_p(str, struct<int, int, float64>(10, 111, 0.999)) from unnest(["c", "d", "e"]) as str),
    111
  ),
  111
);

# using defaults
# expected 2
select `$BQ_DATASET`.theta_sketch_get_estimate(
  `$BQ_DATASET`.theta_sketch_a_not_b(
    (select `$BQ_DATASET`.theta_sketch_agg_string(str) from unnest(["a", "b", "c"]) as str),
    (select `$BQ_DATASET`.theta_sketch_agg_string(str) from unnest(["c", "d", "e"]) as str)
  )
);

# full signatures
# expected 2
select `$BQ_DATASET`.theta_sketch_get_estimate_seed(
  `$BQ_DATASET`.theta_sketch_a_not_b_seed(
    (select `$BQ_DATASET`.theta_sketch_agg_string_lgk_seed_p(str, struct<int, int, float64>(10, 111, 0.999)) from unnest(["a", "b", "c"]) as str),
    (select `$BQ_DATASET`.theta_sketch_agg_string_lgk_seed_p(str, struct<int, int, float64>(10, 111, 0.999)) from unnest(["c", "d", "e"]) as str),
    111
  ),
  111
);

# using defaults
# expected 0.2
select `$BQ_DATASET`.theta_sketch_jaccard_similarity(
  (select `$BQ_DATASET`.theta_sketch_agg_string(str) from unnest(["a", "b", "c"]) as str),
  (select `$BQ_DATASET`.theta_sketch_agg_string(str) from unnest(["c", "d", "e"]) as str)
);

# using full signatures
# expected 0.2
select `$BQ_DATASET`.theta_sketch_jaccard_similarity_seed(
  (select `$BQ_DATASET`.theta_sketch_agg_string_lgk_seed_p(str, struct<int, int, float64>(10, 111, 0.999)) from unnest(["a", "b", "c"]) as str),
  (select `$BQ_DATASET`.theta_sketch_agg_string_lgk_seed_p(str, struct<int, int, float64>(10, 111, 0.999)) from unnest(["c", "d", "e"]) as str),
  111
);

Tuple Sketch Functions

Description: Tuple sketches extend the functionality of Theta sketches by allowing you to associate a summary value with each item in the set. This enables calculations like the sum, minimum, or maximum of values associated with the distinct items.

For more information: Tuple sketches

Function Name	Function Type	Signature	Description
tuple_sketch_int64_agg_union	AGGREGATE	(sketch BYTES) -> BYTES	Builds a Tuple Sketch that represents the UNION of the given column of Tuple Sketches. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given column of Tuple Sketches with an INT64 summary column. This may not be NULL. Defaults: lg_k = 12, seed = 9001, mode = SUM. Returns: a Compact Tuple Sketch as BYTES.
tuple_sketch_int64_agg_string	AGGREGATE	(key STRING, value INT64) -> BYTES	Builds a Tuple Sketch from an STRING Key column and an INT64 value column. Multiple values for the same key are aggregated using the default mode. Note that cardinality estimation accuracy, plots, error tables, and sampling probability p are the same as the Theta Sketch. This function only applies to Tuple Sketches with an STRING Key column and an INT64 summary column. Param key: the STRING column of identifiers. This may not be NULL. Param value: the INT64 value column associated with each key. This may not be NULL. Defaults: lg_k = 12, seed = 9001, p = 1.0, mode = SUM. Returns: a Compact Tuple Sketch as BYTES.
tuple_sketch_int64_agg_int64	AGGREGATE	(key INT64, value INT64) -> BYTES	Builds a Tuple Sketch from an INT64 Key column and an INT64 value column. Multiple values for the same key are aggregated using the default mode. Note that cardinality estimation accuracy, plots, error tables, and sampling probability p are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 Key column and an INT64 summary column. Param key: the INT64 key column of identifiers. This may not be NULL. Param value: the INT64 value column associated with each key. This may not be NULL. Defaults: lg_k = 12, seed = 9001, p = 1.0, mode = SUM. Returns: a Compact Tuple Sketch as BYTES.
tuple_sketch_int64_agg_union_lgk_seed_mode	AGGREGATE	(sketch BYTES, params STRUCT<lg_k BYTEINT, seed INT64, mode STRING> NOT AGGREGATE) -> BYTES	Builds a Tuple Sketch that represents the UNION of the given column of Tuple Sketches. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given column of Tuple Sketches with an INT64 summary column. This may not be NULL. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 26]. A NULL specifies the default lg_k of 12. Param seed: the seed to be used by the underlying hash function. A NULL specifies the default seed of 9001. Param mode: aggregation mode for the summary field: one of { SUM, MIN, MAX, ONE (constant 1) }. A NULL specifies the default = SUM. Returns: a Compact Tuple Sketch as BYTES.
tuple_sketch_int64_agg_int64_lgk_seed_p_mode	AGGREGATE	(key INT64, value INT64, params STRUCT<lg_k BYTEINT, seed INT64, p FLOAT64, mode STRING> NOT AGGREGATE) -> BYTES	Builds a Tuple Sketch from an INT64 Key column and an INT64 value column. Multiple values for the same key are aggregated using one of the selectable operations: { SUM, MIN, MAX, ONE (constant 1) }. Note that cardinality estimation accuracy, plots, error tables, and sampling probability p are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 Key column and an INT64 summary column. Param key: the INT64 key column of identifiers. This may not be NULL. Param value: the INT64 value column associated with each key. This may not be NULL. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 26]. A NULL specifies the default lg_k of 12. Param seed: the seed to be used by the underlying hash function. A NULL specifies the default seed of 9001. Param p: up-front sampling probability. A NULL specifies the default of 1.0. Param mode: aggregation mode for the summary field: one of { SUM, MIN, MAX, ONE (constant 1) }. A NULL specifies the default = SUM. Returns: a Compact Tuple Sketch as BYTES.
tuple_sketch_int64_agg_string_lgk_seed_p_mode	AGGREGATE	(key STRING, value INT64, params STRUCT<lg_k BYTEINT, seed INT64, p FLOAT64, mode STRING> NOT AGGREGATE) -> BYTES	Builds a Tuple Sketch from an STRING Key column and an INT64 value column. Multiple values for the same key are aggregated using one of the selectable operations: SUM, MIN, MAX, ONE. Note that cardinality estimation accuracy, plots, error tables, and sampling probability p are the same as the Theta Sketch. This function only applies to Tuple Sketches with an STRING Key column and an INT64 summary column. Param key: the STRING key column of identifiers. This may not be NULL. Param value: the INT64 value column associated with each key. This may not be NULL. Param lg_k: the sketch accuracy/size parameter as an integer in the range [4, 26]. A NULL specifies the default lg_k of 12. Param seed: the seed to be used by the underlying hash function. A NULL specifies the default seed of 9001. Param p: up-front sampling probability. A NULL specifies the default of 1.0. Param mode: aggregation mode for the summary field: one of { SUM, MIN, MAX, ONE (constant 1) }. A NULL specifies the default = SUM. Returns: a Compact Tuple Sketch as BYTES.
tuple_sketch_int64_to_string	SCALAR	(sketch BYTES) -> STRING	Returns a human readable STRING that is a short summary of the state of this sketch. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the sketch to be summarized. This may not be NULL. Defaults: seed = 9001. Returns: A human readable STRING that is a short summary of the state of this sketch.
tuple_sketch_int64_get_estimate	SCALAR	(sketch BYTES) -> FLOAT64	Returns the cardinality estimate of the given Tuple Sketch. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Defaults: seed = 9001. Returns: the cardinality estimate of the given Tuple Sketch
tuple_sketch_int64_get_theta	SCALAR	(sketch BYTES) -> FLOAT64	Returns theta (effective sampling rate) as a fraction from 0 to 1. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Defaults: seed = 9001. Returns: theta as FLOAT64.
tuple_sketch_int64_get_num_retained	SCALAR	(sketch BYTES) -> INT	Returns the number of retained entries in the given sketch. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Defaults: seed = 9001. Returns: number of retained entries as INT.
tuple_sketch_int64_get_theta_seed	SCALAR	(sketch BYTES, seed INT64) -> FLOAT64	Returns theta (effective sampling rate) as a fraction from 0 to 1. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: theta as FLOAT64.
tuple_sketch_int64_get_num_retained_seed	SCALAR	(sketch BYTES, seed INT64) -> INT	Returns the number of retained entries in the given sketch. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: number of retained entries as INT.
tuple_sketch_int64_to_string_seed	SCALAR	(sketch BYTES, seed INT64) -> STRING	Returns a human readable STRING that is a short summary of the state of this sketch. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the sketch to be summarized. This may not be NULL. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: A human readable STRING that is a short summary of the state of this sketch.
tuple_sketch_int64_a_not_b	SCALAR	(sketchA BYTES, sketchB BYTES) -> BYTES	Computes a sketch that represents the set difference of sketchA and not sketchB. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketchA: the first sketch "A" as BYTES. This may not be NULL. Param sketchB: the second sketch "B" as BYTES. This may not be NULL. Defaults: seed = 9001. Returns: a Compact Tuple Sketch as BYTES.
tuple_sketch_int64_from_theta_sketch	SCALAR	(sketch BYTES, value INT64) -> BYTES	Converts the given Theta Sketch into a Tuple Sketch with a INT64 summary column set to the given INT64 value. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. Param sketch: the given Theta Sketch. This may not be NULL. Param value: the given INT64 value. This may not be NULL. Defaults: seed = 9001. Returns: a Tuple Sketch with an INT64 summary column as BYTES.
tuple_sketch_int64_get_estimate_seed	SCALAR	(sketch BYTES, seed INT64) -> FLOAT64	Returns the cardinality estimate of the given Tuple Sketch. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: the cardinality estimate of the given Tuple Sketch
tuple_sketch_int64_intersection	SCALAR	(sketchA BYTES, sketchB BYTES) -> BYTES	Computes a sketch that represents the scalar intersection of sketchA and sketchB. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketchA: the first sketch "A" as BYTES. Param sketchB: the second sketch "B" as BYTES. Defaults: seed = 9001. Returns: a Compact Tuple Sketch as BYTES.
tuple_sketch_int64_union	SCALAR	(sketchA BYTES, sketchB BYTES) -> BYTES	Computes a Tuple Sketch that represents the UNION of sketchA and sketchB. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketchA: the first sketch "A" as BYTES. This may not be NULL. Param sketchB: the second sketch "B" as BYTES. This may not be NULL. Defaults: seed = 9001. Returns: a Compact Tuple Sketch as BYTES.
tuple_sketch_int64_from_theta_sketch_seed	SCALAR	(sketch BYTES, value INT64, seed INT64) -> BYTES	Converts the given Theta Sketch into a Tuple Sketch with a INT64 summary column set to the given INT64 value. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. Param sketch: the given Theta Sketch. This may not be NULL. Param value: the given INT64 value. This may not be NULL. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: a Tuple Sketch with an INT64 summary column as BYTES.
tuple_sketch_int64_a_not_b_seed	SCALAR	(sketchA BYTES, sketchB BYTES, seed INT64) -> BYTES	Computes a sketch that represents the scalar set difference of sketchA and not sketchB. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketchA: the first sketch "A" as BYTES. This may not be NULL. Param sketchB: the second sketch "B" as BYTES. This may not be NULL. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: a Compact Tuple Sketch as BYTES.
tuple_sketch_int64_filter_low_high	SCALAR	(sketch BYTES, low INT64, high INT64) -> BYTES	Returns a Tuple Sketch computed from the given sketch filtered by the given low and high values. This returns a compact tuple sketch that contains the subset of rows of the give sketch where the summary column is greater-than or equal to the given low and less-than or equal to the given high. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Param low: the given low INT64. This may not be NULL. Param high: the given high INT64. This may not be NULL. Defaults: seed = 9001. Returns: a Compact Tuple Sketch as BYTES.
tuple_sketch_int64_get_estimate_and_bounds	SCALAR	(sketch BYTES, num_std_devs BYTEINT) -> STRUCT<estimate FLOAT64, lower_bound FLOAT64, upper_bound FLOAT64>	Returns the cardinality estimate and bounds from the given Tuple Sketch. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Param num_std_devs: The returned bounds will be based on the statistical confidence interval determined by the given number of standard deviations from the returned estimate. This number may be one of {1,2,3}, where 1 represents 68% confidence, 2 represents 95% confidence and 3 represents 99.7% confidence. For example, if the given num_std_devs = 2 and the returned values are {1000, 990, 1010} that means that with 95% confidence, the true value lies within the range [990, 1010]. Defaults: seed = 9001. Returns: a STRUCT with three FLOAT64 values as {estimate, lower_bound, upper_bound}.
tuple_sketch_int64_filter_low_high_seed	SCALAR	(sketch BYTES, low INT64, high INT64, seed INT64) -> BYTES	Returns a Tuple Sketch computed from the given sketch filtered by the given low and high values. This returns a compact tuple sketch that contains the subset of rows of the give sketch where the summary column is greater-than or equal to the given low and less-than or equal to the given high. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Param low: the given low INT64. This may not be NULL. Param high: the given high INT64. This may not be NULL. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: a Compact Tuple Sketch as BYTES.
tuple_sketch_int64_jaccard_similarity	SCALAR	(sketchA BYTES, sketchB BYTES) -> STRUCT<lower_bound FLOAT64, estimate FLOAT64, upper_bound FLOAT64>	Computes the Jaccard similarity index with upper and lower bounds. The Jaccard similarity index J(A,B) = (A ^ B)/(A U B) is used to measure how similar the two sketches are to each other. If J = 1.0, the sketches are considered equal. If J = 0, the two sketches are disjoint. A Jaccard of .95 means the overlap between the two sets is 95% of the union of the two sets. This function only applies to Tuple Sketches with an INT64 summary column. Param sketchA: the first sketch as bytes. This may not be NULL. Param sketchB: the second sketch as bytes. This may not be NULL. Defaults: seed = 9001. Returns: a STRUCT with three FLOAT64 values {lower_bound, estimate, upper_bound} of the Jaccard index.
tuple_sketch_int64_get_sum_estimate_and_bounds	SCALAR	(sketch BYTES, num_std_devs BYTEINT) -> STRUCT<sum_estimate FLOAT64, sum_lower_bound FLOAT64, sum_upper_bound FLOAT64>	Returns the estimate and bounds for the sum of the INT64 summary column scaled to the original population from the given Tuple Sketch. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Param num_std_devs: The returned bounds will be based on the statistical confidence interval determined by the given number of standard deviations from the returned estimate. This number may be one of {1,2,3}, where 1 represents 68% confidence, 2 represents 95% confidence and 3 represents 99.7% confidence. For example, if the given num_std_devs = 2 and the returned values are {1000, 990, 1010} that means that with 95% confidence, the true value lies within the range [990, 1010]. Defaults: seed = 9001. Returns: a STRUCT with three FLOAT64 values as {sum_estimate, sum_lower_bound, sum_upper_bound}.
tuple_sketch_int64_intersection_seed_mode	SCALAR	(sketchA BYTES, sketchB BYTES, seed INT64, mode STRING) -> BYTES	Computes a sketch that represents the scalar intersection of sketchA and sketchB. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketchA: the first sketch "A" as BYTES. Param sketchB: the second sketch "B" as BYTES. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: a Compact Tuple Sketch as BYTES.
tuple_sketch_int64_get_sum_estimate_and_bounds_seed	SCALAR	(sketch BYTES, num_std_devs BYTEINT, seed INT64) -> STRUCT<sum_estimate FLOAT64, sum_lower_bound FLOAT64, sum_upper_bound FLOAT64>	Returns the estimate and bounds for the sum of the INT64 summary column scaled to the original population from the given Tuple Sketch. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Param num_std_devs: The returned bounds will be based on the statistical confidence interval determined by the given number of standard deviations from the returned estimate. This number may be one of {1,2,3}, where 1 represents 68% confidence, 2 represents 95% confidence and 3 represents 99.7% confidence. For example, if the given num_std_devs = 2 and the returned values are {1000, 990, 1010} that means that with 95% confidence, the true value lies within the range [990, 1010]. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: a STRUCT with three FLOAT64 values as {sum_estimate, sum_lower_bound, sum_upper_bound}.
tuple_sketch_int64_union_lgk_seed_mode	SCALAR	(sketchA BYTES, sketchB BYTES, lg_k BYTEINT, seed INT64, mode STRING) -> BYTES	Computes a Tuple Sketch that represents the UNION of sketchA and sketchB. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketchA: the first sketch "A" as BYTES. This may not be NULL. Param sketchB: the second sketch "B" as BYTES. This may not be NULL. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: a Compact Tuple Sketch as BYTES.
tuple_sketch_int64_get_estimate_and_bounds_seed	SCALAR	(sketch BYTES, num_std_devs BYTEINT, seed INT64) -> STRUCT<estimate FLOAT64, lower_bound FLOAT64, upper_bound FLOAT64>	Returns the cardinality estimate and bounds from the given Tuple Sketch. Note that cardinality estimation accuracy, plots, and error tables are the same as the Theta Sketch. This function only applies to Tuple Sketches with an INT64 summary column. Param sketch: the given Tuple Sketch. This may not be NULL. Param num_std_devs: The returned bounds will be based on the statistical confidence interval determined by the given number of standard deviations from the returned estimate. This number may be one of {1,2,3}, where 1 represents 68% confidence, 2 represents 95% confidence and 3 represents 99.7% confidence. For example, if the given num_std_devs = 2 and the returned values are {1000, 990, 1010} that means that with 95% confidence, the true value lies within the range [990, 1010]. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: a STRUCT with three FLOAT64 values as {estimate, lower_bound, upper_bound}.
tuple_sketch_int64_jaccard_similarity_seed	SCALAR	(sketchA BYTES, sketchB BYTES, seed INT64) -> STRUCT<lower_bound FLOAT64, estimate FLOAT64, upper_bound FLOAT64>	Computes the Jaccard similarity index with upper and lower bounds. The Jaccard similarity index J(A,B) = (A ^ B)/(A U B) is used to measure how similar the two sketches are to each other. If J = 1.0, the sketches are considered equal. If J = 0, the two sketches are disjoint. A Jaccard of .95 means the overlap between the two sets is 95% of the union of the two sets. This function only applies to Tuple Sketches with an INT64 summary column. Param sketchA: the first sketch as bytes. This may not be NULL. Param sketchB: the second sketch as bytes. This may not be NULL. Param seed: This is used to confirm that the given sketches were configured with the correct seed. A NULL specifies the default seed = 9001. Returns: a STRUCT with three FLOAT64 values {lower_bound, estimate, upper_bound} of the Jaccard index.

Examples:

# using defaults
create or replace table `$BQ_DATASET`.tuple_sketch(sketch bytes);

insert into `$BQ_DATASET`.tuple_sketch
(select `$BQ_DATASET`.tuple_sketch_int64_from_theta_sketch(`$BQ_DATASET`.theta_sketch_agg_string(cast(value as string)), 1) from unnest(GENERATE_ARRAY(1, 10000, 1)) as value);
insert into `$BQ_DATASET`.tuple_sketch
(select `$BQ_DATASET`.tuple_sketch_int64_from_theta_sketch(`$BQ_DATASET`.theta_sketch_agg_string(cast(value as string)), 1) from unnest(GENERATE_ARRAY(100000, 110000, 1)) as value);

# expected about 20000
select `$BQ_DATASET`.tuple_sketch_int64_get_estimate(
  `$BQ_DATASET`.tuple_sketch_int64_agg_union(sketch)
) from `$BQ_DATASET`.tuple_sketch;

select `$BQ_DATASET`.tuple_sketch_int64_get_estimate_and_bounds(
  `$BQ_DATASET`.tuple_sketch_int64_agg_union(sketch),
  2
) from `$BQ_DATASET`.tuple_sketch;

select `$BQ_DATASET`.tuple_sketch_int64_get_sum_estimate_and_bounds(
  `$BQ_DATASET`.tuple_sketch_int64_agg_union(sketch),
  2
) from `$BQ_DATASET`.tuple_sketch;

# expected estimate about 20000
select `$BQ_DATASET`.tuple_sketch_int64_to_string(
  `$BQ_DATASET`.tuple_sketch_int64_agg_union(sketch)
) from `$BQ_DATASET`.tuple_sketch;

select `$BQ_DATASET`.tuple_sketch_int64_get_theta(
  `$BQ_DATASET`.tuple_sketch_int64_agg_union(sketch)
) from `$BQ_DATASET`.tuple_sketch;

select `$BQ_DATASET`.tuple_sketch_int64_get_num_retained(
  `$BQ_DATASET`.tuple_sketch_int64_agg_union(sketch)
) from `$BQ_DATASET`.tuple_sketch;

drop table `$BQ_DATASET`.tuple_sketch;

# using full signatures
create or replace table `$BQ_DATASET`.tuple_sketch(sketch bytes);

insert into `$BQ_DATASET`.tuple_sketch
(select `$BQ_DATASET`.tuple_sketch_int64_from_theta_sketch_seed(
  `$BQ_DATASET`.theta_sketch_agg_string_lgk_seed_p(cast(value as string), STRUCT<BYTEINT, INT64, FLOAT64>(10, 111, 0.999)),
  1,
  111
) from unnest(GENERATE_ARRAY(1, 10000, 1)) as value);
insert into `$BQ_DATASET`.tuple_sketch
(select `$BQ_DATASET`.tuple_sketch_int64_from_theta_sketch_seed(
  `$BQ_DATASET`.theta_sketch_agg_string_lgk_seed_p(cast(value as string), STRUCT<BYTEINT, INT64, FLOAT64>(10, 111, 0.999)),
  1,
  111
) from unnest(GENERATE_ARRAY(100000, 110000, 1)) as value);

# expected about 20000
select `$BQ_DATASET`.tuple_sketch_int64_get_estimate_seed(
  `$BQ_DATASET`.tuple_sketch_int64_agg_union_lgk_seed_mode(sketch, STRUCT<BYTEINT, INT64, STRING>(10, 111, "NOP")),
  111
) from `$BQ_DATASET`.tuple_sketch;

select `$BQ_DATASET`.tuple_sketch_int64_get_estimate_and_bounds_seed(
  `$BQ_DATASET`.tuple_sketch_int64_agg_union_lgk_seed_mode(sketch, STRUCT<BYTEINT, INT64, STRING>(10, 111, "NOP")),
  2,
  111
) from `$BQ_DATASET`.tuple_sketch;

select `$BQ_DATASET`.tuple_sketch_int64_get_sum_estimate_and_bounds_seed(
  `$BQ_DATASET`.tuple_sketch_int64_agg_union_lgk_seed_mode(sketch, STRUCT<BYTEINT, INT64, STRING>(10, 111, "NOP")),
  2,
  111
) from `$BQ_DATASET`.tuple_sketch;

# expected estimate about 20000
select `$BQ_DATASET`.tuple_sketch_int64_to_string_seed(
  `$BQ_DATASET`.tuple_sketch_int64_agg_union_lgk_seed_mode(sketch, STRUCT<BYTEINT, INT64, STRING>(10, 111, "NOP")),
  111
) from `$BQ_DATASET`.tuple_sketch;

select `$BQ_DATASET`.tuple_sketch_int64_get_theta_seed(
  `$BQ_DATASET`.tuple_sketch_int64_agg_union_lgk_seed_mode(sketch, STRUCT<BYTEINT, INT64, STRING>(10, 111, "NOP")),
  111
) from `$BQ_DATASET`.tuple_sketch;

select `$BQ_DATASET`.tuple_sketch_int64_get_num_retained_seed(
  `$BQ_DATASET`.tuple_sketch_int64_agg_union_lgk_seed_mode(sketch, STRUCT<BYTEINT, INT64, STRING>(10, 111, "NOP")),
  111
) from `$BQ_DATASET`.tuple_sketch;

drop table `$BQ_DATASET`.tuple_sketch;


# using defaults
# expected 5
select `$BQ_DATASET`.tuple_sketch_int64_get_estimate(
  `$BQ_DATASET`.tuple_sketch_int64_union(
    (select `$BQ_DATASET`.tuple_sketch_int64_agg_int64(key, 1) from unnest([1, 2, 3]) as key),
    (select `$BQ_DATASET`.tuple_sketch_int64_agg_int64(key, 1) from unnest([3, 4, 5]) as key)
  )
);

# using full signatures
# expected 5
select `$BQ_DATASET`.tuple_sketch_int64_get_estimate_seed(
  `$BQ_DATASET`.tuple_sketch_int64_union_lgk_seed_mode(
    (select `$BQ_DATASET`.tuple_sketch_int64_agg_int64_lgk_seed_p_mode(key, 1, STRUCT<BYTEINT, INT64, FLOAT64, STRING>(10, 111, 0.999, "MIN")) from unnest([1, 2, 3]) as key),
    (select `$BQ_DATASET`.tuple_sketch_int64_agg_int64_lgk_seed_p_mode(key, 1, STRUCT<BYTEINT, INT64, FLOAT64, STRING>(10, 111, 0.999, "MIN")) from unnest([3, 4, 5]) as key),
    10,
    111,
    "MIN"
  ),
  111
);

# using defaults
# expected 1
select `$BQ_DATASET`.tuple_sketch_int64_get_estimate(
  `$BQ_DATASET`.tuple_sketch_int64_intersection(
    (select `$BQ_DATASET`.tuple_sketch_int64_agg_string(str, 1) from unnest(["a", "b", "c"]) as str),
    (select `$BQ_DATASET`.tuple_sketch_int64_agg_string(str, 1) from unnest(["c", "d", "e"]) as str)
  )
);

# using full signatures
# expected 1
select `$BQ_DATASET`.tuple_sketch_int64_get_estimate_seed(
  `$BQ_DATASET`.tuple_sketch_int64_intersection_seed_mode(
    (select `$BQ_DATASET`.tuple_sketch_int64_agg_string_lgk_seed_p_mode(str, 1, STRUCT<BYTEINT, INT64, FLOAT64, STRING>(10, 111, 0.999, "MIN")) from unnest(["a", "b", "c"]) as str),
    (select `$BQ_DATASET`.tuple_sketch_int64_agg_string_lgk_seed_p_mode(str, 1, STRUCT<BYTEINT, INT64, FLOAT64, STRING>(10, 111, 0.999, "MIN")) from unnest(["c", "d", "e"]) as str),
    111,
    "MIN"
  ),
  111
);

# using defaults
# expected 2
select `$BQ_DATASET`.tuple_sketch_int64_get_estimate(
  `$BQ_DATASET`.tuple_sketch_int64_a_not_b(
    (select `$BQ_DATASET`.tuple_sketch_int64_agg_string(str, 1) from unnest(["a", "b", "c"]) as str),
    (select `$BQ_DATASET`.tuple_sketch_int64_agg_string(str, 1) from unnest(["c", "d", "e"]) as str)
  )
);

# using full signatures
# expected 2
select `$BQ_DATASET`.tuple_sketch_int64_get_estimate_seed(
  `$BQ_DATASET`.tuple_sketch_int64_a_not_b_seed(
    (select `$BQ_DATASET`.tuple_sketch_int64_agg_string_lgk_seed_p_mode(str, 1, STRUCT<BYTEINT, INT64, FLOAT64, STRING>(10, 111, 0.999, "MIN")) from unnest(["a", "b", "c"]) as str),
    (select `$BQ_DATASET`.tuple_sketch_int64_agg_string_lgk_seed_p_mode(str, 1, STRUCT<BYTEINT, INT64, FLOAT64, STRING>(10, 111, 0.999, "MIN")) from unnest(["c", "d", "e"]) as str),
    111
  ),
  111
);

# using defaults
# expected 0.2
select `$BQ_DATASET`.tuple_sketch_int64_jaccard_similarity(
  (select `$BQ_DATASET`.tuple_sketch_int64_agg_string(str, 1) from unnest(["a", "b", "c"]) as str),
  (select `$BQ_DATASET`.tuple_sketch_int64_agg_string(str, 1) from unnest(["c", "d", "e"]) as str)
);

# using full signatures
# expected 0.2
select `$BQ_DATASET`.tuple_sketch_int64_jaccard_similarity_seed(
  (select `$BQ_DATASET`.tuple_sketch_int64_agg_string_lgk_seed_p_mode(str, 1, STRUCT<BYTEINT, INT64, FLOAT64, STRING>(10, 111, 0.999, "NOP")) from unnest(["a", "b", "c"]) as str),
  (select `$BQ_DATASET`.tuple_sketch_int64_agg_string_lgk_seed_p_mode(str, 1, STRUCT<BYTEINT, INT64, FLOAT64, STRING>(10, 111, 0.999, "NOP")) from unnest(["c", "d", "e"]) as str),
  111
);

# using defaults
# expected 1 entry
select `$BQ_DATASET`.tuple_sketch_int64_to_string(
  `$BQ_DATASET`.tuple_sketch_int64_filter_low_high(
    `$BQ_DATASET`.tuple_sketch_int64_agg_string(key, 1),
    2,
    2
  )
) from unnest(["a", "b", "c", "c"]) as key;

# using full signatures
# expected 1 entry
select `$BQ_DATASET`.tuple_sketch_int64_to_string_seed(
  `$BQ_DATASET`.tuple_sketch_int64_filter_low_high_seed(
    `$BQ_DATASET`.tuple_sketch_int64_agg_string_lgk_seed_p_mode(key, 1, STRUCT<BYTEINT, INT64, FLOAT64, STRING>(10, 111, 0.999, "SUM")),
    2,
    2,
    111
  ),
  111
) from unnest(["a", "b", "c", "c"]) as key;

TDigest Functions

Description: Similar to KLL sketch, estimates distributions of numeric values, provides approximate quantiles and ranks.

For more information: t-digest

Function Name	Function Type	Signature	Description
tdigest_double_build	AGGREGATE	(value FLOAT64) -> BYTES	Creates a sketch that represents the distribution of the given column. Param value: the column of FLOAT64 values. Defaults: k = 200. Returns: a t-Digest, as bytes.
tdigest_double_merge	AGGREGATE	(sketch BYTES) -> BYTES	Merges sketches from the given column. Param sketch: the column of values. Defaults: k = 200. Returns: a serialized t-Digest as BYTES.
tdigest_double_merge_k	AGGREGATE	(sketch BYTES, k INT NOT AGGREGATE) -> BYTES	Merges sketches from the given column. Param sketch: the column of values. Param k: the sketch accuracy/size parameter as an integer in the range [10, 65535]. Returns: a serialized t-Digest as BYTES.
tdigest_double_build_k	AGGREGATE	(value FLOAT64, k INT NOT AGGREGATE) -> BYTES	Creates a sketch that represents the distribution of the given column. Param value: the column of FLOAT64 values. Param k: the sketch accuracy/size parameter as an INT in the range [10, 65535]. Returns: a t-Digest, as bytes.
tdigest_double_get_max_value	SCALAR	(sketch BYTES) -> FLOAT64	Returns the maximum value of the input stream. Param sketch: the given sketch as BYTES. Returns: max value as FLOAT64
tdigest_double_to_string	SCALAR	(sketch BYTES) -> STRING	Returns a summary string that represents the state of the given sketch. Param sketch: the given sketch as sketch encoded bytes. Returns: a string that represents the state of the given sketch.
tdigest_double_get_total_weight	SCALAR	(sketch BYTES) -> INT64	Returns the total weight of the input stream. Param sketch: the given sketch as BYTES. Returns: total weight as INT64
tdigest_double_get_min_value	SCALAR	(sketch BYTES) -> FLOAT64	Returns the minimum value of the input stream. Param sketch: the given sketch as BYTES. Returns: min value as FLOAT64
tdigest_double_get_rank	SCALAR	(sketch BYTES, value FLOAT64) -> FLOAT64	Returns an approximation to the normalized rank, on the interval [0.0, 1.0], of the given value. Param sketch: the given sketch in serialized form. Param value: value to be ranked. Returns: an approximate rank of the given value.
tdigest_double_get_quantile	SCALAR	(sketch BYTES, rank FLOAT64) -> FLOAT64	Returns a value from the sketch that is the best approximation to a value from the original stream with the given rank. Param sketch: the given sketch in serialized form. Param rank: rank of a value in the hypothetical sorted stream. Returns: an approximate quantile associated with the given rank.

Examples:

create or replace table `$BQ_DATASET`.tdigest_double(sketch bytes);

# using default
insert into `$BQ_DATASET`.tdigest_double
(select `$BQ_DATASET`.tdigest_double_build(value) from unnest([1,2,3,4,5,6,7,8,9,10]) as value);

# using full signature
insert into `$BQ_DATASET`.tdigest_double
(select `$BQ_DATASET`.tdigest_double_build_k(value, 100) from unnest([11,12,13,14,15,16,17,18,19,20]) as value);

select `$BQ_DATASET`.tdigest_double_to_string(sketch) from `$BQ_DATASET`.tdigest_double;

# using default
select `$BQ_DATASET`.tdigest_double_to_string(`$BQ_DATASET`.tdigest_double_merge(sketch)) from `$BQ_DATASET`.tdigest_double;

# using full signature
select `$BQ_DATASET`.tdigest_double_to_string(`$BQ_DATASET`.tdigest_double_merge_k(sketch, 100)) from `$BQ_DATASET`.tdigest_double;

# expected 0.5
select `$BQ_DATASET`.tdigest_double_get_rank(`$BQ_DATASET`.tdigest_double_merge(sketch), 10) from `$BQ_DATASET`.tdigest_double;

# expected 10
select `$BQ_DATASET`.tdigest_double_get_quantile(`$BQ_DATASET`.tdigest_double_merge(sketch), 0.5) from `$BQ_DATASET`.tdigest_double;

# expected 20
select `$BQ_DATASET`.tdigest_double_get_total_weight(`$BQ_DATASET`.tdigest_double_merge(sketch)) from `$BQ_DATASET`.tdigest_double;

# expected 1
select `$BQ_DATASET`.tdigest_double_get_min_value(`$BQ_DATASET`.tdigest_double_merge(sketch)) from `$BQ_DATASET`.tdigest_double;

# expected 20
select `$BQ_DATASET`.tdigest_double_get_max_value(`$BQ_DATASET`.tdigest_double_merge(sketch)) from `$BQ_DATASET`.tdigest_double;

drop table `$BQ_DATASET`.tdigest_double;

Name		Name	Last commit message	Last commit date
Latest commit History 216 Commits
cpc		cpc
datasketches-cpp @ da8c41f		datasketches-cpp @ da8c41f
fi		fi
hll		hll
kll		kll
tdigest		tdigest
theta		theta
tuple		tuple
.asf.yaml		.asf.yaml
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE		LICENSE
Makefile		Makefile
NOTICE		NOTICE
README.md		README.md
README_template.md		README_template.md
readme_generator.py		readme_generator.py
substitute_and_run.sh		substitute_and_run.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Apache DataSketches library functions for Google BigQuery

Requirements

Building, Installing, and Testing

BigQuery Sketch Functions

CPC Sketch Functions

FI Sketch Functions

HLL Sketch Functions

KLL Sketch Functions

Theta Sketch Functions

Tuple Sketch Functions

TDigest Functions

About

Releases

Packages

Contributors 4

Languages

License

apache/datasketches-bigquery

Folders and files

Latest commit

History

Repository files navigation

Apache DataSketches library functions for Google BigQuery

Requirements

Building, Installing, and Testing

BigQuery Sketch Functions

CPC Sketch Functions

FI Sketch Functions

HLL Sketch Functions

KLL Sketch Functions

Theta Sketch Functions

Tuple Sketch Functions

TDigest Functions

About

Resources

License

Code of conduct

Security policy

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages