From b82036f8433e936fbbf586f8f0f7a9138cc2feb0 Mon Sep 17 00:00:00 2001
From: xzhangxian1008
Date: Sat, 8 Feb 2025 09:38:34 +0800
Subject: [PATCH 1/6] init

---
 .../aggregate-group-by-functions.md           | 26 +++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/functions-and-operators/aggregate-group-by-functions.md b/functions-and-operators/aggregate-group-by-functions.md
index 6fe8a1069894f..e40570f647d16 100644
--- a/functions-and-operators/aggregate-group-by-functions.md
+++ b/functions-and-operators/aggregate-group-by-functions.md
@@ -64,6 +64,32 @@ In addition, TiDB also provides the following aggregate functions:
 
 Except for the `GROUP_CONCAT()` and `APPROX_PERCENTILE()` functions, all the preceding functions can serve as [Window functions](/functions-and-operators/window-functions.md).
 
++ `APPROX_COUNT_DISTINCT(expr)`
+
+    This function returns the approximate distinct count of `expr`. It uses `BJKST` algorithm and consumes less memory during simultaneous computation of cardinality for a large number of data sets whose cardinality has power law distribution. Moreover, it's very accurate for data sets with small cardinality and very efficient on CPU.
+
+    The following example shows how to use this function:
+
+    ```sql
+    DROP TABLE IF EXISTS t;
+    CREATE TABLE t(a INT, b INT);
+    INSERT INTO t VALUES(1, 1), (2, 1), (2, 1), (3, 1), (5, 2), (5, 2), (6, 2), (7, 2);
+    ```
+
+    ```sql
+    SELECT APPROX_COUNT_DISTINCT(a) FROM t GROUP BY b;
+    ```
+
+    ```sql
+    +--------------------------+
+    | APPROX_COUNT_DISTINCT(a) |
+    +--------------------------+
+    |                        3 |
+    |                        3 |
+    +--------------------------+
+    1 row in set (0.00 sec)
+    ```
+
 ## GROUP BY modifiers
 
 TiDB does not currently support `GROUP BY` modifiers such as `WITH ROLLUP`. We plan to add support in the future. See [TiDB #4250](https://github.com/pingcap/tidb/issues/4250).

From 6ea13c1d54823d3291e0e653e7f5031aea27d328 Mon Sep 17 00:00:00 2001
From: xzhangxian1008
Date: Sat, 8 Feb 2025 16:19:22 +0800
Subject: [PATCH 2/6] address comments

---
 .../aggregate-group-by-functions.md           | 24 +++++++++----------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/functions-and-operators/aggregate-group-by-functions.md b/functions-and-operators/aggregate-group-by-functions.md
index e40570f647d16..bad370c69623a 100644
--- a/functions-and-operators/aggregate-group-by-functions.md
+++ b/functions-and-operators/aggregate-group-by-functions.md
@@ -64,30 +64,30 @@ In addition, TiDB also provides the following aggregate functions:
 
 Except for the `GROUP_CONCAT()` and `APPROX_PERCENTILE()` functions, all the preceding functions can serve as [Window functions](/functions-and-operators/window-functions.md).
 
-+ `APPROX_COUNT_DISTINCT(expr)`
++ `APPROX_COUNT_DISTINCT(expr, [expr...])`
 
-    This function returns the approximate distinct count of `expr`. It uses `BJKST` algorithm and consumes less memory during simultaneous computation of cardinality for a large number of data sets whose cardinality has power law distribution. Moreover, it's very accurate for data sets with small cardinality and very efficient on CPU.
+    The usage of this function is almost same with `COUNT(DISTINCT)` but returns approximate result. It uses `BJKST` algorithm and consumes less memory during simultaneous computation of cardinality for a large number of data sets whose cardinality has power law distribution. Moreover, it's very accurate for data sets with small cardinality and very efficient on CPU.
 
     The following example shows how to use this function:
 
     ```sql
     DROP TABLE IF EXISTS t;
-    CREATE TABLE t(a INT, b INT);
-    INSERT INTO t VALUES(1, 1), (2, 1), (2, 1), (3, 1), (5, 2), (5, 2), (6, 2), (7, 2);
+    CREATE TABLE t(a INT, b INT, c INT);
+    INSERT INTO t VALUES(1, 1, 1), (2, 1, 1), (2, 2, 1), (3, 1, 1), (5, 1, 2), (5, 1, 2), (6, 1, 2), (7, 1, 2);
     ```
 
     ```sql
-    SELECT APPROX_COUNT_DISTINCT(a) FROM t GROUP BY b;
+    SELECT APPROX_COUNT_DISTINCT(a, b) FROM t GROUP BY c;
     ```
 
     ```sql
-    +--------------------------+
-    | APPROX_COUNT_DISTINCT(a) |
-    +--------------------------+
-    |                        3 |
-    |                        3 |
-    +--------------------------+
-    1 row in set (0.00 sec)
+    +-----------------------------+
+    | approx_count_distinct(a, b) |
+    +-----------------------------+
+    |                           3 |
+    |                           4 |
+    +-----------------------------+
+    2 rows in set (0.00 sec)
     ```
 
 ## GROUP BY modifiers

From b29e9bf99ddc4db6a3724165e635428e93518837 Mon Sep 17 00:00:00 2001
From: xzhangxian1008
Date: Wed, 12 Feb 2025 10:55:43 +0800
Subject: [PATCH 3/6] Update functions-and-operators/aggregate-group-by-functions.md

Co-authored-by: Grace Cai
---
 functions-and-operators/aggregate-group-by-functions.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/functions-and-operators/aggregate-group-by-functions.md b/functions-and-operators/aggregate-group-by-functions.md
index bad370c69623a..01045c24e03b4 100644
--- a/functions-and-operators/aggregate-group-by-functions.md
+++ b/functions-and-operators/aggregate-group-by-functions.md
@@ -66,7 +66,7 @@ Except for the `GROUP_CONCAT()` and `APPROX_PERCENTILE()` functions, all the pre
 
 + `APPROX_COUNT_DISTINCT(expr, [expr...])`
 
-    The usage of this function is almost same with `COUNT(DISTINCT)` but returns approximate result. It uses `BJKST` algorithm and consumes less memory during simultaneous computation of cardinality for a large number of data sets whose cardinality has power law distribution. Moreover, it's very accurate for data sets with small cardinality and very efficient on CPU.
+    This function is similar to `COUNT(DISTINCT)` in counting the number of distinct values but returns an approximate result. It uses the `BJKST` algorithm, significantly reducing memory consumption when processing large datasets with a power-law distribution. Moreover, for low-cardinality data, this function provides high accuracy while maintaining efficient CPU utilization.
 
     The following example shows how to use this function:
 

From 0c0ab16671cc496a0af4b337cd8681b3f8ef1846 Mon Sep 17 00:00:00 2001
From: xzhangxian1008
Date: Wed, 12 Feb 2025 10:56:47 +0800
Subject: [PATCH 4/6] Update functions-and-operators/aggregate-group-by-functions.md

---
 functions-and-operators/aggregate-group-by-functions.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/functions-and-operators/aggregate-group-by-functions.md b/functions-and-operators/aggregate-group-by-functions.md
index 01045c24e03b4..366cfb7199846 100644
--- a/functions-and-operators/aggregate-group-by-functions.md
+++ b/functions-and-operators/aggregate-group-by-functions.md
@@ -80,7 +80,7 @@ Except for the `GROUP_CONCAT()` and `APPROX_PERCENTILE()` functions, all the pre
     SELECT APPROX_COUNT_DISTINCT(a, b) FROM t GROUP BY c;
     ```
 
-    ```sql
+    ```
     +-----------------------------+
     | approx_count_distinct(a, b) |
     +-----------------------------+

From fd2bedb68648683baf5cc747df839c49cb5b4caf Mon Sep 17 00:00:00 2001
From: xzhangxian1008
Date: Wed, 12 Feb 2025 11:24:06 +0800
Subject: [PATCH 5/6] tweaking

---
 functions-and-operators/aggregate-group-by-functions.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/functions-and-operators/aggregate-group-by-functions.md b/functions-and-operators/aggregate-group-by-functions.md
index 366cfb7199846..7eb458f2f6b5e 100644
--- a/functions-and-operators/aggregate-group-by-functions.md
+++ b/functions-and-operators/aggregate-group-by-functions.md
@@ -62,8 +62,6 @@ In addition, TiDB also provides the following aggregate functions:
     1 row in set (0.00 sec)
     ```
 
-Except for the `GROUP_CONCAT()` and `APPROX_PERCENTILE()` functions, all the preceding functions can serve as [Window functions](/functions-and-operators/window-functions.md).
-
 + `APPROX_COUNT_DISTINCT(expr, [expr...])`
 
     This function is similar to `COUNT(DISTINCT)` in counting the number of distinct values but returns an approximate result. It uses the `BJKST` algorithm, significantly reducing memory consumption when processing large datasets with a power-law distribution. Moreover, for low-cardinality data, this function provides high accuracy while maintaining efficient CPU utilization.
@@ -90,6 +88,8 @@ Except for the `GROUP_CONCAT()` and `APPROX_PERCENTILE()` functions, all the pre
     2 rows in set (0.00 sec)
     ```
 
+Except for the `GROUP_CONCAT()`, `APPROX_PERCENTILE()` and `APPROX_COUNT_DISTINCT` functions, all the preceding functions can serve as [Window functions](/functions-and-operators/window-functions.md).
+
 ## GROUP BY modifiers
 
 TiDB does not currently support `GROUP BY` modifiers such as `WITH ROLLUP`. We plan to add support in the future. See [TiDB #4250](https://github.com/pingcap/tidb/issues/4250).

From cdd7f0667707db1fb7cbf739880486e2af337a3f Mon Sep 17 00:00:00 2001
From: Grace Cai
Date: Wed, 12 Feb 2025 11:29:23 +0800
Subject: [PATCH 6/6] minor punctuation changes

---
 functions-and-operators/aggregate-group-by-functions.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/functions-and-operators/aggregate-group-by-functions.md b/functions-and-operators/aggregate-group-by-functions.md
index 7eb458f2f6b5e..cc442c227e9c6 100644
--- a/functions-and-operators/aggregate-group-by-functions.md
+++ b/functions-and-operators/aggregate-group-by-functions.md
@@ -88,7 +88,7 @@ In addition, TiDB also provides the following aggregate functions:
     2 rows in set (0.00 sec)
     ```
 
-Except for the `GROUP_CONCAT()`, `APPROX_PERCENTILE()` and `APPROX_COUNT_DISTINCT` functions, all the preceding functions can serve as [Window functions](/functions-and-operators/window-functions.md).
+Except for the `GROUP_CONCAT()`, `APPROX_PERCENTILE()`, and `APPROX_COUNT_DISTINCT` functions, all the preceding functions can serve as [Window functions](/functions-and-operators/window-functions.md).
 
 ## GROUP BY modifiers
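
To sanity-check the example introduced in PATCH 2/6, the exact per-group counts can be computed with `COUNT(DISTINCT a, b)` next to the approximate ones. The query below is a minimal sketch, assuming the three-column table `t` and the eight rows inserted in that patch. The `ORDER BY c` clause and the aliases `exact_cnt` and `approx_cnt` are additions for illustration and are not part of the patch.

```sql
-- Assumes the table and rows from the example in PATCH 2/6:
--   CREATE TABLE t(a INT, b INT, c INT);
--   INSERT INTO t VALUES(1, 1, 1), (2, 1, 1), (2, 2, 1), (3, 1, 1),
--                       (5, 1, 2), (5, 1, 2), (6, 1, 2), (7, 1, 2);
--
-- Exact distinct (a, b) pairs per group:
--   c = 1: (1,1), (2,1), (2,2), (3,1)  -> exact_cnt = 4
--   c = 2: (5,1), (6,1), (7,1)         -> exact_cnt = 3
SELECT c,
       COUNT(DISTINCT a, b)        AS exact_cnt,   -- exact reference value
       APPROX_COUNT_DISTINCT(a, b) AS approx_cnt   -- approximate value under test
FROM t
GROUP BY c
ORDER BY c;                                        -- makes the row order deterministic
```

Without `ORDER BY`, the row order of a `GROUP BY` result is not guaranteed, which may be why the documented output lists 3 before 4 even though the group `c = 1` holds four distinct pairs.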