From a006521288a2896694b72d3f3fa20ab17643be36 Mon Sep 17 00:00:00 2001
From: aditi-pandit
Date: Thu, 5 Oct 2023 16:15:41 -0700
Subject: [PATCH] Address review comments

---
 velox/core/PlanNode.h | 1 +
 velox/docs/develop/operators.rst | 40 ++++++++++++++++++--------------
 2 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/velox/core/PlanNode.h b/velox/core/PlanNode.h
index 78b3836953b50..88c45c4a063a2 100644
--- a/velox/core/PlanNode.h
+++ b/velox/core/PlanNode.h
@@ -795,6 +795,7 @@ class GroupIdNode : public PlanNode {
   /// @param groupIdName Name of the column that will contain the grouping set
   /// ID (a zero based integer).
   /// @param source Input plan node.
+  /// NOTE: THIS FUNCTION IS DEPRECATED. PLEASE DO NOT USE.
   GroupIdNode(
       PlanNodeId id,
       std::vector> groupingSets,
diff --git a/velox/docs/develop/operators.rst b/velox/docs/develop/operators.rst
index 0ad12d3e0b8a0..c7ff656e4ef12 100644
--- a/velox/docs/develop/operators.rst
+++ b/velox/docs/develop/operators.rst
@@ -240,31 +240,35 @@ followed by the group ID column. The type of group ID column is BIGINT.
    * - groupIdName
      - The name for the group-id column that identifies the grouping set. Zero-based integer corresponding to the position of the grouping set in the 'groupingSets' list.

-To illustrate why GroupingSets should use output column names lets examine the following SQL query:
+GroupIdNode is typically used to compute GROUPING SETS, CUBE and ROLLUP.
+
+While usually GroupingSets do not repeat with the same grouping key column, there are some use-cases where
+they might. To illustrate why GroupingSets might do so lets examine the following SQL query:

 .. code-block:: sql

-  select COUNT(orderkey), count(distinct orderkey) from orders;
+  SELECT count(orderkey), count(DISTINCT orderkey) FROM orders;

-In this query the user wants to compute aggregates on the same key, though with
+In this query the user wants to compute global aggregates using the same column, though with
 and without the DISTINCT clause. With a particular optimization strategy
 optimize.mixed-distinct-aggregations, Presto uses GroupIdNode to compute these.

-First, the optimizer duplicates every row assigning one copy to group 0 and another
-to group 1. This is achieved using the GroupIdNode with 2 grouping sets
+First, the optimizer creates a GroupIdNode to duplicate every row assigning one copy
+to group 0 and another to group 1. This is achieved using the GroupIdNode with 2 grouping sets
 each using orderkey as a grouping key. In order to disambiguate the
 groups the orderkey column is aliased as a grouping key for one of the
 grouping sets.

-Lets say the orders table has 10 rows as follows:
+Lets say the orders table has 5 rows:

 .. code-block::

   orderkey
   1
   2
-  ...
-  10
+  2
+  3
+  4

 The GroupIdNode would transform this into:

@@ -273,25 +277,27 @@ The GroupIdNode would transform this into:
   orderkey orderkey1 group_id
   1 null 0
   2 null 0
-  ...
-  10 null 0
+  2 null 0
+  3 null 0
+  4 null 0
   null 1 1
   null 2 1
-  ...
-  null 10 1
+  null 2 1
+  null 3 1
+  null 4 1

 Then Presto plans an aggregation using (orderkey, group_id) and count(orderkey1).
-This results in the following 11 rows:
+This results in the following 5 rows:

 .. code-block::

   orderkey group_id count(orderkey1) as c
   1 0 null
   2 0 null
-  ...
-  10 0 null
-  null 1 10
+  3 0 null
+  4 0 null
+  null 1 5

 Then Presto plans a second aggregation with no keys and count(orderkey), arbitrary(c).
 Since both aggregations ignore nulls this correctly computes the number of
@@ -300,7 +306,7 @@ distinct orderkeys and the count of all orderkeys.
 .. code-block::

   count(orderkey) arbitrary(c)
-  10 10
+  4 5

 HashJoinNode and MergeJoinNode
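
Review aid, not part of the patch: the three steps described in the operators.rst hunk (GroupId expansion, aggregation on (orderkey, group_id), then a keyless aggregation) can be checked with a small simulation. This is a minimal Python sketch, not Velox or Presto code. The function names are hypothetical, and reporting c as null for group 0 is an assumption made only to match the table in the documentation.

```python
# Illustrative simulation of the documented plan; every name here is hypothetical.

def expand_with_group_id(orderkeys):
    """Emit one copy of each input row per grouping set.

    Rows are (orderkey, orderkey1, group_id). Grouping set 0 keeps the key in
    `orderkey`; grouping set 1 keeps it in the alias `orderkey1`. The column
    that is not part of a row's grouping set is None (null).
    """
    rows = [(k, None, 0) for k in orderkeys]
    rows += [(None, k, 1) for k in orderkeys]
    return rows


def first_aggregation(rows):
    """Group by (orderkey, group_id) and compute count(orderkey1) as c."""
    counts = {}
    for orderkey, orderkey1, gid in rows:
        key = (orderkey, gid)
        counts.setdefault(key, 0)
        if orderkey1 is not None:  # count() ignores nulls
            counts[key] += 1
    # Assumption: groups that saw no non-null orderkey1 report c as null,
    # which is what the documentation's table shows for group 0.
    return [(k, gid, c if c > 0 else None) for (k, gid), c in counts.items()]


def second_aggregation(rows):
    """No grouping keys: count(orderkey) and arbitrary(c), both ignoring nulls."""
    distinct_count = sum(1 for orderkey, _, _ in rows if orderkey is not None)
    non_null_c = [c for _, _, c in rows if c is not None]
    total_count = non_null_c[0] if non_null_c else None
    return distinct_count, total_count


orderkeys = [1, 2, 2, 3, 4]  # the 5-row orders table from the documentation
expanded = expand_with_group_id(orderkeys)
partial = first_aggregation(expanded)
result = second_aggregation(partial)
print(result)  # (4, 5)
```

The sketch also shows why the null in group 0 matters: if c were 0 there instead of null, arbitrary(c) would no longer be guaranteed to pick the total count.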