Address review comments

facebookincubator · Oct 5, 2023 · a006521
1 parent 7183ab7
Showing 2 changed files with 24 additions and 17 deletions.
diff --git a/velox/core/PlanNode.h b/velox/core/PlanNode.h
@@ -795,6 +795,7 @@ class GroupIdNode : public PlanNode {
   /// @param groupIdName Name of the column that will contain the grouping set
   /// ID (a zero based integer).
   /// @param source Input plan node.
+  /// NOTE: THIS FUNCTION IS DEPRECATED. PLEASE DO NOT USE.
   GroupIdNode(
       PlanNodeId id,
       std::vector<std::vector<FieldAccessTypedExprPtr>> groupingSets,

diff --git a/velox/docs/develop/operators.rst b/velox/docs/develop/operators.rst
@@ -240,31 +240,35 @@ followed by the group ID column. The type of group ID column is BIGINT.
    * - groupIdName
      - The name for the group-id column that identifies the grouping set. Zero-based integer corresponding to the position of the grouping set in the 'groupingSets' list.
 
-To illustrate why GroupingSets should use output column names lets examine the following SQL query:
+GroupIdNode is typically used to compute GROUPING SETS, CUBE and ROLLUP.
+
+While grouping sets usually do not repeat the same grouping key column, there are some use cases where
+they might. To illustrate why, let's examine the following SQL query:
 
 .. code-block:: sql
 
-  select COUNT(orderkey), count(distinct orderkey) from orders;
+  SELECT count(orderkey), count(DISTINCT orderkey) FROM orders;
 
-In this query the user wants to compute aggregates on the same key, though with
+In this query the user wants to compute global aggregates using the same column, though with
 and without the DISTINCT clause. With a particular optimization strategy
 optimize.mixed-distinct-aggregations, Presto uses GroupIdNode to compute these.
 
-First, the optimizer duplicates every row assigning one copy to group 0 and another
-to group 1. This is achieved using the GroupIdNode with 2 grouping sets
+First, the optimizer creates a GroupIdNode to duplicate every row, assigning one copy
+to group 0 and another to group 1. This is achieved using the GroupIdNode with 2 grouping sets
 each using orderkey as a grouping key. In order to disambiguate the
 groups the orderkey column is aliased as a grouping key for one of the
 grouping sets.
 
-Lets say the orders table has 10 rows as follows:
+Let's say the orders table has 5 rows:
 
 .. code-block::
 
   orderkey
      1
      2
-     ...
-     10
+     2
+     3
+     4
 
 The GroupIdNode would transform this into:
 
@@ -273,25 +277,27 @@ The GroupIdNode would transform this into:
     orderkey   orderkey1   group_id
     1             null        0
     2             null        0
-    ...
-    10            null        0
+    2             null        0
+    3             null        0
+    4             null        0
     null           1          1
     null           2          1
-                  ...
-    null           10         1
+    null           2          1
+    null           3          1
+    null           4          1
 
 Then Presto plans an aggregation using (orderkey, group_id) and count(orderkey1).
 
-This results in the following 11 rows:
+This results in the following 5 rows:
 
 .. code-block::
 
     orderkey     group_id     count(orderkey1) as c
     1                0         null
     2                0         null
-    ...
-    10               0         null
-    null             1          10
+    3                0         null
+    4                0         null
+    null             1          5
 
 Then Presto plans a second aggregation with no keys and count(orderkey), arbitrary(c).
 Since both aggregations ignore nulls this correctly computes the number of
@@ -300,7 +306,7 @@ distinct orderkeys and the count of all orderkeys.
 .. code-block::
 
     count(orderkey)     arbitrary(c)
-    10                     10
+     4                     5
 
 
 HashJoinNode and MergeJoinNode
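
The arithmetic in the updated 5-row example can be checked with a small simulation. The following is a minimal Python sketch of the three steps the documentation describes (row duplication by the group ID step, the aggregation keyed on (orderkey, group_id), and the final global aggregation). It is not Velox or Presto code. The variable names are hypothetical, and it assumes, to mirror the tables in the docs, that the first aggregation reports null rather than 0 for groups with no non-null orderkey1 values.

```python
# orderkey values from the 5-row example in the docs (note the duplicate 2).
orders = [1, 2, 2, 3, 4]

# Step 1: group ID expansion. Every input row is emitted once per grouping
# set: group 0 keeps orderkey, group 1 keeps the aliased copy orderkey1.
expanded = [(k, None, 0) for k in orders] + [(None, k, 1) for k in orders]

# Step 2: aggregate by (orderkey, group_id), counting non-null orderkey1.
counts = {}
for orderkey, orderkey1, group_id in expanded:
    key = (orderkey, group_id)
    counts[key] = counts.get(key, 0) + (1 if orderkey1 is not None else 0)

# Assumption: show None instead of 0 when a group saw no non-null orderkey1,
# so the rows match the intermediate table in the docs.
first_agg = [(k, g, c if c > 0 else None) for (k, g), c in counts.items()]

# Step 3: global aggregation with no keys. count() skips nulls, and
# arbitrary() is modeled here as "the first non-null value".
count_orderkey = sum(1 for k, _, _ in first_agg if k is not None)
arbitrary_c = next(c for _, _, c in first_agg if c is not None)

print(count_orderkey, arbitrary_c)  # prints "4 5"
```

The first value is the number of distinct order keys and the second is the total row count, which agrees with the final table in the diff (4 and 5) and with the 5 intermediate rows.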