Address review comments
aditi-pandit committed Oct 5, 2023
1 parent 7183ab7 commit a006521
Showing 2 changed files with 24 additions and 17 deletions.
1 change: 1 addition & 0 deletions velox/core/PlanNode.h
@@ -795,6 +795,7 @@ class GroupIdNode : public PlanNode {
 /// @param groupIdName Name of the column that will contain the grouping set
 /// ID (a zero based integer).
 /// @param source Input plan node.
+/// NOTE: THIS FUNCTION IS DEPRECATED. PLEASE DO NOT USE.
 GroupIdNode(
 PlanNodeId id,
 std::vector<std::vector<FieldAccessTypedExprPtr>> groupingSets,
40 changes: 23 additions & 17 deletions velox/docs/develop/operators.rst
@@ -240,31 +240,35 @@ followed by the group ID column. The type of group ID column is BIGINT.
 * - groupIdName
 - The name for the group-id column that identifies the grouping set. Zero-based integer corresponding to the position of the grouping set in the 'groupingSets' list.

-To illustrate why GroupingSets should use output column names lets examine the following SQL query:
+GroupIdNode is typically used to compute GROUPING SETS, CUBE and ROLLUP.
+
+While usually GroupingSets do not repeat with the same grouping key column, there are some use-cases where
+they might. To illustrate why GroupingSets might do so lets examine the following SQL query:

 .. code-block:: sql
-select COUNT(orderkey), count(distinct orderkey) from orders;
+SELECT count(orderkey), count(DISTINCT orderkey) FROM orders;
-In this query the user wants to compute aggregates on the same key, though with
+In this query the user wants to compute global aggregates using the same column, though with
 and without the DISTINCT clause. With a particular optimization strategy
 optimize.mixed-distinct-aggregations, Presto uses GroupIdNode to compute these.

-First, the optimizer duplicates every row assigning one copy to group 0 and another
-to group 1. This is achieved using the GroupIdNode with 2 grouping sets
+First, the optimizer creates a GroupIdNode to duplicate every row assigning one copy
+to group 0 and another to group 1. This is achieved using the GroupIdNode with 2 grouping sets
 each using orderkey as a grouping key. In order to disambiguate the
 groups the orderkey column is aliased as a grouping key for one of the
 grouping sets.

-Lets say the orders table has 10 rows as follows:
+Lets say the orders table has 5 rows:

 .. code-block::
 orderkey
 1
 2
-...
-10
+2
+3
+4
 The GroupIdNode would transform this into:

@@ -273,25 +277,27 @@ The GroupIdNode would transform this into:
 orderkey orderkey1 group_id
 1 null 0
 2 null 0
-...
-10 null 0
+2 null 0
+3 null 0
+4 null 0
 null 1 1
 null 2 1
-...
-null 10 1
+null 2 1
+null 3 1
+null 4 1
 Then Presto plans an aggregation using (orderkey, group_id) and count(orderkey1).

-This results in the following 11 rows:
+This results in the following 5 rows:

 .. code-block::
 orderkey group_id count(orderkey1) as c
 1 0 null
 2 0 null
-...
-10 0 null
-null 1 10
+3 0 null
+4 0 null
+null 1 5
 Then Presto plans a second aggregation with no keys and count(orderkey), arbitrary(c).
 Since both aggregations ignore nulls this correctly computes the number of
@@ -300,7 +306,7 @@ distinct orderkeys and the count of all orderkeys.
 .. code-block::
 count(orderkey) arbitrary(c)
-10 10
+4 5
 HashJoinNode and MergeJoinNode
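
Note on the revised example: the three steps described in the operators.rst change (row duplication by GroupIdNode, aggregation on (orderkey, group_id), then a final aggregation with no keys) can be checked with a small simulation. This is only an illustrative sketch in plain Python, not Velox or Presto code. The function names `group_id`, `first_aggregation` and `second_aggregation` are hypothetical, and reporting `None` (null) for `c` in group 0 is an assumption that simply mirrors the table in the diff.

```python
# Sample input from the revised documentation: five rows, orderkey 2 repeated.
orders = [1, 2, 2, 3, 4]


def group_id(rows):
    """Duplicate every row once per grouping set.

    Output rows are (orderkey, orderkey1, group_id). Grouping set 0 keys on
    orderkey; grouping set 1 keys on the alias orderkey1.
    """
    out = [(k, None, 0) for k in rows]
    out += [(None, k, 1) for k in rows]
    return out


def first_aggregation(rows):
    """Group by (orderkey, group_id) and compute c = count(orderkey1).

    Assumption: c stays None when a group has no non-null orderkey1 values,
    matching the table shown in the documentation.
    """
    counts = {}
    for orderkey, orderkey1, gid in rows:
        key = (orderkey, gid)
        counts.setdefault(key, None)
        if orderkey1 is not None:
            counts[key] = (counts[key] or 0) + 1
    return [(orderkey, gid, c) for (orderkey, gid), c in counts.items()]


def second_aggregation(rows):
    """Aggregate with no keys: count(orderkey) and arbitrary(c), ignoring nulls."""
    distinct_count = sum(1 for orderkey, _, _ in rows if orderkey is not None)
    total_count = next((c for _, _, c in rows if c is not None), None)
    return distinct_count, total_count


duplicated = group_id(orders)
partial = first_aggregation(duplicated)
result = second_aggregation(partial)
print(len(duplicated), len(partial), result)
```

With the five-row input, the duplicated relation has 10 rows, the first aggregation yields 5 rows, and the final pair is 4 distinct orderkeys and 5 orderkeys in total, which agrees with the updated tables in the diff.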
