-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix GroupId with aliased grouping key columns #6738
Conversation
✅ Deploy Preview for meta-velox canceled.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aditi-pandit Aditi, thank you for investigating this issue and coming up with a fix. This is very confusing, so let's figure out how to make this clear for future readers. Let's update comments for GroupIdNode and documentation in https://facebookincubator.github.io/velox/develop/operators.html#groupidnode. I feel that verbal explanation won't be sufficient and wondering if we could add a few examples of GroupNodeId configurations along with sample input and output.
c788225
to
89209eb
Compare
@mbasmanova : Thanks Masha. Couple of points:
In terms of fixes there are 2 changes: Hope that clarifies the fixes. Please let me know if you have further questions. |
130451d
to
4ef2976
Compare
@mbasmanova : Changes made:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aditi-pandit Thank you for iterating on this. Some questions. I'll take a closer look a bit later.
9ff5290
to
33e6bc9
Compare
@mbasmanova : Have fixed the docs and make the backward compatibility code simpler. PTAL. |
33e6bc9
to
7183ab7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aditi-pandit Thank you for iterating on this PR. Looks great % a few nits.
auto plan = PlanBuilder() | ||
.values({data}) | ||
.groupId( | ||
{"o_key", "o_key as o_key_1"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still find this use case confusing. It is just not clear why would anything use a plan like this. I wish we could identify a compelling use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I felt the query documented with its Presto plan was a better example. But I wasn't able to translate the IF expressions to equivalent CASE expressions. case group_id when 1 then 0 else null end
could not compile since null is UNKNOWN type and CASE required all when/then expressions to evaluate to the same type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try "null::bigint"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will send out a follow up PR.
velox/docs/develop/operators.rst
Outdated
|
||
In this query the user wants to compute aggregates on the same key, though with | ||
and without the DISTINCT clause. With a particular optimization strategy | ||
optimize.mixed-distinct-aggregations, Presto uses GroupIdNode to compute these. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we link here to Presto's documentation about this property?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just found https://prestodb.io/docs/current/release/release-0.156.html?highlight=mixed%20distinct%20aggregations. Do you want to link this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this post which seems useful to link to: https://www.qubole.com/blog/presto-optimizes-aggregations-over-distinct-values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this seems useful: trinodb/trino#15927
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the link. Have added it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Thanks.
Any chance you could add a link for "optimize.mixed-distinct-aggregations"?
https://www.qubole.com/blog/presto-optimizes-aggregations-over-distinct-values
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
a006521
to
6c60143
Compare
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
6c60143
to
bdc4e82
Compare
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@aditi-pandit I'm seeing e2e test failures:
|
@mbasmanova : Taking a look. |
bdc4e82
to
20ad799
Compare
@mbasmanova : Have fixed the issue. Prestissimo passes groupingSets with input column names so I have to fix this to the output column name in the backward compatible logic of GroupIdNode construction. I had missed that. Note : The original change needs the follow up prestodb/presto#20964 as well Thanks for your patience. |
5fbb618
to
80a4f8f
Compare
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@aditi-pandit Aditi, thank you for the quick fix. The changes make sense to me. I'm going to try to merge now. |
Thanks @mbasmanova. Appreciate the help. |
80a4f8f
to
e12b2d7
Compare
GROUPING SETS can be specified with grouping keys that are alias columns of the same input column. This is typically used to compute multiple mixed aggregations on the same key. The Velox operator always assumes that only input columns are specified as grouping keys. Whereas Presto does send plan fragments correctly specifying the output column name for such cases. Due to Velox's assumptions in each result grouping set, the column or its alias column values all appear in the same output column leading to incorrect results.
e12b2d7
to
ea46371
Compare
@aditi-pandit @mbasmanova Is this merged? |
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@mbasmanova merged this pull request in b5cf638. |
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
Summary: GROUPING SETS can be specified with grouping keys that are alias columns of the same input column. This is typically used to compute multiple mixed aggregations on the same key. e.g. `select COUNT(orderkey), count(distinct orderkey) from orders;` or `SELECT lna, lnb, SUM(quantity) FROM (SELECT linenumber lna, linenumber lnb, CAST(quantity AS BIGINT) quantity FROM lineitem) GROUP BY GROUPING SETS ((lna, lnb), (lna), (lnb), ()) ` The Velox operator always assumes that : - An input column maps to a single output grouping key. The Prestissimo code assumes that - A GroupingSet is specified using input column names. Presto co-ordinator uses output column names for Grouping Sets allowing for multiple uses of the same input column. Velox/Presto behavior lead to incorrect results. In each grouping set the column or its alias column values all appear in the same output column leading to wrong computations. Fixes prestodb/presto#20910 and prestodb/presto#20917 Prestissimo side fix prestodb/presto#20964 is also needed. Pull Request resolved: facebookincubator#6738 Reviewed By: amitkdutta Differential Revision: D49977260 Pulled By: mbasmanova fbshipit-source-id: 7c3f96cff6d285bf9f9e2f944640565125eea6d3
GROUPING SETS can be specified with grouping keys that are alias columns of the same input column. This is typically used to compute multiple mixed aggregations on the same key.
e.g.
select COUNT(orderkey), count(distinct orderkey) from orders;
or
SELECT lna, lnb, SUM(quantity) FROM (SELECT linenumber lna, linenumber lnb, CAST(quantity AS BIGINT) quantity FROM lineitem) GROUP BY GROUPING SETS ((lna, lnb), (lna), (lnb), ())
The Velox operator always assumes that :
The Prestissimo code assumes that
Velox/Presto behavior lead to incorrect results. In each grouping set the column or its alias column values all appear in the same output column leading to wrong computations.
Fixes prestodb/presto#20910 and prestodb/presto#20917
Prestissimo side fix prestodb/presto#20964 is also needed.