Fix GroupId with aliased grouping key columns #6738

Closed

Collaborator

@aditi-pandit aditi-pandit commented Sep 26, 2023

GROUPING SETS can be specified with grouping keys that are alias columns of the same input column. This is typically used to compute multiple mixed aggregations on the same key.

e.g.
SELECT COUNT(orderkey), COUNT(DISTINCT orderkey) FROM orders;
or

SELECT lna, lnb, SUM(quantity) FROM (SELECT linenumber lna, linenumber lnb, CAST(quantity AS BIGINT) quantity FROM lineitem) GROUP BY GROUPING SETS ((lna, lnb), (lna), (lnb), ())

The Velox operator always assumes that:

  • An input column maps to a single output grouping key.

The Prestissimo code assumes that:

  • A GroupingSet is specified using input column names. The Presto coordinator, however, uses output column names for grouping sets, which allows the same input column to be used multiple times.

Together, these assumptions lead to incorrect results: in each grouping set, the values of a column and of its alias all appear in the same output column, so the aggregations are computed incorrectly.

Fixes prestodb/presto#20910 and prestodb/presto#20917
Prestissimo side fix prestodb/presto#20964 is also needed.
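
To make the failure mode concrete, here is a minimal, self-contained C++ sketch of the row expansion that a GroupId-style operator performs for the second query above. It is an illustration only: `Value`, `OutputRow`, and `expandRow` are hypothetical names and not part of the Velox API, and the pass-through of aggregation inputs such as quantity is omitted. Grouping keys are given as (output name, input column index) pairs and grouping sets refer to output names, so lna and lnb can both read from linenumber while landing in separate output columns.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration; not Velox's API.
struct Value {
  bool isNull = true;  // models SQL NULL
  int64_t value = 0;
};

struct OutputRow {
  std::vector<Value> keys;  // one slot per output grouping key
  int64_t groupId = 0;
};

// groupingKeys: (output name, input column index) pairs.
// groupingSets: each set lists OUTPUT names of grouping keys.
// Produces one output row per grouping set for a single input row.
std::vector<OutputRow> expandRow(
    const std::vector<int64_t>& inputRow,
    const std::vector<std::pair<std::string, std::size_t>>& groupingKeys,
    const std::vector<std::vector<std::string>>& groupingSets) {
  std::vector<OutputRow> out;
  for (std::size_t g = 0; g < groupingSets.size(); ++g) {
    const std::vector<std::string>& set = groupingSets[g];
    OutputRow row;
    row.keys.resize(groupingKeys.size());  // all keys start as NULL
    row.groupId = static_cast<int64_t>(g);
    for (std::size_t k = 0; k < groupingKeys.size(); ++k) {
      const std::string& outputName = groupingKeys[k].first;
      // Copy the input value only if this output key is in the set.
      if (std::find(set.begin(), set.end(), outputName) != set.end()) {
        row.keys[k].isNull = false;
        row.keys[k].value = inputRow[groupingKeys[k].second];
      }
    }
    out.push_back(std::move(row));
  }
  return out;
}
```

For an input row with linenumber = 7, the grouping set (lna) fills only the lna slot and leaves lnb as NULL, while (lnb) does the opposite. A mapping keyed by input column cannot tell these two sets apart.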

@netlify

netlify bot commented Sep 26, 2023

Deploy Preview for meta-velox canceled.

Latest commit: ea46371
Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/652586685c1dda0008ef7e7f

@facebook-github-bot facebook-github-bot added the CLA Signed label Sep 26, 2023
@aditi-pandit aditi-pandit changed the title from "Fix GROUPING SETS with multiple aliased grouping key columns" to "Fix GROUPING SETS with aliased grouping key columns" Sep 26, 2023
@aditi-pandit aditi-pandit changed the title from "Fix GROUPING SETS with aliased grouping key columns" to "Fix GroupId with aliased grouping key columns" Sep 26, 2023
Contributor

@mbasmanova mbasmanova left a comment

@aditi-pandit Aditi, thank you for investigating this issue and coming up with a fix. This is very confusing, so let's figure out how to make this clear for future readers. Let's update comments for GroupIdNode and documentation in https://facebookincubator.github.io/velox/develop/operators.html#groupidnode. I feel that a verbal explanation won't be sufficient, and I am wondering if we could add a few examples of GroupIdNode configurations along with sample input and output.

@aditi-pandit aditi-pandit force-pushed the group_bug branch 2 times, most recently from c788225 to 89209eb Compare September 28, 2023 20:15
@aditi-pandit
Collaborator Author

@mbasmanova : Thanks Masha.

Couple of points:

  • The GroupID operator API remains the same with this change. I clarified in operators.rst that Grouping keys are specified with output names.
  • What was modified was the GroupId API in PlanBuilder since it didn't allow any way to specify an alias column. It built grouping keys from groupingSets. I have now changed that API to specify grouping keys separately so alias columns can be added there. I have also updated its documentation.

In terms of fixes, there are two changes:
i) The Velox operator didn't take into account the case where a single input column is mapped to two output grouping keys. So I changed std::unordered_map<std::string, column_index_t> inputToOutputGroupingKeyMapping; to an output-to-input mapping of grouping key columns. This mapping is used further in the operator.
ii) The code on the Prestissimo side mapped output names to input names when filling GroupingSets. I changed that code to send the output names. That is the fix on the Prestissimo side.

Hope that clarifies the fixes. Please let me know if you have further questions.
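
The first change can be illustrated with a small sketch. The names below (`GroupingKeys`, `buildInputToOutput`, `buildOutputToInput`) are hypothetical and the types are simplified; only `column_index_t` and the idea of flipping the map direction come from the comment above, and `column_index_t` is assumed here to be a 32-bit unsigned integer. The actual member types in the operator may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Assumption for this sketch: a 32-bit unsigned column index.
using column_index_t = uint32_t;

// Grouping keys as (output name, input name) pairs. Hypothetical type.
using GroupingKeys = std::vector<std::pair<std::string, std::string>>;

// Old direction: input column name -> output channel.
// Two aliases of the same input collide on the same map key.
std::unordered_map<std::string, column_index_t> buildInputToOutput(
    const GroupingKeys& keys) {
  std::unordered_map<std::string, column_index_t> mapping;
  for (column_index_t i = 0; i < keys.size(); ++i) {
    mapping[keys[i].second] = i;  // a later alias overwrites an earlier one
  }
  return mapping;
}

// New direction: output name -> input column name.
// Output names are unique, so every alias keeps its own entry.
std::unordered_map<std::string, std::string> buildOutputToInput(
    const GroupingKeys& keys) {
  std::unordered_map<std::string, std::string> mapping;
  for (const auto& key : keys) {
    mapping[key.first] = key.second;
  }
  return mapping;
}
```

With keys (lna, linenumber) and (lnb, linenumber), the input-keyed map ends up with a single entry for linenumber, while the output-keyed map keeps one entry per alias.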

@aditi-pandit aditi-pandit force-pushed the group_bug branch 10 times, most recently from 130451d to 4ef2976 Compare October 4, 2023 04:56
@aditi-pandit
Collaborator Author

@mbasmanova : Changes made:

  • Add SQL example in operators.rst
  • Add new field groupingSetsColumns_ for using output column names in groupingSet. This will deprecate the original groupingSets_ variable that uses FieldAccessTypedExprPtr. I will remove this code after changing Prestissimo code to use the new field.
  • Change PlanBuilder to pick up ordering of grouping keys from GroupingKeys instead of the order of keys in groupingSets.

Contributor

@mbasmanova mbasmanova left a comment

@aditi-pandit Thank you for iterating on this. Some questions. I'll take a closer look a bit later.

@aditi-pandit aditi-pandit force-pushed the group_bug branch 3 times, most recently from 9ff5290 to 33e6bc9 Compare October 5, 2023 22:01
@aditi-pandit
Collaborator Author

@mbasmanova : Have fixed the docs and made the backward-compatibility code simpler. PTAL.

Contributor

@mbasmanova mbasmanova left a comment

@aditi-pandit Thank you for iterating on this PR. Looks great % a few nits.

auto plan = PlanBuilder()
                .values({data})
                .groupId(
                    {"o_key", "o_key as o_key_1"},
Contributor

I still find this use case confusing. It is just not clear why anything would use a plan like this. I wish we could identify a compelling use case.

Collaborator Author

I felt the query documented with its Presto plan was a better example, but I wasn't able to translate the IF expressions to equivalent CASE expressions. The expression "case group_id when 1 then 0 else null end" could not compile, since null is of UNKNOWN type and CASE requires all when/then expressions to evaluate to the same type.

Contributor

try "null::bigint"

Collaborator Author

I will send out a follow-up PR.


In this query the user wants to compute aggregates on the same key, though with
and without the DISTINCT clause. With a particular optimization strategy
optimize.mixed-distinct-aggregations, Presto uses GroupIdNode to compute these.
Contributor

Can we link here to Presto's documentation about this property?

Contributor

Also this seems useful: trinodb/trino#15927

Collaborator Author

Thanks for the link. Have added it.

Contributor

@mbasmanova mbasmanova left a comment

Looks great. Thanks.

Any chance you could add a link for "optimize.mixed-distinct-aggregations"?

https://www.qubole.com/blog/presto-optimizes-aggregations-over-distinct-values

@facebook-github-bot
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mbasmanova
Contributor

mbasmanova commented Oct 6, 2023

@aditi-pandit I'm seeing e2e test failures:

[ERROR] com.facebook.presto.nativeworker.TestPrestoNativeAggregations.testGroupingSets  Time elapsed: 0.327 s  <<< FAILURE!
java.lang.AssertionError: Execution of 'actual' query failed: SELECT orderstatus, orderpriority, count(1), min(orderkey) FROM orders GROUP BY GROUPING SETS ((orderstatus), (orderpriority))
	at org.testng.Assert.fail(Assert.java:98)
	at com.facebook.presto.tests.QueryAssertions.assertQuery(QueryAssertions.java:178)
	at com.facebook.presto.tests.QueryAssertions.assertQuery(QueryAssertions.java:106)
	at com.facebook.presto.tests.AbstractTestQueryFramework.assertQuery(AbstractTestQueryFramework.java:152)
	at com.facebook.presto.tests.AbstractTestQueryFramework.assertQuery(AbstractTestQueryFramework.java:147)
	at ...
Caused by: java.lang.RuntimeException:  Field not found: orderstatus. Available fields are: orderpriority$gid, orderstatus$gid, expr, orderkey, groupid.
	at com.facebook.presto.tests.AbstractTestingPrestoClient.execute(AbstractTestingPrestoClient.java:124)
	at com.facebook.presto.tests.DistributedQueryRunner.execute(DistributedQueryRunner.java:733)
	at com.facebook.presto.tests.DistributedQueryRunner.execute(DistributedQueryRunner.java:701)
	at com.facebook.presto.tests.QueryAssertions.assertQuery(QueryAssertions.java:175)
	... 19 more
Caused by: VeloxUserError:  Field not found: orderstatus. Available fields are: orderpriority$gid, orderstatus$gid, expr, orderkey, groupid.

@aditi-pandit
Collaborator Author

@mbasmanova : Taking a look.

@aditi-pandit
Collaborator Author

aditi-pandit commented Oct 6, 2023

@mbasmanova : Have fixed the issue. Prestissimo passes groupingSets with input column names, so the backward-compatible logic of GroupIdNode construction has to translate them to output column names. I had missed that.

Note: The original change needs the follow-up prestodb/presto#20964 as well. That PR changes the Prestissimo code to use output names for groupingSet columns.

Thanks for your patience.
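
A sketch of what such a backward-compatible translation could look like, assuming grouping keys are available as (output name, input name) pairs. `toOutputNames` and `NameList` are hypothetical names for illustration, not the code in this PR; the column names are taken from the error message above. Rejecting aliased inputs is an assumption of this sketch: when one input column feeds several output keys, input names alone cannot say which output is meant.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using NameList = std::vector<std::string>;

// Hypothetical helper: rewrites legacy grouping sets, which use INPUT
// column names, into grouping sets that use OUTPUT column names.
// groupingKeys holds (output name, input name) pairs.
std::vector<NameList> toOutputNames(
    const std::vector<std::pair<std::string, std::string>>& groupingKeys,
    const std::vector<NameList>& legacyGroupingSets) {
  std::unordered_map<std::string, std::string> inputToOutput;
  for (const auto& key : groupingKeys) {
    // Assumption of this sketch: the legacy form only works when each
    // input column backs exactly one output key.
    if (!inputToOutput.emplace(key.second, key.first).second) {
      throw std::invalid_argument(
          "Aliased grouping key requires output names: " + key.second);
    }
  }

  std::vector<NameList> result;
  for (const auto& set : legacyGroupingSets) {
    NameList translated;
    for (const auto& inputName : set) {
      // at() throws std::out_of_range for a name that is not a grouping key.
      translated.push_back(inputToOutput.at(inputName));
    }
    result.push_back(std::move(translated));
  }
  return result;
}
```

For the failing test query, the legacy sets (orderstatus) and (orderpriority) would be rewritten to (orderstatus$gid) and (orderpriority$gid), which are names that exist in the GroupId output.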

@aditi-pandit aditi-pandit force-pushed the group_bug branch 2 times, most recently from 5fbb618 to 80a4f8f Compare October 6, 2023 22:03
@facebook-github-bot
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mbasmanova
Contributor

@aditi-pandit Aditi, thank you for the quick fix. The changes make sense to me. I'm going to try to merge now.

@aditi-pandit
Collaborator Author

Thanks @mbasmanova. Appreciate the help.

GROUPING SETS can be specified with grouping keys that are
alias columns of the same input column. This is typically used to
compute multiple mixed aggregations on the same key.

The Velox operator always assumes that only input columns
are specified as grouping keys, whereas Presto sends plan
fragments that correctly specify the output column name
for such cases.

Due to Velox's assumption, in each resulting grouping set
the values of the column and of its alias all appear in the
same output column, leading to incorrect results.
@ethanyzhang

@aditi-pandit @mbasmanova Is this merged?

@facebook-github-bot
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@mbasmanova merged this pull request in b5cf638.

@conbench-facebook

Conbench analyzed the 1 benchmark run on commit b5cf638b.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

ericyuliu pushed a commit to ericyuliu/velox that referenced this pull request Oct 12, 2023
Pull Request resolved: facebookincubator#6738

Reviewed By: amitkdutta

Differential Revision: D49977260

Pulled By: mbasmanova

fbshipit-source-id: 7c3f96cff6d285bf9f9e2f944640565125eea6d3
@aditi-pandit aditi-pandit deleted the group_bug branch June 15, 2024 20:25
Labels
CLA Signed, Merged
Development

Successfully merging this pull request may close these issues.

[Native] - sum(distinct ) gives incorrect results
4 participants