[FLINK-12173][table] Optimize SELECT DISTINCT #25752

yiyutian1 · 2024-12-05T22:10:01Z

What is the purpose of the change

This PR optimizes SELECT DISTINCT queries so that they are planned with Deduplicate instead of GroupAgg.

Brief change log

This PR implements a new Calcite rule that does the following:

Checks whether the user is running a SELECT DISTINCT query.
Originally, we wanted to convert the plan to StreamPhysicalDeduplicate instead of StreamPhysicalGroupAggregateRule in rowtime. After discussion in the OSS community, Lincoln decided to refactor the Deduplicate optimization to defer to StreamPhysicalRank for valid StreamExecDeduplicate node conversion, in order to avoid exceptions. (PR)
So instead, we optimize by converting the plan to FlinkLogicalRank instead of FlinkLogicalAggregate in rowtime, which has a similar effect to the original plan.
Modifies existing tests to reflect the new optimizer rule.
Because of this bug, https://issues.apache.org/jira/browse/FLINK-35792, we decided to avoid optimizing for this particular case.
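
To make the rewrite described in the change log concrete, here is a rough sketch of the two query shapes involved. It is an illustration under assumptions, not the rule's actual output: MyTable and the columns a, b, c come from the test touched in this PR, while the event-time column name `rowtime` and the alias `rn` are hypothetical.

```sql
-- Original query. Per the change log, this was previously planned
-- as a group aggregate (FlinkLogicalAggregate).
SELECT DISTINCT a, b, c
FROM MyTable;

-- Keep-first-row deduplication that returns the same distinct rows.
-- This is the query shape Flink plans as a rank (FlinkLogicalRank).
-- ASSUMPTION: `rowtime` is an event-time attribute on MyTable.
-- ASSUMPTION: `rn` is just an illustrative alias.
SELECT a, b, c
FROM (
  SELECT a, b, c,
         ROW_NUMBER() OVER (PARTITION BY a, b, c ORDER BY rowtime ASC) AS rn
  FROM MyTable
)
WHERE rn = 1;
```

Both queries emit one row per distinct (a, b, c) combination; the second form only needs to remember that a key has been seen, which is the effect the optimization is after.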

Verifying this change

Please make sure both new and modified tests in this PR follow the conventions for tests defined in our code quality guide.

This change is already covered by existing tests in the table/planner module. For example:
flink-table/flink-table-planner/src/test/resources/org/apache/flink/table/planner/plan/common/PartialInsertTest.xml

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
The serializers: (no)
The runtime per-record code paths (performance sensitive): (no)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
The S3 file system connector: (no)

Documentation

Does this pull request introduce a new feature? (no)

flinkbot · 2024-12-05T22:16:07Z

CI report:

1610130 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure: re-run the last Azure build

...ink-table-planner/src/main/scala/org/apache/flink/table/planner/delegation/PlannerBase.scala

...rg/apache/flink/table/planner/plan/optimize/program/FlinkChangelogModeInferenceProgram.scala

jnh5y · 2024-12-06T20:29:01Z

...flink/table/planner/plan/rules/physical/stream/StreamLogicalOptimizeSelectDistinctRule.scala

+ * e.g. {SELECT DISTINCT a, b, c;} will be converted to [[FlinkLogicalRank]] instead of
+ * [[FlinkLogicalAggregate]] in rowtime.
+ */
+class StreamLogicalOptimizeSelectDistinctRule


@yiyutian1 and I worked on this together and we started with a Scala example.

Given that we are working on migrating the Scala rules to Java, could you take a look at migrating this rule to Java?

yiyutian1 · Dec 6, 2024 (edited)

Thanks for the comment, Jim!
Are we migrating from Scala to Java for converters now too, or are we mainly doing it for builtInFunctions?

My impression is that for builtInFunctions we want that migration because auto-generated Java code is hard to maintain, but that doesn't seem to be a problem here.

yiyutian1 · Dec 9, 2024

@snuyanzin, could you provide some feedback here?
Could we try merging this ticket as is, so that the optimizer can be available soon, and then do the migration in a separate effort?

snuyanzin · Dec 9, 2024

It would be great to see green CI first.

yiyutian1 · Dec 10, 2024

The CI is green! Woohoo!
Could we get some feedback, @snuyanzin?
Many thanks.

...apache/flink/table/planner/plan/rules/physical/stream/StreamPhysicalGroupAggregateRule.scala

...er/src/test/scala/org/apache/flink/table/planner/plan/stream/sql/join/IntervalJoinTest.scala

yiyutian1 · 2024-12-06T22:47:30Z

@flinkbot run azure

yiyutian1 · 2024-12-07T00:36:17Z

@flinkbot run azure

yiyutian1 · 2024-12-08T05:14:26Z

@flinkbot run azure

jnh5y · 2024-12-09T14:43:33Z

@flinkbot run azure

yiyutian1 · 2024-12-09T21:42:16Z

@flinkbot run azure

jnh5y · 2024-12-09T22:39:51Z

@flinkbot run azure

yiyutian1 · 2024-12-10T15:08:28Z

Hi @xuyangzhong, @wuchong, @lincoln-lil, could I get your feedback on this PR?
I don't have access to add you as reviewers, so I am pinging you here. Your timely feedback will be deeply appreciated.

snuyanzin · 2024-12-12T12:10:17Z

@xuyangzhong I noticed you discussed some of these changes with @jnh5y in FLINK-35792.
It would be great to hear your thoughts here as well, especially about the use of a physical rule in the logical rule set (I'm still confused by this).

//cc @lincoln-lil (I think you might be interested in this as well)

snuyanzin · 2024-12-12T12:12:57Z

...flink/table/planner/plan/rules/physical/stream/StreamLogicalOptimizeSelectDistinctRule.scala

+ * e.g. {SELECT DISTINCT a, b, c;} will be converted to [[FlinkLogicalRank]] instead of
+ * [[FlinkLogicalAggregate]] in rowtime.
+ */
+class StreamLogicalOptimizeSelectDistinctRule


Ideally, it would be great to have it converted to Java in this PR.
I looked into some previous PRs and this is how it was handled: if a rule was new, it was requested to be written in Java;
if it was an old one, it was OK to convert it in a separate PR.

snuyanzin · 2024-12-12T12:14:26Z

...k-table-planner/src/test/scala/org/apache/flink/table/planner/plan/stream/sql/RankTest.scala

+        |SELECT DISTINCT a, b, c
+        |FROM MyTable


Do we have any IT case for that?

davidradl · 2024-12-13T14:48:23Z

Reviewed by Chi on 12/12/24. It appears that this PR is progressing healthily.

flinkbot added the component=TableSQL/Planner label Dec 5, 2024

yiyutian1 marked this pull request as ready for review December 6, 2024 19:39

jnh5y reviewed Dec 6, 2024

...ink-table-planner/src/main/scala/org/apache/flink/table/planner/delegation/PlannerBase.scala (outdated)
...rg/apache/flink/table/planner/plan/optimize/program/FlinkChangelogModeInferenceProgram.scala (outdated)
...apache/flink/table/planner/plan/rules/physical/stream/StreamPhysicalGroupAggregateRule.scala (outdated)
...er/src/test/scala/org/apache/flink/table/planner/plan/stream/sql/join/IntervalJoinTest.scala (outdated)

yiyutian1 force-pushed the flink12173 branch from 2781161 to 3a40c09 Compare December 6, 2024 23:13

[FLINK-12173][table] Optimize SELECT DISTINCT

1610130

yiyutian1 force-pushed the flink12173 branch from 3a40c09 to 1610130 Compare December 9, 2024 21:37

yiyutian1 requested a review from snuyanzin December 10, 2024 14:03

snuyanzin reviewed Dec 12, 2024
