Optimize DISTINCT, ORDER BY and DISTINCT ON when Aggregation without Group By. #685

avamingli · 2024-10-24T13:28:16Z

For a query that has aggregation but no GROUP BY clause, the DISTINCT/DISTINCT ON/ORDER BY clause can be removed, because at most one row is returned.
There is no need to do a unique or sort step.
This simplifies the plan, and the planner processes fewer expressions, such as Aggref nodes.
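
The one-row claim can be checked outside Cloudberry. The following is a minimal sketch that uses SQLite through Python's standard sqlite3 module as a stand-in engine; the sample rows are made up for illustration, and only the table and column names come from the examples in this PR. It shows that adding DISTINCT and ORDER BY to an aggregate query without GROUP BY leaves the result unchanged, even on an empty table.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("create table t_distinct_sort (a int, b int, c int)")
conn.executemany(
    "insert into t_distinct_sort values (?, ?, ?)",
    [(1, 10, 100), (2, 20, 200), (2, 20, 200), (3, 30, 300)],  # made-up rows
)

# Aggregation without GROUP BY: a single result row.
plain = conn.execute(
    "select count(a), sum(b) from t_distinct_sort"
).fetchall()

# Same query with DISTINCT and ORDER BY added.
decorated = conn.execute(
    "select distinct count(a), sum(b) from t_distinct_sort "
    "order by sum(b), count(a)"
).fetchall()

# An empty table still yields one row of aggregate results.
conn.execute("create table t_empty (a int, b int, c int)")
empty = conn.execute(
    "select distinct count(a), sum(b) from t_empty order by sum(b), count(a)"
).fetchall()

print(plain, decorated, empty)
```

Because there is only one row to begin with, the unique and sort steps have nothing to do, which is the redundancy this PR removes from the plan.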

DISTINCT

explain(verbose, costs off)
select distinct count(a), sum(b) from t_distinct_sort ;
                               QUERY PLAN
------------------------------------------------------------------------
 Unique
   Output: (count(a)), (sum(b))
   Group Key: (count(a)), (sum(b))
   ->  Sort
         Output: (count(a)), (sum(b))
         Sort Key: (count(t_distinct_sort.a)), (sum(t_distinct_sort.b))
         ->  Finalize Aggregate
               Output: count(a), sum(b)
               ->  Gather Motion 3:1  (slice1; segments: 3)
                     Output: (PARTIAL count(a)), (PARTIAL sum(b))
                     ->  Partial Aggregate
                           Output: PARTIAL count(a), PARTIAL sum(b)
                           ->  Seq Scan on public.t_distinct_sort
                                 Output: a, b, c
 Settings: optimizer = 'off'
 Optimizer: Postgres query optimizer
(16 rows)

After this commit:

explain(verbose, costs off)
select distinct count(a), sum(b) from t_distinct_sort ;
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate
   Output: count(a), sum(b)
   ->  Gather Motion 3:1  (slice1; segments: 3)
         Output: (PARTIAL count(a)), (PARTIAL sum(b))
         ->  Partial Aggregate
               Output: PARTIAL count(a), PARTIAL sum(b)
               ->  Seq Scan on public.t_distinct_sort
                     Output: a, b, c
 Optimizer: Postgres query optimizer
(10 rows)

DISTINCT ON and ORDER BY

select distinct on(count(b), count(c)) count(a), sum(b) from t_distinct_sort order by count(c);
                           QUERY PLAN
--------------------------------------------------------------------
 Unique
   Output: (count(a)), (sum(b)), (count(c)), (count(b))
   Group Key: (count(c)), (count(b))
   ->  Sort
         Output: (count(a)), (sum(b)), (count(c)), (count(b))
         Sort Key: (count(t_distinct_sort.c)), (count(t_distinct_sort.b))
         ->  Finalize Aggregate
               Output: count(a), sum(b), count(c), count(b)
               ->  Gather Motion 3:1  (slice1; segments: 3)
                     Output: (PARTIAL count(a)), (PARTIAL sum(b)), (PARTIAL count(c)), (PARTIAL count(b))
                     ->  Partial Aggregate
                           Output: PARTIAL count(a), PARTIAL sum(b), PARTIAL count(c), PARTIAL count(b)
                           ->  Seq Scan on public.t_distinct_sort
                                 Output: a, b, c

After this commit:

select distinct on(count(b), count(c)) count(a), sum(b) from t_distinct_sort order by count(c);
                      QUERY PLAN
--------------------------------------------------------
 Finalize Aggregate
   Output: count(a), sum(b)
   ->  Gather Motion 3:1  (slice1; segments: 3)
         Output: (PARTIAL count(a)), (PARTIAL sum(b))
         ->  Partial Aggregate
               Output: PARTIAL count(a), PARTIAL sum(b)
               ->  Seq Scan on public.t_distinct_sort
                     Output: a, b, c
 Optimizer: Postgres query optimizer

ORDER BY

explain(verbose, costs off)
select count(a), sum(b) from t_distinct_sort order by sum(a), count(c);
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Sort
   Output: (count(a)), (sum(b)), (sum(a)), (count(c))
   Sort Key: (sum(t_distinct_sort.a)), (count(t_distinct_sort.c))
   ->  Finalize Aggregate
         Output: count(a), sum(b), sum(a), count(c)
         ->  Gather Motion 3:1  (slice1; segments: 3)
               Output: (PARTIAL count(a)), (PARTIAL sum(b)), (PARTIAL sum(a)), (PARTIAL count(c))
               ->  Partial Aggregate
                     Output: PARTIAL count(a), PARTIAL sum(b), PARTIAL sum(a), PARTIAL count(c)
                     ->  Seq Scan on public.t_distinct_sort
                           Output: a, b, c
 Settings: optimizer = 'off'
 Optimizer: Postgres query optimizer
(13 rows)

After this commit:

explain(verbose, costs off)
select count(a), sum(b) from t_distinct_sort order by sum(a), count(c);
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate
   Output: count(a), sum(b)
   ->  Gather Motion 3:1  (slice1; segments: 3)
         Output: (PARTIAL count(a)), (PARTIAL sum(b))
         ->  Partial Aggregate
               Output: PARTIAL count(a), PARTIAL sum(b)
               ->  Seq Scan on public.t_distinct_sort
                     Output: a, b, c
 Optimizer: Postgres query optimizer
(10 rows)

DISTINCT and ORDER BY

select distinct count(a), sum(b) from t_distinct_sort order by sum(b), count(a);
                               QUERY PLAN
------------------------------------------------------------------------
 Unique
   Output: (count(a)), (sum(b))
   Group Key: (sum(b)), (count(a))
   ->  Sort
         Output: (count(a)), (sum(b))
         Sort Key: (sum(t_distinct_sort.b)), (count(t_distinct_sort.a))
         ->  Finalize Aggregate
               Output: count(a), sum(b)
               ->  Gather Motion 3:1  (slice1; segments: 3)
                     Output: (PARTIAL count(a)), (PARTIAL sum(b))
                     ->  Partial Aggregate
                           Output: PARTIAL count(a), PARTIAL sum(b)
                           ->  Seq Scan on public.t_distinct_sort
                                 Output: a, b, c
 Settings: optimizer = 'off'
 Optimizer: Postgres query optimizer
(16 rows)

After this commit:

select distinct count(a), sum(b) from t_distinct_sort order by sum(b), count(a);
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate
   Output: count(a), sum(b)
   ->  Gather Motion 3:1  (slice1; segments: 3)
         Output: (PARTIAL count(a)), (PARTIAL sum(b))
         ->  Partial Aggregate
               Output: PARTIAL count(a), PARTIAL sum(b)
               ->  Seq Scan on public.t_distinct_sort
                     Output: a, b, c
 Optimizer: Postgres query optimizer
(10 rows)

Authored-by: Zhang Mingli [email protected]

avamingli · 2024-10-25T01:42:54Z

Many plan diffs, will fix later.

avamingli · 2024-11-04T07:04:20Z

For a query that has aggregation but no GROUP BY clause, the DISTINCT/DISTINCT ON/ORDER BY clause could be removed, as at most one row would be returned.

A set-returning function (SRF) in the target list breaks this assumption:

 select count(*), generate_series(1, 4) from t1;
 count | generate_series
-------+-----------------
     3 |               1
     3 |               2
     3 |               3
     3 |               4
(4 rows)

Fixed it, and the Postgres WITH ORDINALITY case as well.
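
The condition implied by this comment can be sketched as a small Python model. This is only an illustration under assumptions: `QueryShape`, `returns_at_most_one_row`, and `strip_redundant_clauses` are hypothetical names, the flags are loosely modeled on the kind of information a planner has about a query, and the actual check in the patch is not shown in this thread. The WITH ORDINALITY case is not modeled here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class QueryShape:
    """Hypothetical, simplified description of a query (not the real planner struct)."""
    has_aggs: bool = False
    has_group_by: bool = False
    has_target_srfs: bool = False  # set-returning function in the target list
    has_distinct: bool = False     # DISTINCT or DISTINCT ON
    has_order_by: bool = False


def returns_at_most_one_row(q: QueryShape) -> bool:
    # Aggregation without GROUP BY collapses the input to one row,
    # but an SRF in the target list can expand it again.
    return q.has_aggs and not q.has_group_by and not q.has_target_srfs


def strip_redundant_clauses(q: QueryShape) -> QueryShape:
    # Drop DISTINCT and ORDER BY only when the one-row assumption holds.
    if not returns_at_most_one_row(q):
        return q
    return replace(q, has_distinct=False, has_order_by=False)


# select distinct count(a), sum(b) from t order by sum(b): safe to strip.
simple = QueryShape(has_aggs=True, has_distinct=True, has_order_by=True)
# select distinct count(*), generate_series(1, 4) from t1: must be left alone.
with_srf = QueryShape(has_aggs=True, has_target_srfs=True, has_distinct=True)

print(strip_redundant_clauses(simple))
print(strip_redundant_clauses(with_srf))
```

In this model, the `generate_series` query above keeps its DISTINCT because the target-list SRF can produce several rows from the single aggregated row.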

fanfuxiaoran · 2024-11-18T09:41:10Z

I took a look at ORCA; it already optimizes the DISTINCT away:

explain  select  distinct(count(a)) from foo;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=0.00..526.96 rows=1 width=8)
   ->  Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..526.96 rows=1 width=8)
         ->  Partial Aggregate  (cost=0.00..526.96 rows=1 width=8)
               ->  Seq Scan on foo  (cost=0.00..500.67 rows=3333334 width=4)
 Optimizer: Pivotal Optimizer (GPORCA)
(5 rows)

Even with GROUP BY, the DISTINCT can also be removed:

explain  select  distinct(count(a)) from foo group by a ;
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..1395.69 rows=1000 width=8)
   ->  HashAggregate  (cost=0.00..1395.66 rows=334 width=8)
         Group Key: (count(a))
         ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=0.00..1395.62 rows=334 width=8)
               Hash Key: (count(a))
               ->  Streaming HashAggregate  (cost=0.00..1395.61 rows=334 width=8)
                     Group Key: count(a)
                     ->  HashAggregate  (cost=0.00..985.15 rows=3333334 width=8)
                           Group Key: a
                           Planned Partitions: 16
                           ->  Redistribute Motion 3:3  (slice3; segments: 3)  (cost=0.00..567.20 rows=3333334 width=4)
                                 Hash Key: a
                                 ->  Seq Scan on foo  (cost=0.00..500.67 rows=3333334 width=4)
 Optimizer: Pivotal Optimizer (GPORCA)
(14 rows)

since DISTINCT here acts as a function that only works within a group.

In ORCA, the function is called PexprRemoveSuperfluousDistinctInDQA.

avamingli · 2024-11-21T08:06:12Z

Yeah, see #677 (reply in thread)

src/backend/optimizer/plan/transform.c

fanfuxiaoran · 2024-11-26T09:24:30Z

ORCA removes the DISTINCT when it is applied to an aggregate expression, even if there is a GROUP BY clause. Do we need to consider that?

avamingli · 2024-11-26T09:31:36Z

We can consider that type of optimization in the future.
In line with the goals of this PR, we have optimized DISTINCT, DISTINCT ON, ORDER BY, and LIMIT.
However, in the presence of GROUP BY, the optimizations for ORDER BY and LIMIT may no longer apply.
Queries with a GROUP BY clause should be a separate optimization topic.
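
To illustrate why ORDER BY and LIMIT cannot simply be dropped once GROUP BY is present, here is a minimal sketch, again using SQLite through Python's sqlite3 module as a stand-in engine with made-up rows. With one output row per group, the sort order is observable, and it decides which row LIMIT keeps.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("create table t_distinct_sort (a int, b int, c int)")
conn.executemany(
    "insert into t_distinct_sort values (?, ?, ?)",
    [(1, 10, 100), (2, 20, 200), (2, 20, 200), (3, 30, 300)],  # made-up rows
)

# One row per distinct value of a, so the sort order matters.
ascending = conn.execute(
    "select count(a), sum(b) from t_distinct_sort group by a order by sum(b)"
).fetchall()
descending = conn.execute(
    "select count(a), sum(b) from t_distinct_sort group by a "
    "order by sum(b) desc"
).fetchall()

# LIMIT keeps a different row depending on the ORDER BY direction.
first_asc = conn.execute(
    "select count(a), sum(b) from t_distinct_sort group by a "
    "order by sum(b) limit 1"
).fetchall()
first_desc = conn.execute(
    "select count(a), sum(b) from t_distinct_sort group by a "
    "order by sum(b) desc limit 1"
).fetchall()

print(ascending, descending, first_asc, first_desc)
```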

avamingli force-pushed the opt_dist_sort_on_agg branch from 4a01111 to 1d31349 on November 4, 2024 06:58

avamingli requested a review from gfphoenix78 November 5, 2024 07:52

avamingli mentioned this pull request Nov 25, 2024

[AQUMV] Answer Aggregation Query Directly. #705

avamingli force-pushed the opt_dist_sort_on_agg branch from 1d31349 to 66614af on November 26, 2024 05:22

fanfuxiaoran reviewed Nov 26, 2024

avamingli force-pushed the opt_dist_sort_on_agg branch from 66614af to 96d4e42 on November 27, 2024 07:11
