Should explain union between 3+ queries be flattened? #4856

edongashi · 2020-12-16T07:22:27Z

explain select value from integers union select value from integers union select value from integers

query plan
----------
query (standard)
  --> union (standard)
    --> union (standard)
      --> integers (standard)
        --> integers (non-personal table)
    --> union (standard)
      --> integers (standard)
        --> integers (non-personal table)
  --> union (standard)
    --> integers (standard)

Should all unions be at the same level like joins are?

explain select count(*)
from integers i1
inner join integers i2 on i1.user_id = i2.user_id
inner join integers i3 on i1.user_id = i3.user_id

query plan
----------
query (standard)
  --> i1 (standard)
    --> integers (non-personal table)
  --> i2 (standard)
    --> integers (non-personal table)
  --> i3 (standard)
    --> integers (non-personal table)

The text was updated successfully, but these errors were encountered:

cristianberneanu · 2020-12-16T08:41:34Z

I'm not even sure joins should be on the same level.
The order of the operations might matter in some cases.
How is Postgres doing it?

edongashi · 2020-12-16T09:33:43Z

With empty dummy tables:

explain select count(*) from t1 inner join t2 on t1.id=t2.id inner join t3 on t1.id=t3.id;

                                  QUERY PLAN                                  
------------------------------------------------------------------------------
 Aggregate  (cost=190.03..190.04 rows=1 width=8)
   ->  Hash Join  (cost=134.75..183.66 rows=2550 width=0)
         Hash Cond: (t1.id = t3.id)
         ->  Hash Join  (cost=67.38..109.58 rows=2550 width=8)
               Hash Cond: (t1.id = t2.id)
               ->  Seq Scan on t1  (cost=0.00..35.50 rows=2550 width=4)
               ->  Hash  (cost=35.50..35.50 rows=2550 width=4)
                     ->  Seq Scan on t2  (cost=0.00..35.50 rows=2550 width=4)
         ->  Hash  (cost=35.50..35.50 rows=2550 width=4)

With a more real database:

explain select * 
from sales.Store
inner join sales.Customer on sales.Store.BusinessEntityID = sales.Customer.StoreID
inner join sales.SalesOrderHeader on sales.Customer.CustomerID = sales.SalesOrderHeader.CustomerID
inner join sales.SalesOrderDetail on sales.SalesOrderHeader.SalesOrderID = sales.SalesOrderDetail.SalesOrderID;

                                                                       QUERY PLAN                                                                        
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=673.43..11086.32 rows=121317 width=1010)
   Hash Cond: (customer.storeid = store.businessentityid)
   ->  Hash Join  (cost=612.65..10706.95 rows=121317 width=535)
         Hash Cond: (salesorderheader.customerid = customer.customerid)
         ->  Merge Join  (cost=0.70..9776.48 rows=121317 width=495)
               Merge Cond: (salesorderheader.salesorderid = salesorderdetail.salesorderid)
               ->  Index Scan using "PK_SalesOrderHeader_SalesOrderID" on salesorderheader  (cost=0.29..1549.26 rows=31465 width=431)
               ->  Index Scan using "PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID" on salesorderdetail  (cost=0.42..6632.30 rows=121317 width=64)
         ->  Hash  (cost=364.20..364.20 rows=19820 width=40)
               ->  Seq Scan on customer  (cost=0.00..364.20 rows=19820 width=40)
   ->  Hash  (cost=52.01..52.01 rows=701 width=475)
         ->  Seq Scan on store  (cost=0.00..52.01 rows=701 width=475)
(12 rows)

Worthy of note that join order matters in DBs. Postgres uses heuristics (GEQO) to determine a good join order. When there's a low number of paths it does an exhaustive search to find the optimal solution.

cristianberneanu · 2020-12-17T13:47:33Z

I gave this more thought and I think the joins display is the problematic part. Besides the fact that the join tree is collapsed, the information about which joins are emulated is not shown either. Not convinced it is worth fixing though.

edongashi · 2020-12-17T14:56:13Z

For postgres UNION returns a flat plan. The union/append operator is monoidal, meaning it's associative.

postgres=# explain select * from t1 union select * from t2 union select * from t3;
                            QUERY PLAN                            
------------------------------------------------------------------
 HashAggregate  (cost=240.38..316.88 rows=7650 width=4)
   Group Key: t1.id
   ->  Append  (cost=0.00..221.25 rows=7650 width=4)
         ->  Seq Scan on t1  (cost=0.00..35.50 rows=2550 width=4)
         ->  Seq Scan on t2  (cost=0.00..35.50 rows=2550 width=4)
         ->  Seq Scan on t3  (cost=0.00..35.50 rows=2550 width=4)
(6 rows)

For JOINs it's reasonable that it's a tree in DBs, because performance can vary between different paths.
It also shows JOIN condition between relations, which we don't currently do. I think showing join conditions in the plan would be good.

Though if we indicate any join priority it may be misleading because the DB to which it's offloaded may decide on a different order. At least we should show the order it's done when emulated.

cristianberneanu · 2020-12-17T15:09:18Z

The union/append operator is monoidal, meaning it's associative.

This is not true in the general case.

(select 1 union distinct select 2) union all select 1;

is different from

select 1 union distinct (select 2 union all select 1);

For JOINs it's reasonable that it's a tree in DBs, because performance can vary between different paths.

I don't think the JOIN performance is relevant here. OUTER JOINs are not associative.

edongashi · 2020-12-17T15:27:18Z

I didn't know we supported union all. Should that be indicated in the explain plan? Currently both of the following return the same explanation:

explain select count(*) from test union all select count(*) from test

explain select count(*) from test union distinct select count(*) from test

cristianberneanu · 2020-12-17T15:35:49Z

Should that be indicated in the explain plan?

Yes, I think that would make sense.

edongashi · 2020-12-17T15:37:05Z

I think we shouldn't try to strictly respect associativity.

The original idea of explain has been to tell types of queries and see what's emulated:
https://attack.aircloak.com/docs/#/datastores?id=emulation-overview

I'm not sure how emulated joins work and if we currently represent those.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Should explain union between 3+ queries be flattened? #4856

Should explain union between 3+ queries be flattened? #4856

edongashi commented Dec 16, 2020 •

edited

Loading

cristianberneanu commented Dec 16, 2020

edongashi commented Dec 16, 2020

cristianberneanu commented Dec 17, 2020

edongashi commented Dec 17, 2020

cristianberneanu commented Dec 17, 2020

edongashi commented Dec 17, 2020

cristianberneanu commented Dec 17, 2020

edongashi commented Dec 17, 2020

Should explain union between 3+ queries be flattened? #4856

Should explain union between 3+ queries be flattened? #4856

Comments

edongashi commented Dec 16, 2020 • edited Loading

cristianberneanu commented Dec 16, 2020

edongashi commented Dec 16, 2020

cristianberneanu commented Dec 17, 2020

edongashi commented Dec 17, 2020

cristianberneanu commented Dec 17, 2020

edongashi commented Dec 17, 2020

cristianberneanu commented Dec 17, 2020

edongashi commented Dec 17, 2020

edongashi commented Dec 16, 2020 •

edited

Loading