Skip to content
This repository has been archived by the owner on Jul 2, 2024. It is now read-only.

Should explain union between 3+ queries be flattened? #4856

Open
edongashi opened this issue Dec 16, 2020 · 8 comments
Open

Should explain union between 3+ queries be flattened? #4856

edongashi opened this issue Dec 16, 2020 · 8 comments

Comments

@edongashi
Copy link
Member

edongashi commented Dec 16, 2020

explain select value from integers union select value from integers union select value from integers
query plan
----------
query (standard)
  --> union (standard)
    --> union (standard)
      --> integers (standard)
        --> integers (non-personal table)
    --> union (standard)
      --> integers (standard)
        --> integers (non-personal table)
  --> union (standard)
    --> integers (standard)

Should all unions be at the same level like joins are?

explain select count(*)
from integers i1
inner join integers i2 on i1.user_id = i2.user_id
inner join integers i3 on i1.user_id = i3.user_id
query plan
----------
query (standard)
  --> i1 (standard)
    --> integers (non-personal table)
  --> i2 (standard)
    --> integers (non-personal table)
  --> i3 (standard)
    --> integers (non-personal table)
@cristianberneanu
Copy link
Member

I'm not even sure joins should be on the same level.
The order of the operations might matter in some cases.
How is Postgres doing it?

@edongashi
Copy link
Member Author

With empty dummy tables:

explain select count(*) from t1 inner join t2 on t1.id=t2.id inner join t3 on t1.id=t3.id;
                                  QUERY PLAN                                  
------------------------------------------------------------------------------
 Aggregate  (cost=190.03..190.04 rows=1 width=8)
   ->  Hash Join  (cost=134.75..183.66 rows=2550 width=0)
         Hash Cond: (t1.id = t3.id)
         ->  Hash Join  (cost=67.38..109.58 rows=2550 width=8)
               Hash Cond: (t1.id = t2.id)
               ->  Seq Scan on t1  (cost=0.00..35.50 rows=2550 width=4)
               ->  Hash  (cost=35.50..35.50 rows=2550 width=4)
                     ->  Seq Scan on t2  (cost=0.00..35.50 rows=2550 width=4)
         ->  Hash  (cost=35.50..35.50 rows=2550 width=4)

With a more real database:

explain select * 
from sales.Store
inner join sales.Customer on sales.Store.BusinessEntityID = sales.Customer.StoreID
inner join sales.SalesOrderHeader on sales.Customer.CustomerID = sales.SalesOrderHeader.CustomerID
inner join sales.SalesOrderDetail on sales.SalesOrderHeader.SalesOrderID = sales.SalesOrderDetail.SalesOrderID;
                                                                       QUERY PLAN                                                                        
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=673.43..11086.32 rows=121317 width=1010)
   Hash Cond: (customer.storeid = store.businessentityid)
   ->  Hash Join  (cost=612.65..10706.95 rows=121317 width=535)
         Hash Cond: (salesorderheader.customerid = customer.customerid)
         ->  Merge Join  (cost=0.70..9776.48 rows=121317 width=495)
               Merge Cond: (salesorderheader.salesorderid = salesorderdetail.salesorderid)
               ->  Index Scan using "PK_SalesOrderHeader_SalesOrderID" on salesorderheader  (cost=0.29..1549.26 rows=31465 width=431)
               ->  Index Scan using "PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID" on salesorderdetail  (cost=0.42..6632.30 rows=121317 width=64)
         ->  Hash  (cost=364.20..364.20 rows=19820 width=40)
               ->  Seq Scan on customer  (cost=0.00..364.20 rows=19820 width=40)
   ->  Hash  (cost=52.01..52.01 rows=701 width=475)
         ->  Seq Scan on store  (cost=0.00..52.01 rows=701 width=475)
(12 rows)

Worthy of note that join order matters in DBs. Postgres uses heuristics (GEQO) to determine a good join order. When there's a low number of paths it does an exhaustive search to find the optimal solution.

@cristianberneanu
Copy link
Member

I gave this more thought and I think the joins display is the problematic part. Besides the fact that the join tree is collapsed, the information about which joins are emulated is not shown either. Not convinced it is worth fixing though.

@edongashi
Copy link
Member Author

For postgres UNION returns a flat plan. The union/append operator is monoidal, meaning it's associative.

postgres=# explain select * from t1 union select * from t2 union select * from t3;
                            QUERY PLAN                            
------------------------------------------------------------------
 HashAggregate  (cost=240.38..316.88 rows=7650 width=4)
   Group Key: t1.id
   ->  Append  (cost=0.00..221.25 rows=7650 width=4)
         ->  Seq Scan on t1  (cost=0.00..35.50 rows=2550 width=4)
         ->  Seq Scan on t2  (cost=0.00..35.50 rows=2550 width=4)
         ->  Seq Scan on t3  (cost=0.00..35.50 rows=2550 width=4)
(6 rows)

For JOINs it's reasonable that it's a tree in DBs, because performance can vary between different paths.
It also shows JOIN condition between relations, which we don't currently do. I think showing join conditions in the plan would be good.

Though if we indicate any join priority it may be misleading because the DB to which it's offloaded may decide on a different order. At least we should show the order it's done when emulated.

@cristianberneanu
Copy link
Member

The union/append operator is monoidal, meaning it's associative.

This is not true in the general case.

(select 1 union distinct select 2) union all select 1;

is different from

select 1 union distinct (select 2 union all select 1);

For JOINs it's reasonable that it's a tree in DBs, because performance can vary between different paths.

I don't think the JOIN performance is relevant here. OUTER JOINs are not associative.

@edongashi
Copy link
Member Author

I didn't know we supported union all. Should that be indicated in the explain plan? Currently both of the following return the same explanation:

explain select count(*) from test union all select count(*) from test
explain select count(*) from test union distinct select count(*) from test

@cristianberneanu
Copy link
Member

Should that be indicated in the explain plan?

Yes, I think that would make sense.

@edongashi
Copy link
Member Author

I think we shouldn't try to strictly respect associativity.

The original idea of explain has been to tell types of queries and see what's emulated:
https://attack.aircloak.com/docs/#/datastores?id=emulation-overview

I'm not sure how emulated joins work and if we currently represent those.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants