-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New Optimizer
(a place to jot down notes about the new plan representation & optimizer)
Work in progress being tracked under https://github.com/martint/presto/tree/optimizer
-
Query plan is modeled as a "program" using intermediate representation comprised by function calls and assignments. The logical type of each expression is some form of relation/collection/stream-of-rows.
-
For each relational expression, we can derive:
- Logical properties such as predicate, uniqueness, type (schema), functional dependencies between fields.
- Physical properties such as global partitioning, local ordering & grouping
-
Functions can be logical and/or physical (i.e., if they can be directly executed: join vs hash-join)
-
Possibly multiple optimizer implementations: heuristics/rewrites, cost-based, etc (TBD)
-
Cost-based optimizer
- Cascades-style
- Components:
- Rules
- Pattern + named arguments + required properties
- Can produce multiple expressions
- Types: logical transformation (e.g., push filter through project), implementation (join -> hash join), enforcement (sort before merge). The may not need to be explicitly identified as such.
- Memo
- Holds equivalence classes (name + list of expressions)
- Memoizes optimization goals (i.e., best expression for a given equivalence class and physical requirements)
- Cost
- Rules
- optimizer "state"
- tracks known equivalences
- memoizes optimization goals (expression + required properties)
- used to decide whether a rule can match a give expression tree shape
- support capturing variables: filter(x:project(<any>))
To allow matching based on attributes not expressible via nesting structure. For example:
a:filter(b:project(<any>))
where,
- isDeterministic(a.condition)
- b.projections.allMatch(p -> isDeterministic(p))
Some rules may need to match and be able to inspect arbitrary trees that cannot be expressed by a simple structural pattern.
Given pattern f1(x:<any>)
and the following equivalence structure:
a := {f1(b)}
b := {g1(c), g2(d)}
c := {k1, k2}
d := {j1, j2}
shallow iteration produces:
f1(x), x = g1(c), c is opaque (trying to resolve it causes an error)
f1(x), x = g2(d), d is opaque (trying to resolve it causes an error)
deep iteration produces:
f1(x), x = g1(c), c = k1
f1(x), x = g1(c), c = k2
f1(x), x = g2(d), d = j1
f1(x), x = g2(d), d = j2
- For a pattern like
f1(<any>, <any>)
and the following equivalence:
a := { f(b, b) }
b := { k1, k2 }
Should the matcher assume b
refers to the same underlying expression and produce?
f(b, b), b = k1
f(b, b), b = k2
or should it enumerate the cartesian product?
f(b1, b2), b1 = k1, b2 = k1
f(b1, b2), b1 = k1, b2 = k2
f(b1, b2), b1 = k2, b2 = k1
f(b1, b2), b1 = k2, b2 = k2
One downside of enumerating the cartesian product is that it requires creating new instances of the FunctionCall object on each match to replace the variable references for each input with a synthetic ones. In the first scenario, the matcher can hand the (immutable) expression object directly to the rule.
On the other hand, it may result in not exploring the entire set of alternatives.
start:
- break up expression into single-assignment expression
- add each assignment to the memo in a separate equivalence class
- optimize(root class, unbounded cost, no physical reqs)
optimize(equivalence class, cost bound, requirements):
- initialize exploration queue (rule + top operator in equivalence class)
- find potential match candidates and add them to queue
- while queue is not empty
- enumerate bindings for each named argument (by iterating over all expressions
in each equivalence class that's part of the match)
- if binding + physical requirements can be handled by rule
- apply rule
- for each expression generated by rule
- add to memo
- if top function is physical
- determine cost bound for children
- for each input
- derive required physical properties & cost upper bound
- optimize corresponding equivalence class
with required properties and upper bound
- update max bound for remaining children
- find additional potential matches and enqueue
- properties rules must satisfy:
- they need to converge. E.g.,
limit(union(x)) -> limit(union(limit(x)))
. Can cause the rule to fire to infinity. - they need to avoid emitting an expression that is equal to the input expression. Otherwise, the engine may loop forever.
- they need to converge. E.g.,
- pattern matching needs to be able to handle cycles and avoid considering groups that have been visited in the matching process. Otherwise, it can cause wildcard matchers to recurse forever.
- need a way for rules to tell the engine that it should suppress invocation for other expressions. E.g.,
union(union(union(x, y), z), w) -> union(x, y, z, w)
will fire for every sub-expression unnecessarily.
- how to prioritize exploration candidates
- memoize rule application to prevent re-exploration in case of repeated optimization calls (with different physical requirements)
- we may need a way for a rule to short-circuit other exploration tasks for a given group (e.g., after constant folding)
- we may need a way for a rule to prevent application of the same rule on expressions produced by the first application (e.g., join commutativity)
- how should enforcer rules work?
- guaranteeing evaluation order semantics vs predicate pushdown (e.g, WHERE clause vs predicates in subquery)
These are some examples from an early prototype. They are visualizations of the memo and transformation processes for simple rules. Each cluster is an equivalence class. Circles are the names of the equivalence class, and the rectangles are the expressions in the class. Arrows from expression to equivalence classes are parent-child relationships. Blue arrows indicate the result of applying a transformation rule.