-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor Optimizer
to use owned plans and TreeNode
API (10% faster planning)
#9948
Conversation
613fdae
to
b950a9e
Compare
Optimizer
to use TreeNode
API
b950a9e
to
ef189fb
Compare
@@ -59,7 +59,7 @@ pub fn main() -> Result<()> { | |||
|
|||
// then run the optimizer with our custom rule | |||
let optimizer = Optimizer::with_rules(vec![Arc::new(MyOptimizerRule {})]); | |||
let optimized_plan = optimizer.optimize(&analyzed_plan, &config, observe)?; | |||
let optimized_plan = optimizer.optimize(analyzed_plan, &config, observe)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This illustrates the API change -- the optimizer now takes an owned plan rather than a reference
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A great progress!
@@ -110,7 +110,7 @@ fn test_sql(sql: &str) -> Result<LogicalPlan> { | |||
let optimizer = Optimizer::new(); | |||
// analyze and optimize the logical plan | |||
let plan = analyzer.execute_and_check(&plan, config.options(), |_, _| {})?; | |||
optimizer.optimize(&plan, &config, |_, _| {}) | |||
optimizer.optimize(plan, &config, |_, _| {}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A large amount of this PR is changes to test to pass in an owned plan
|
||
let formatted_plan = format!("{optimized_plan:?}"); | ||
assert_eq!(formatted_plan, expected); | ||
assert_eq!(plan.schema(), optimized_plan.schema()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the tests to call Optimizer::optimize
directly, which already checks the schema doesn't change, so this test is redundant
This applies to several other changes in this PR
/// | ||
/// Notice: **sometime** result after optimize still can be optimized, we need apply again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think this comment is applicable anymore -- the optimizer handles the recursion internally as well as applying multiple optimizer passes
@@ -356,97 +423,22 @@ impl Optimizer { | |||
debug!("Optimizer took {} ms", start_time.elapsed().as_millis()); | |||
Ok(new_plan) | |||
} | |||
|
|||
fn optimize_node( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code implemented plan recursion within the optimizer and is (now) redundant with the TreeNode
API
Field { name: \"c\", data_type: UInt32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, \ | ||
field_qualifiers: [Some(Bare { table: \"test\" }), Some(Bare { table: \"test\" }), Some(Bare { table: \"test\" })], \ | ||
functional_dependencies: FunctionalDependencies { deps: [] } }, \ | ||
"Optimizer rule 'get table_scan rule' failed\n\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original error actually is incorrect that it reports the reversed schemas (the "new schema" was actually the original schema)
ef189fb
to
d951289
Compare
d951289
to
a755338
Compare
Optimizer
to use TreeNode
APIOptimizer
to use TreeNode
API (10% faster planning)
52f3a54
to
2d5e154
Compare
2d5e154
to
a282d6a
Compare
a282d6a
to
a4cb731
Compare
Optimizer
to use TreeNode
API (10% faster planning)Optimizer
to use owned plans and TreeNode
API (10% faster planning)
} | ||
|
||
/// Recursively rewrites LogicalPlans | ||
struct Rewriter<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
datafusion/optimizer/src/optimizer.rs
has all the important changes (to use the TreeNode API and stop copying)
&OptimizerContext::new(), | ||
)? | ||
.unwrap_or_else(|| plan.clone()); | ||
let optimized_plan = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the optimizer correctly applies rules recursively now, there is no need to explicitly call optimize recursively
)? | ||
.unwrap_or_else(|| plan.clone()); | ||
// Apply the rule once | ||
let opt_context = OptimizerContext::new().with_max_passes(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this some of the tests don't pass. By default the optimizer runs a few times until no changes are detected. Limiting to 1 pass mimics the previous test behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about this. It seems that the previous code does not limit the pass to 1. Why do we need to limit it now to have the same behavior as the previous one? 😕
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My explanation goes like: The loop that applies the rule more than once calls optimize_recursively
each time
This test only called optimze_recursively
once (directly) and thus the OptimizeRule
is only applied once
When I rewrote the test to use Optimizer::optimize
the loop will now kick in and so the OptimizeRule
will be run several times unless we set with_max_passes
This same reasoning applies to the other tests, but apparently they get the same answer when applied more than once
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see!
@jackwener since you implemented some of the original optimizer recursion I wonder if you would have some time to review this PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks pretty nice now!! 🚀
|
||
let result = match rule.apply_order() { | ||
// optimizer handles recursion | ||
Some(apply_order) => new_plan.rewrite(&mut Rewriter::new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does rename it to rewrite_recurisvely
more straightforward?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a generic api (so using rewrite makes sense to me), it's not introduced by this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got a question here.
In previous code, optimize_inputs
get the plan.inputs()
. how does rewrite
get the plan.inputs()
here? How does childnode
in rewrite
equals to plan.inputs()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see the map_children
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, exactly
(eventually) calls map_children
: https://github.com/apache/arrow-datafusion/blob/cb21404bd3736ff9a6d8d443a67c64ece4c551a9/datafusion/common/src/tree_node.rs#L27-L31
This is the beauty of these TreeNode
APIs and @peter-toth's plan to make them all consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nice job to me, thanks @alamb!
Thank you very much @jackwener and @jayzhan211 for the reviews 🙏 |
Thanks @alamb for this work. |
Note: this looks like a large PR, but many of the changes are lines that removed
&
Which issue does this PR close?
Part of #9637 (stop copying
LogicalPlan
s in the Optimizer) and #8913 (unified TreeNode rewrite)Rationale for this change
The current structure of the
Optimizer
copiesLogicalPlan
s a large number of times. This is both slow as well as requires a large number of allocationsAfter #9999, the
TreeNode
API can handle rewritingLogicalPlan
efficiently withoutclone
.Thus it makes sense to use the TreeNode API in the optimizer, both because I think the code is simpler as well as to take advantage of the performance improvements in TreeNode API.
What changes are included in this PR?
Optimizer
to use TreeNode APIOptimizer::optimize
to take an owned LogicalPlan rather than force a copyAre these changes tested?
By existing CI
Performance benchmarks: Planning is 10% faster for TPCH, 13% faster for TPCDS
Details
Are there any user-facing changes?
There is a small API change:
Optimizer::optimize
now takes an ownedLogicalPlan
rather a reference (which forces a copy)Planned follow on task