-
-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add branch_id
to distinguish between reusable branches and pipeline breakers
#883
base: main
Are you sure you want to change the base?
Conversation
dask_expr/tests/test_groupby.py
Outdated
@@ -107,6 +107,7 @@ def test_groupby_reduction_optimize(pdf, df): | |||
ops = [ | |||
op for op in agg.expr.optimize(fuse=False).walk() if isinstance(op, FromPandas) | |||
] | |||
agg.simplify().pprint() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
debug print?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes, thx
dask_expr/_expr.py
Outdated
@@ -2758,24 +2767,37 @@ def optimize(expr: Expr, fuse: bool = True) -> Expr: | |||
Input expression to optimize | |||
fuse: | |||
whether or not to turn on blockwise fusion | |||
common_subplan_elimination : bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm not mistaken, our default is actually a very radical form of "common subplan elimination". I think this feature is actually quite the opposite and rather something like "replicate common subplan"? I guess other engines are doing it the other way round, aren't they?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took the name from here: https://docs.pola.rs/py-polars/html/reference/lazyframe/api/polars.LazyFrame.explain.html
But I don't have strong feelings either way
And should have read the actual documentation, yes you are correct this should be the other way round
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also https://en.wikipedia.org/wiki/Common_subexpression_elimination with a simple example =)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming is hard. @hendrikmakait any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry that I'm late to the party. dask-expr
approaches CSE from the other end, defaulting to as much CSE as possible. This makes it less intuitive. As mentioned in #896, common_subplan_elimination=False
does not mean that we don't do any CSE but instead that we try to reduce it. I'm fine with the argument name for now that you've cleaned it up, let's see if this leads to some confusion in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding Common Subexpression Elimination vs Common Subplan Elimination: I think Common Subexpression Elimination is more common for data systems and compilers/IR-based systems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly. polars
seems to have both subexpression (https://github.com/pola-rs/polars/blob/128803b237dc13d0522c22dbccae1257ae30477e/crates/polars-plan/src/logical_plan/optimizer/cse_expr.rs) as well as subplan elimination (https://github.com/pola-rs/polars/blob/128803b237dc13d0522c22dbccae1257ae30477e/crates/polars-plan/src/logical_plan/optimizer/cse.rs).
I'm not entirely sure what the difference is, but from what I understand cse_expr
deals with eliminating duplication within a single expressions (a.sum() + b.sum() + a.sum()
) and cse
deals with what eliminating subgraphs across the entire graph.
That is, cse_expr
corresponds to local CSE and cse
to global CSE in the link Florian shared.
dask_expr/_core.py
Outdated
@functools.cached_property | ||
def _branch_id(self): | ||
return self.operands[-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about just storing this as an attribute? Why does this need to be an operand?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having it in the operands makes sure that it is used for _name
, which is what we mostly care about. If that is overridden and forgotten, then the whole thing won't work properly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get that. I'm just concerned this is messing with other things since operands
are everywhere. If it works I'm fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think to "avoid" the kind of confusion I'm thinking about you introduced argument_opearands but this may also be confusing to developers
…ation_2 # Conflicts: # dask_expr/_core.py # dask_expr/_expr.py
…ation # Conflicts: # dask_expr/tests/test_shuffle.py
ref #854
Couple of thoughts here:
enabling / disabling reuse of branches seems like a completely different thing compared to what we do in simplify, so it felt wrong adding it in here and we would need more special casing on top of it. I want an easy way to opt out of this, so having a more central approach seems a lot better
branch_id needs to be available on every expression, so that we can move this around freely without worrying if the expression supports it.
it has to be available in new, because it influences how the expression tokenises.
I decided to bubble the expression step by step but not keep it on intermediate layers (e.g. non Reductions and non IO layers). We would have to consider the branch_id in every lower and simplify operation to avoid dropping it in between, which introduces a lot of uncertainty and mental lode which didn't seem to be worth the effort. The step by step approach gives us more flexibility how we handle other Expressions that can consume the branch_id (thinking about merges and shuffles at the moment)
This is not ready to merge, it needs more tests and support for merges and shuffles at least, maybe more. But it does what it's supposed to now