Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add branch_id to distinguish between reusable branches and pipeline breakers #883

Open
wants to merge 30 commits into
base: main
Choose a base branch
from

Conversation

phofl
Copy link
Collaborator

@phofl phofl commented Feb 16, 2024

ref #854

Couple of thoughts here:

  • enabling / disabling reuse of branches seems like a completely different thing compared to what we do in simplify, so it felt wrong adding it in here and we would need more special casing on top of it. I want an easy way to opt out of this, so having a more central approach seems a lot better

  • branch_id needs to be available on every expression, so that we can move this around freely without worrying if the expression supports it.

  • it has to be available in new, because it influences how the expression tokenises.

  • I decided to bubble the expression step by step but not keep it on intermediate layers (e.g. non Reductions and non IO layers). We would have to consider the branch_id in every lower and simplify operation to avoid dropping it in between, which introduces a lot of uncertainty and mental lode which didn't seem to be worth the effort. The step by step approach gives us more flexibility how we handle other Expressions that can consume the branch_id (thinking about merges and shuffles at the moment)

This is not ready to merge, it needs more tests and support for merges and shuffles at least, maybe more. But it does what it's supposed to now

@@ -107,6 +107,7 @@ def test_groupby_reduction_optimize(pdf, df):
ops = [
op for op in agg.expr.optimize(fuse=False).walk() if isinstance(op, FromPandas)
]
agg.simplify().pprint()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug print?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, thx

@@ -2758,24 +2767,37 @@ def optimize(expr: Expr, fuse: bool = True) -> Expr:
Input expression to optimize
fuse:
whether or not to turn on blockwise fusion
common_subplan_elimination : bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not mistaken, our default is actually a very radical form of "common subplan elimination". I think this feature is actually quite the opposite and rather something like "replicate common subplan"? I guess other engines are doing it the other way round, aren't they?

Copy link
Collaborator Author

@phofl phofl Feb 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took the name from here: https://docs.pola.rs/py-polars/html/reference/lazyframe/api/polars.LazyFrame.explain.html

But I don't have strong feelings either way

And should have read the actual documentation, yes you are correct this should be the other way round

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming is hard. @hendrikmakait any thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that I'm late to the party. dask-expr approaches CSE from the other end, defaulting to as much CSE as possible. This makes it less intuitive. As mentioned in #896, common_subplan_elimination=False does not mean that we don't do any CSE but instead that we try to reduce it. I'm fine with the argument name for now that you've cleaned it up, let's see if this leads to some confusion in the future.

Copy link
Member

@hendrikmakait hendrikmakait Feb 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding Common Subexpression Elimination vs Common Subplan Elimination: I think Common Subexpression Elimination is more common for data systems and compilers/IR-based systems.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly. polars seems to have both subexpression (https://github.com/pola-rs/polars/blob/128803b237dc13d0522c22dbccae1257ae30477e/crates/polars-plan/src/logical_plan/optimizer/cse_expr.rs) as well as subplan elimination (https://github.com/pola-rs/polars/blob/128803b237dc13d0522c22dbccae1257ae30477e/crates/polars-plan/src/logical_plan/optimizer/cse.rs).

I'm not entirely sure what the difference is, but from what I understand cse_expr deals with eliminating duplication within a single expressions (a.sum() + b.sum() + a.sum()) and cse deals with what eliminating subgraphs across the entire graph.

That is, cse_expr corresponds to local CSE and cse to global CSE in the link Florian shared.

Comment on lines 64 to 66
@functools.cached_property
def _branch_id(self):
return self.operands[-1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about just storing this as an attribute? Why does this need to be an operand?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having it in the operands makes sure that it is used for _name, which is what we mostly care about. If that is overridden and forgotten, then the whole thing won't work properly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get that. I'm just concerned this is messing with other things since operands are everywhere. If it works I'm fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think to "avoid" the kind of confusion I'm thinking about you introduced argument_opearands but this may also be confusing to developers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants