[WIP] Allow expressions to be shipped to the scheduler #294
base: main
Conversation
def __dask_tokenize__(self):
    return self.expr._name
This is just defined as part of the collections protocol. Not sure if it is actually required
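For reference, a minimal self-contained sketch of what this hook buys us: dask.base.tokenize uses __dask_tokenize__ when it is present, so two wrappers around the same expression name produce the same token. MyCollection is a made-up stand-in, not part of this PR.

from dask.base import tokenize

class MyCollection:
    """Made-up stand-in for a dask-expr collection, to illustrate the hook."""

    def __init__(self, name):
        self._name = name

    def __dask_tokenize__(self):
        # tokenize() picks this up when defined, so the token is the
        # deterministic expression name instead of a hash of the object state
        return self._name

assert tokenize(MyCollection("sum-abc")) == tokenize(MyCollection("sum-abc"))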
def __dask_keys__(self):
    out = self.expr
    out = out.lower_completely()
    return out.__dask_keys__()
Having keys defined at the collection level feels like an abstraction leak. Among other things, this is what threw me off for a while when implementing this the first time: I had a hard time distinguishing collections from graphs in the existing code. I find this a bit clearer now in the above PRs.
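For anyone skimming: the keys a DataFrame-like collection returns follow the usual (name, partition_index) convention, one per output partition, which is why they only exist meaningfully after lowering to a concrete expression. A tiny illustrative sketch (the name is made up):

def expected_keys(name: str, npartitions: int) -> list:
    # One (name, i) tuple per output partition; dask.array would nest these
    # lists further, one level per array dimension.
    return [(name, i) for i in range(npartitions)]

print(expected_keys("lowered-expr-abc", 3))
# [('lowered-expr-abc', 0), ('lowered-expr-abc', 1), ('lowered-expr-abc', 2)]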
dask_expr/_collection.py
def finalize_compute(self) -> FrameBase:
    from ._repartition import RepartitionToFewer

    if self.npartitions > 1:
        return new_collection(RepartitionToFewer(self.expr, 1))
    return self
This is not really new, but instead of doing the graph mutations ourselves by calling postcompute etc. in the client or in dask, we now effectively just modify the expression to describe what the end result should look like. In this case, that is just a concat of all partitions.
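To make the "concat of all partitions" point concrete with plain pandas (purely illustrative; nothing here is dask-expr API):

import pandas as pd

# "Repartition to a single partition" is, for the compute path, equivalent to
# concatenating the partition-wise results into one object.
partitions = [pd.DataFrame({"x": [0, 1]}), pd.DataFrame({"x": [2, 3]})]
final = pd.concat(partitions)
print(len(final))  # 4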
dask_expr/_collection.py
def postpersist(self, futures: dict) -> NewDaskCollection:
    return from_graph(futures, self._meta, self.divisions, self._name)
This is also not really new, but I moved away from the "method that returns a callable [...]" API to just calling the callable. I haven't checked with any downstream projects yet whether this complex callable construct is actually required. Not married to anything here.
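For contrast, a rough sketch of the two shapes of the API. The old one follows dask's current custom-collections protocol, where __dask_postpersist__ returns a rebuild callable plus extra arguments; the new one is the direct method from the diff above. Bodies are elided; this is an illustration, not the actual implementation.

class OldStyleCollection:
    # Current dask protocol: return a callable plus args; the caller later
    # invokes rebuild(futures_dict, *args) to get a new collection back.
    def __dask_postpersist__(self):
        return OldStyleCollection._rebuild, (self._meta, self.divisions, self._name)

    @staticmethod
    def _rebuild(futures: dict, meta, divisions, name):
        ...  # construct a new collection around the futures


class NewStyleCollection:
    # This PR: skip the indirection and let the collection rebuild itself.
    def postpersist(self, futures: dict):
        ...  # e.g. from_graph(futures, self._meta, self.divisions, self._name)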
dask_expr/_expr.py
@@ -699,8 +702,10 @@ def dtypes(self):
     def _meta(self):
         raise NotImplementedError()

-    def __dask_graph__(self):
+    def _materialize(self):
The underscore is probably a bit silly. Will likely change this again
dask_expr/_expr.py
from dask.typing import DaskGraph

# Note: subclassing isn't required. This is just for the prototype to have a
# check for abstractmethods but the runtime checks for duck-typing/protocol only
class Expr(DaskGraph):
As the note says, there is no need for subclassing; this is just for prototyping (or we keep it, no strong preference).
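A small self-contained illustration of the alternative, i.e. a runtime-checkable Protocol with no subclassing at all (the names here are invented for the example, not the actual dask.typing ones):

from typing import Protocol, runtime_checkable

@runtime_checkable
class GraphFactory(Protocol):
    # Invented stand-in for the protocol introduced in the referenced dask PRs
    def materialize(self) -> dict: ...

class MyExpr:
    # No inheritance from GraphFactory needed
    def materialize(self) -> dict:
        return {}

assert isinstance(MyExpr(), GraphFactory)  # duck-typed structural check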
CI is obviously failing because I haven't set up CI to run against dask / distributed. I'll clean this up soon on the other PRs as well.
There is only one caveat so far: Since we're delaying materialization until the graph is on the scheduler and since we don't have a concept of
This has the shortcoming that we will not be performing a "multi-collections" optimization and we will not order the entire graph at once. We will still deduplicate keys that are in both collections. I find this approach a decent compromise to get started. If we notice this to be a severe problem we can think about a

Edit: The fact that we're calling dask.order on the two above collections separately may cause the performance of the combined operation to be sensitive to ordering. I do suspect, though, that this is mostly an academic problem and real-world use cases should not be impacted. This requires a bit of testing.
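A self-contained toy example of the "deduplicate keys that are in both collections" part: even if each collection is optimized and ordered on its own, merging the two materialized graphs by key collapses shared tasks. Task contents here are arbitrary.

def load_chunk(i):
    return list(range(i, i + 3))

# Two materialized graphs that happen to share the ("load", 0) task.
graph_a = {("load", 0): (load_chunk, 0), ("sum", 0): (sum, ("load", 0))}
graph_b = {("load", 0): (load_chunk, 0), ("max", 0): (max, ("load", 0))}

merged = {**graph_a, **graph_b}
assert len(merged) == 3  # the shared task appears only once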
What if we made a
dask_expr/_collection.py
     @property
     def dask(self):
-        return self.__dask_graph__()
+        # FIXME: This is highly problematic. Defining this as a property can
+        # cause very unfortunate materializations. Even a mere hasattr(obj,
+        # "dask") check already triggers this since it's a property, not even a
+        # method.
+        return self.__dask_graph_factory__().optimize().materialize()
What happens if we remove it completely? I guess some bits in xarray and pint will break, but there will be work to be done in xarray/pint anyway, so I don't see a problem.
        return self.__dask_graph_factory__().optimize().materialize()

    def finalize_compute(self):
        return new_collection(Repartition(self.expr, 1))
This is very problematic in the fairly common use case where the client mounts a lot more memory than a single worker. This forces the whole object to be unnecessarily collected onto a single worker and then sent to the client, whereas we could just have the client fetch separate partitions from separate workers (which may or may not happen all at once if it needs to transit through the scheduler).
This replicates the issue with the current finalizer methods in dask/dask, which are created by dask.compute(df) but are skipped by df.compute().
Memory considerations aside, bouncing through a single worker instead of collecting it on the client directly is adding latency.
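A hedged sketch of the alternative being described here, written against today's distributed API and classic dask.dataframe purely for illustration: persist the collection, then gather the individual partition futures straight to the client and concatenate there, rather than funnelling everything through one worker first.

import pandas as pd
import dask.dataframe as dd
from distributed import Client
from distributed.client import futures_of

client = Client(processes=False)  # in-process cluster, just for the sketch
df = dd.from_pandas(pd.DataFrame({"x": range(100)}), npartitions=4)

persisted = client.persist(df)
parts = client.gather(futures_of(persisted))  # one pandas object per partition
result = pd.concat(parts)                     # assembled on the client, not on a worker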
this is exactly how it is done right now and I don't intend to touch that behavior now
had some staged comments...
dask_expr/_expr.py
class Tuple(Expr):
    def __getitem__(self, other):
        return self.operands[other]

    def _layer(self) -> dict:
        return toolz.merge(op._layer() for op in self.operands)

    def __dask_output_keys__(self) -> list:
        return list(flatten(op.__dask_output_keys__() for op in self.operands))

    def __len__(self):
        return len(self.operands)

    def __iter__(self):
        return iter(self.operands)
From what I can tell, the tuple works out well enough for our purpose here.
@classmethod
def combine_factories(cls, *exprs: Expr, **kwargs) -> Expr:
    return Tuple(*exprs)
This is mostly syntactic sugar and I don't know if I want to keep it. I see a way forward to just move HLGs and old-style collections to the new protocol and nuke a lot of compat code; in that case, such a hook would be useful. For now, you can ignore this.
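If it stays, usage might look roughly like the following. This is purely hypothetical pseudo-usage, not code from the PR; df1 and df2 stand for any two dask-expr collections.

# Hypothetical: dask.compute(df1, df2) could hand both expression factories to
# the hook instead of merging already-materialized graphs.
combined = Expr.combine_factories(df1.expr, df2.expr)  # -> Tuple(df1.expr, df2.expr)
layer = combined._layer()                  # merged layers of both expressions
keys = combined.__dask_output_keys__()     # output keys of both operands, flattened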
dask_expr/_collection.py
     def dask(self):
-        return self.__dask_graph__()
+        # FIXME: This is highly problematic. Defining this as a property can
+        # cause very unfortunate materializations. Even a mere hasattr(obj,
+        # "dask") check already triggers this since it's a property, not even a
+        # method.
+        return self.__dask_graph_factory__().optimize().materialize()
I guess this is the most controversial thing. I would actually like to throw this out since it triggers materialization soooo easily. Even an attribute lookup is sufficient to do this.
FWIW I encountered similar problems with the divisions property, which triggers not only materialization but quantile computation!
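A tiny self-contained demo of why a property is so easy to trigger accidentally (nothing dask-specific here):

class Demo:
    @property
    def dask(self):
        print("materializing...")  # stands in for an expensive graph materialization
        return {}

# Even hasattr() evaluates the property, so a mere capability check pays the
# full materialization cost.
hasattr(Demo(), "dask")  # prints "materializing..."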
        return self.__dask_graph_factory__().optimize().materialize()

    def finalize_compute(self):
        return new_collection(Repartition(self.expr, 1))
this is exactly how it is done right now and I don't intend to touch that behavior now
Brief update: This is mostly working. There are still failing tests I have to track down. Most of the things I had to fix so far are somehow coupled to me adding a
This reveals stuff like dask/dask#10739
I will likely want to nuke the
Given the following two PRs, these minor changes would allow us to pass the expressions to the scheduler via pickle.
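For illustration, a hedged sketch of the pickle round-trip on an unmaterialized expression. This assumes dask-expr's from_pandas and that all of the expression's operands are picklable, which is the premise of this PR; it is not a test from the PR itself.

import pickle

import pandas as pd
import dask_expr as dx

df = dx.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)
expr = (df.x + 1).expr  # unmaterialized expression, no task graph yet

# The point of the exercise: the expression itself can travel to the scheduler
# via pickle and be optimized/materialized there.
roundtripped = pickle.loads(pickle.dumps(expr))
assert roundtripped._name == expr._name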
cc @phofl