[WIP] Allow expressions to be shipped to the scheduler #294
base: main
Conversation
def __dask_tokenize__(self):
    return self.expr._name
This is just defined as part of the collections protocol. Not sure if it is actually required
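For reference, a minimal self-contained sketch of what this hook buys us: dask.base.tokenize uses __dask_tokenize__ when it is present, so two wrappers around the same expression name produce the same token. MyCollection is a made-up stand-in, not part of this PR.

from dask.base import tokenize

class MyCollection:
    """Made-up stand-in for a dask-expr collection, to illustrate the hook."""

    def __init__(self, name):
        self._name = name

    def __dask_tokenize__(self):
        # tokenize() picks this up when defined, so the token is the
        # deterministic expression name instead of a hash of the object state
        return self._name

assert tokenize(MyCollection("sum-abc")) == tokenize(MyCollection("sum-abc"))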
def __dask_keys__(self):
    out = self.expr
    out = out.lower_completely()
    return out.__dask_keys__()
Having keys defined at the collection level feels like an abstraction leak. Among other things, this is what threw me off for a while when implementing this the first time: I had a hard time distinguishing collections from graphs in the existing code. I find this a bit clearer now in the above PRs.
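For anyone skimming: the keys a DataFrame-like collection returns follow the usual (name, partition_index) convention, one per output partition, which is why they only exist meaningfully after lowering to a concrete expression. A tiny illustrative sketch (the name is made up):

def expected_keys(name: str, npartitions: int) -> list:
    # One (name, i) tuple per output partition; dask.array would nest these
    # lists further, one level per array dimension.
    return [(name, i) for i in range(npartitions)]

print(expected_keys("lowered-expr-abc", 3))
# [('lowered-expr-abc', 0), ('lowered-expr-abc', 1), ('lowered-expr-abc', 2)]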
dask_expr/_collection.py
def finalize_compute(self) -> FrameBase:
    from ._repartition import RepartitionToFewer

    if self.npartitions > 1:
        return new_collection(RepartitionToFewer(self.expr, 1))
    return self
This is not really new, but instead of doing the graph mutations ourselves by calling postcompute etc. in the client or in dask, we now effectively just modify the expression to describe what the end result should look like. In this case, that is just a concat of all partitions.
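To make the "concat of all partitions" point concrete with plain pandas (purely illustrative; nothing here is dask-expr API):

import pandas as pd

# "Repartition to a single partition" is, for the compute path, equivalent to
# concatenating the partition-wise results into one object.
partitions = [pd.DataFrame({"x": [0, 1]}), pd.DataFrame({"x": [2, 3]})]
final = pd.concat(partitions)
print(len(final))  # 4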
dask_expr/_collection.py
def postpersist(self, futures: dict) -> NewDaskCollection:
    return from_graph(futures, self._meta, self.divisions, self._name)
This is also not really new, but I moved away from the "method that returns a callable [...]" API to just calling the callable. I haven't checked with any downstream projects yet whether this complex callable construct is actually required. Not married to anything here.
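For contrast, a rough sketch of the two shapes of the API. The old one follows dask's current custom-collections protocol, where __dask_postpersist__ returns a rebuild callable plus extra arguments; the new one is the direct method from the diff above. Bodies are elided; this is an illustration, not the actual implementation.

class OldStyleCollection:
    # Current dask protocol: return a callable plus args; the caller later
    # invokes rebuild(futures_dict, *args) to get a new collection back.
    def __dask_postpersist__(self):
        return OldStyleCollection._rebuild, (self._meta, self.divisions, self._name)

    @staticmethod
    def _rebuild(futures: dict, meta, divisions, name):
        ...  # construct a new collection around the futures


class NewStyleCollection:
    # This PR: skip the indirection and let the collection rebuild itself.
    def postpersist(self, futures: dict):
        ...  # e.g. from_graph(futures, self._meta, self.divisions, self._name)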
dask_expr/_expr.py
@@ -699,8 +702,10 @@ def dtypes(self):
     def _meta(self):
         raise NotImplementedError()

-    def __dask_graph__(self):
+    def _materialize(self):
The underscore is probably a bit silly. Will likely change this again
dask_expr/_expr.py
from dask.typing import DaskGraph

# Note: subclassing isn't required. This is just for the prototype to have a
# check for abstractmethods but the runtime checks for duck-typing/protocol only
class Expr(DaskGraph):
As the note says, there is no need for subclassing; this is just for prototyping (or we keep it, no strong preference).
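A small self-contained illustration of the alternative, i.e. a runtime-checkable Protocol with no subclassing at all (the names here are invented for the example, not the actual dask.typing ones):

from typing import Protocol, runtime_checkable

@runtime_checkable
class GraphFactory(Protocol):
    # Invented stand-in for the protocol introduced in the referenced dask PRs
    def materialize(self) -> dict: ...

class MyExpr:
    # No inheritance from GraphFactory needed
    def materialize(self) -> dict:
        return {}

assert isinstance(MyExpr(), GraphFactory)  # duck-typed structural check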
CI is obviously failing because I haven't set up CI to run against dask / distributed. I'll clean this up soon on the other PRs as well.
There is only one caveat so far: Since we're delaying materialization until the graph is on the scheduler and since we don't have a concept of
This has the shortcoming that we will not be performing a "multi-collections" optimization and we will not order the entire graph at once. We will still deduplicate keys that are in both collections. I find this approach a decent compromise to get started. If we notice this to be a severe problem we can think about a

Edit: The fact that we're calling dask.order on the two above collections separately may cause the performance of the combined operation to be sensitive to ordering. I do suspect, though, that this is mostly an academic problem and real-world use cases should not be impacted. This requires a bit of testing.
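A self-contained toy example of the "deduplicate keys that are in both collections" part: even if each collection is optimized and ordered on its own, merging the two materialized graphs by key collapses shared tasks. Task contents here are arbitrary.

def load_chunk(i):
    return list(range(i, i + 3))

# Two materialized graphs that happen to share the ("load", 0) task.
graph_a = {("load", 0): (load_chunk, 0), ("sum", 0): (sum, ("load", 0))}
graph_b = {("load", 0): (load_chunk, 0), ("max", 0): (max, ("load", 0))}

merged = {**graph_a, **graph_b}
assert len(merged) == 3  # the shared task appears only once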
What if we made a
dask_expr/_collection.py
     @property
     def dask(self):
-        return self.__dask_graph__()
+        # FIXME: This is highly problematic. Defining this as a property can
+        # cause very unfortunate materializations. Even a mere hasattr(obj,
+        # "dask") check already triggers this since it's a property, not even a
+        # method.
+        return self.__dask_graph_factory__().optimize().materialize()
What happens if we remove it completely? I guess some bits in xarray and pint will break, but there will be work to be done in xarray/pint anyway, so I don't see a problem.
        return self.__dask_graph_factory__().optimize().materialize()

    def finalize_compute(self):
        return new_collection(Repartition(self.expr, 1))
This is very problematic in the fairly common use case where the client mounts a lot more memory than a single worker. This forces the whole object to be unnecessarily collected onto a single worker and then sent to the client, whereas we could just have the client fetch separate partitions from separate workers (which may or may not happen all at once if it needs to transit through the scheduler).
This replicates the issue with the current finalizer methods in dask/dask, which are created by dask.compute(df) but are skipped by df.compute().
Memory considerations aside, bouncing through a single worker instead of collecting it on the client directly is adding latency.
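A hedged sketch of the alternative being described here, written against today's distributed API and classic dask.dataframe purely for illustration: persist the collection, then gather the individual partition futures straight to the client and concatenate there, rather than funnelling everything through one worker first.

import pandas as pd
import dask.dataframe as dd
from distributed import Client
from distributed.client import futures_of

client = Client(processes=False)  # in-process cluster, just for the sketch
df = dd.from_pandas(pd.DataFrame({"x": range(100)}), npartitions=4)

persisted = client.persist(df)
parts = client.gather(futures_of(persisted))  # one pandas object per partition
result = pd.concat(parts)                     # assembled on the client, not on a worker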
this is exactly how it is done right now and I don't intend to touch that behavior now
had some staged comments...
dask_expr/_expr.py
class Tuple(Expr):
    def __getitem__(self, other):
        return self.operands[other]

    def _layer(self) -> dict:
        return toolz.merge(op._layer() for op in self.operands)

    def __dask_output_keys__(self) -> list:
        return list(flatten(op.__dask_output_keys__() for op in self.operands))

    def __len__(self):
        return len(self.operands)

    def __iter__(self):
        return iter(self.operands)
From what I can tell, the tuple works out well enough for our purpose here.
@classmethod
def combine_factories(cls, *exprs: Expr, **kwargs) -> Expr:
    return Tuple(*exprs)
This is mostly syntactic sugar and I don't know if I want to keep it. I see a way forward to just move HLGs and old-style collections to the new protocol and nuke a lot of compat code; in that case, such a hook would be useful. For now, you can ignore this.
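If it stays, usage might look roughly like the following. This is purely hypothetical pseudo-usage, not code from the PR; df1 and df2 stand for any two dask-expr collections.

# Hypothetical: dask.compute(df1, df2) could hand both expression factories to
# the hook instead of merging already-materialized graphs.
combined = Expr.combine_factories(df1.expr, df2.expr)  # -> Tuple(df1.expr, df2.expr)
layer = combined._layer()                  # merged layers of both expressions
keys = combined.__dask_output_keys__()     # output keys of both operands, flattened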
dask_expr/_collection.py
     def dask(self):
-        return self.__dask_graph__()
+        # FIXME: This is highly problematic. Defining this as a property can
+        # cause very unfortunate materializations. Even a mere hasattr(obj,
+        # "dask") check already triggers this since it's a property, not even a
+        # method.
+        return self.__dask_graph_factory__().optimize().materialize()
I guess this is the most controversial thing. I would actually like to throw this out since it triggers materialization soooo easily. Even an attribute lookup is sufficient to do this.
FWIW I encountered similar problems with the divisions property, which triggers not only materialization but quantile computation!
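A tiny self-contained demo of why a property is so easy to trigger accidentally (nothing dask-specific here):

class Demo:
    @property
    def dask(self):
        print("materializing...")  # stands in for an expensive graph materialization
        return {}

# Even hasattr() evaluates the property, so a mere capability check pays the
# full materialization cost.
hasattr(Demo(), "dask")  # prints "materializing..."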
        return self.__dask_graph_factory__().optimize().materialize()

    def finalize_compute(self):
        return new_collection(Repartition(self.expr, 1))
this is exactly how it is done right now and I don't intend to touch that behavior now
Brief update: This is mostly working. There are still failing tests I have to track down. Most of the things I had to fix so far are somehow coupled to me adding a
This reveals stuff like dask/dask#10739
I will likely want to nuke the
Given the following two PRs, these minor changes would allow us to pass the expressions to the scheduler via pickle.
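For illustration, a hedged sketch of the pickle round-trip on an unmaterialized expression. This assumes dask-expr's from_pandas and that all of the expression's operands are picklable, which is the premise of this PR; it is not a test from the PR itself.

import pickle

import pandas as pd
import dask_expr as dx

df = dx.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)
expr = (df.x + 1).expr  # unmaterialized expression, no task graph yet

# The point of the exercise: the expression itself can travel to the scheduler
# via pickle and be optimized/materialized there.
roundtripped = pickle.loads(pickle.dumps(expr))
assert roundtripped._name == expr._name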
cc @phofl