Expr as singleton #798

Merged: 8 commits merged into dask:main on Feb 2, 2024

Conversation

@fjetter (Member) commented Jan 23, 2024

Closes #796
Closes #800

  1. The attribute check in Expr.rewrite is expensive since it checks for an element in a list. Commit cae4df1 removes that check and simply ensures that the two rewrites that are currently used exist. I think this would be a sensible change, but a proper PR should do this in a clean fashion. This removes all issues with the rewrite stack in my testing.
  2. This one is a little more controversial. I'm ensuring that Expr objects are in fact singletons by implementing a __new__ method, which guarantees that there can only ever be one Expr with a given _name (which is based on tokenization); see the sketch right after this list.
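
A minimal sketch of the idea, assuming a hypothetical MyExpr class and dask's tokenize as the hashing function (this is not the actual dask-expr implementation): __new__ computes the tokenized name first and returns an already-existing instance from a weak-value registry when one is alive.

    import weakref

    from dask.base import tokenize


    class MyExpr:
        # hypothetical stand-in for Expr: at most one live instance per _name
        _instances = weakref.WeakValueDictionary()

        def __new__(cls, *operands):
            name = f"{cls.__name__.lower()}-{tokenize(*operands)}"
            try:
                return cls._instances[name]
            except KeyError:
                inst = super().__new__(cls)
                inst.operands = list(operands)
                inst._name = name
                cls._instances[name] = inst
                return inst


    a = MyExpr(1, 2)
    b = MyExpr(1, 2)
    assert a is b  # identical operands tokenize to the same name -> same object

Note that if such a class also defines __init__, Python still calls it on the returned existing instance, so a real implementation needs to guard against re-initialization.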

A couple of observations

  1. The performance of optimize improves by roughly an order of magnitude, down from 10-12s to 1.12s on my machine.
  2. There are two tests that fail with this. I strongly suspect that this is due to improper tokenization, since I haven't finished dask#10799 (non-deterministic tokenization for empty numpy arrays after a pickle roundtrip) yet. I will use this test case to root out another broken token...
  3. 1.12s is about as good as it gets with this, but it still feels way too slow for such a small example... (I'll post more detailed measurements shortly)

@fjetter changed the title from "Expr as singleton + Ensure rewrite methods always exist" to "[DNM] Expr as singleton + Ensure rewrite methods always exist" on Jan 23, 2024
@mrocklin (Member):

In SymPy we switched to using __new__ for performance reasons. There's good prior art here I think.

Referenced code:

    if self._required_attribute:
        dep = next(iter(self.dependencies()))._meta
        if not hasattr(dep, self._required_attribute):
            ...
    _dependencies = [operand for operand in operands if isinstance(operand, Expr)]
Member:

Why not make this a cached property? I think that ideally we have as little computed state as possible.
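
For illustration, a minimal sketch of the cached-property variant (hypothetical Node class, not the actual Expr code):

    from functools import cached_property


    class Node:
        def __init__(self, *operands):
            self.operands = list(operands)

        @cached_property
        def _dependencies(self):
            # computed lazily on first access, then memoized on the instance
            return [op for op in self.operands if isinstance(op, Node)]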

Member:

Agree that we shouldn't need to "attach" _dependencies to the Expr class here, but that information is being used to check _required_attribute below anyway.

I don't think we really need the _required_attribute check anymore if we update the Expr.__getattr__ to avoid throwing an AttributeError when any metadata calculation (._meta, ._meta_chunk, etc.) fails. I'll submit a simple PR to strip out the _required_attribute stuff.
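
Background on why that works: when a property raises AttributeError, Python falls through to __getattr__, so a genuine failure inside ._meta can be mistaken for a missing attribute. A minimal illustration with hypothetical names (not the Expr code):

    class Node:
        @property
        def _meta(self):
            # a genuine failure inside metadata computation
            raise AttributeError("no datetime accessor for this dtype")

        def __getattr__(self, key):
            # reached for truly missing attributes *and* when a property above
            # raised AttributeError; re-raising as RuntimeError keeps the two
            # cases distinguishable
            raise RuntimeError(f"failed to resolve {key!r}")


    try:
        Node()._meta
    except RuntimeError as err:
        print(err)  # failed to resolve '_meta'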

Member Author:

I was just lazy and don't really care what we do with _dependencies. Personally, I consider the cached property an indirection and a little more difficult to grasp but I don't have strong opinions here. We have to compute it either way at this point so it's just a matter of code organization.

Getting rid of the _required_attribute check would of course simplify this quite a bit

Member:

This PR removes _required_attribute: #799

(not that removing that attribute solves anything here - just nice to remove unnecessary code)

@fjetter (Member Author) commented Jan 23, 2024

Pushed a version of the caching as I proposed in #797 (comment), but I'm using an attribute instead of a list with the key. With this, the TPC-H Q1 optimization is down to 200ms.

Edit: removed that again since the caching is not straightforward; see #798 (comment) for a discussion.

dask_expr/_core.py (outdated review thread, resolved)
@fjetter (Member Author) commented Jan 24, 2024

Well, the singleton approach as suggested here will clash pretty hard with the way caching is implemented on the ReadParquet expression.

In essence, the tokenization of the expression class does not reliably tie the expression to the dataset to be read (the path is not sufficient; a modified timestamp or ETag should be part of it as well). If we reuse instances, this will fail hard as soon as the dataset is mutated.
To a lesser extent, the same is of course true for all IO layers, but it is particularly bad for parquet since the cached properties there are also backed by a global cache.

@rjzamora (Member) commented Jan 24, 2024

> In essence, the tokenization of the expression class does not reliably tie the expression to the dataset to be read (the path is not sufficient; a modified timestamp or ETag should be part of it as well). If we reuse instances, this will fail hard as soon as the dataset is mutated.

Ah, interesting. The global caching component of ReadParquet is actually not a problem at all. The problem is entirely caused by the cached _dataset_info property. The global cache is always cleared when to_parquet is called, but the object's local _dataset_info cache is not cleared. (Sorry if this is obvious to you, just sharing my understanding.)

EDIT: I guess the global cache is only cleared if overwrite is specified. My mistake

@fjetter changed the title from "[DNM] Expr as singleton + Ensure rewrite methods always exist" to "Expr as singleton" on Feb 1, 2024
Comment on lines +546 to +551
    for file in files_for_checksum:
        # The checksum / file info is usually already cached by the fsspec
        # FileSystem dir_cache since this info was already asked for in
        # _collect_dataset_info
        checksum.append(fs.checksum(file))
    dataset_info["checksum"] = tokenize(checksum)
Member Author:

To deal with the cache consistency problem described in #800, I am calculating a checksum here. For S3 this falls back to using the ETag provided in the listdir response. This should not add any overhead since this information is already cached by fsspec.
We take either the checksum of the metadata file or of all files that we iterate over. At this point the listdir operation has already happened, so the checksum identifies every dataset uniquely. Since the checksum is added to the dataset_info, it also guarantees that the cached plan is invalidated if the dataset changes.
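
For reference, a rough standalone sketch of deriving such a token with fsspec (the path is a placeholder; the actual PR code lives inside the parquet dataset-info collection):

    import fsspec
    from dask.base import tokenize

    fs = fsspec.filesystem("file")  # on s3, fs.checksum is derived from the ETag
    files = fs.ls("/tmp/dataset")   # placeholder path; listing warms the dir_cache
    checksum_token = tokenize([fs.checksum(f) for f in files])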

@@ -432,6 +416,7 @@ class ReadParquet(PartitionsFiltered, BlockwiseIO):
         "kwargs": None,
         "_partitions": None,
         "_series": False,
+        "_dataset_info_cache": list,
Member Author:

The new concept here is that I am moving off a global cache. The dataset_info is always calculated whenever a user calls read_parquet(foo) and is therefore always an accurate representation of the dataset at the time of the call.
This dataset_info is cached in this parameter. I am choosing a list as a container, but this could be anything; I could also just set the operand and mutate the expression in place.

The benefit of using a parameter for this cache is that the cache naturally propagates to all derived instances, e.g. whenever we rewrite the expression using Expr.substitute_parameters. This allows us to maintain the cache during optimization and ties the lifetime of the cache to the lifetime of the expression ancestry, removing any need for us to ever invalidate the cache.
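
A rough sketch of why an operand-backed cache propagates, using hypothetical classes rather than the real ReadParquet logic: the same mutable list object travels with the operands into every derived expression, so a value stored once remains visible across rewrites.

    class Node:
        def __init__(self, path, _dataset_info_cache=None):
            self.path = path
            # list-typed operand used purely as a cache container
            self._dataset_info_cache = (
                [] if _dataset_info_cache is None else _dataset_info_cache
            )

        @property
        def _dataset_info(self):
            if not self._dataset_info_cache:
                # the expensive metadata collection happens only once
                self._dataset_info_cache.append({"path": self.path})
            return self._dataset_info_cache[0]

        def substitute_parameters(self, **changes):
            # a derived expression reuses the *same* list object, so the cache
            # follows the expression through optimizer rewrites
            return type(self)(
                changes.get("path", self.path), self._dataset_info_cache
            )


    a = Node("s3://bucket/data")
    _ = a._dataset_info
    b = a.substitute_parameters()
    assert b._dataset_info_cache is a._dataset_info_cache  # shared across ancestry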

Comment on lines +462 to +467
    def _name(self):
        return (
            funcname(type(self)).lower()
            + "-"
            + _tokenize_deterministic(self.checksum, *self.operands)
        )
Member Author:

This checksum is part of the _name, allowing us to differentiate expressions that point to modified states of the dataset. It also allows us to reuse already-cached plans/divisions if the dataset did not change, which is the most common case.

Comment on lines 470 to 476
    def checksum(self):
        return self._dataset_info["checksum"]

    @property
    def _dataset_info(self):
        if rv := self.operand("_dataset_info_cache"):
            return rv[0]
Member Author:

When the ReadParquet expression is first initialized during a read_parquet call, this cache is empty and we fetch the dataset_info essentially at construction time of the expression object.

Subsequent derived expressions inherit the cache and just access it, which makes the __new__ call instantaneous.

@fjetter fjetter marked this pull request as ready for review February 1, 2024 11:24
@fjetter (Member Author) commented Feb 1, 2024

Interesting, the dask/dask tests are running into tokenization issues while the dask-expr tests are all green. I will have to dig into this a little.

@fjetter (Member Author) commented Feb 1, 2024

ok, this one requires two minor fixes in dask/dask, see dask/dask#10880

@fjetter (Member Author) commented Feb 1, 2024

Well, the performance is better but still not where I'd like it to be

[benchmark charts]

We have a very consistent improvement with this PR but some of the queries appear to still scratch the 1s mark.

The reason for this is that the current branch uses a weak-value dictionary to determine whether an object already exists, but it looks like the expr objects are going out of scope too quickly, so instances are in fact being recreated frequently. Looking at Q1, the main branch creates about 20k new expression instances, and with this branch I can halve that. However, over the lifetime of the optimization there are only about 500 unique expressions.
Switching from a weak-value dictionary to something more persistent is currently not easily possible either, since the collect_dependents mechanism relies on the weakref semantics.
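
The lifetime issue can be reproduced in isolation. A generic illustration (not dask-expr code): a WeakValueDictionary entry vanishes as soon as the last strong reference to the value is gone, which is exactly what happens to short-lived intermediate expressions.

    import weakref


    class Node:  # stand-in for an expression object
        pass


    registry = weakref.WeakValueDictionary()

    registry["tok"] = Node()   # no strong reference survives this statement
    print("tok" in registry)   # False on CPython: the value was collected, so
                               # the next construction recreates the instance

    keep = Node()
    registry["tok"] = keep     # a strong reference keeps the entry alive
    print("tok" in registry)   # True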

@phofl (Collaborator) commented Feb 1, 2024

We should be able to remove the weak refs if we keep track of the "number of references" ourselves on top of this PR

@fjetter (Member Author) commented Feb 1, 2024

If we combine this approach with #831 we'll get down further

[benchmark chart]

(higher is better here)

[benchmark chart]

The one thing that is not budging is Q21. Looking into that, it appears that there is an insane amount of substitution happening during blockwise fusion
[screenshot]

@phofl (Collaborator) commented Feb 1, 2024

Q21 is slow because the 3 filters blow up the number of expressions, similar to what assign did in merge in the past. I am almost done with the filter pushdown in merges, which will remove those problems entirely.

@fjetter (Member Author) commented Feb 1, 2024

With both changes, Q1 optimize is about as fast as it can get, at around 500ms. Further speedup is possible by improving tokenization, simplifying the attribute lookup for Expr, and speeding up convergence of the optimizer overall to require fewer steps. Even the meta computation is rather negligible at this point. I think 500ms is fine (for now), but I'd like to understand the substitute thing a little better.

@fjetter (Member Author) commented Feb 1, 2024

> Q21 is slow because the 3 filters blow up the number of expressions, similar to what assign did in merge in the past. I am almost done with the filter pushdown in merges, which will remove those problems entirely.

There are around ~700 unique expressions in that query during optimization. That's not that much (compared to ~500 for Q1)

@fjetter (Member Author) commented Feb 2, 2024

OK, so I looked into the Q21 performance. It has nothing to do with caching but is an algorithmic problem in the recursion of substitute. I documented this in #835.

@@ -36,7 +36,7 @@ class FromGraph(IO):
     conversion from legacy dataframes.
     """

-    _parameters = ["layer", "_meta", "divisions", "_name"]
+    _parameters = ["layer", "_meta", "divisions", "keys", "name_prefix"]
Member Author:

The FromGraph thing got a little odd. For context, this expression is used in two ways:

  1. As a wrapper for futures when persist is called
  2. As a wrapper for legacy dataframes

The previous implementation accepted the _name as an input argument. For the persisted dataframe, this was the name of the original expression. For the wrapped one, it is the name of the legacy dataframe.

The problem is that the legacy dataframe does not have the same strict uniqueness guarantees as the expressions do, so setting duplicate names is very easy. In fact, our tests were doing just that! (and still are). This caused Expr.__new__ to deduplicate and effectively ignore the second dataframe... oops.

Persist is also a little odd: if the FromGraph expression inherits the exact name of its ancestor, there now exist two expressions of different types with the same name.

Effectively, with this chosen singleton implementation, setting the _name explicitly instead of calculating it using a hash is a cardinal sin and will cause all sorts of weird things to happen.

Now, this can be fixed, but that has a rather big caveat: if I have to redefine the name of the expression, I actually also have to rewrite the graph! Many places in dask-expr are (imo incorrectly) assuming that the output key names of a dataframe layer/expression are universally built as (df._name, i) and are hard-coding this when implementing their own layer (instead of iterating over i, iterating over df.__dask_keys__() would maintain the abstraction). This rewrite effectively adds another layer of keys. In practice this is really ugly, since when computing something on top of a persisted dataframe, there will always be this dummy key in between.

Alternatively, I could make the singleton deduplication type-aware to give the FromGraph thing an excuse to overwrite the name. However, if we truly stick with singletons that are based on the name, I would prefer the name to actually be unique, which requires all implementations to stop hard-coding the keys of another expression/dataframe and to iterate properly over __dask_keys__.
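
To illustrate the abstraction being asked for here, a toy sketch with hypothetical Upstream/Downstream classes (not actual dask-expr layers): building the task dictionary by iterating over the upstream expression's __dask_keys__() keeps working even when those keys are not literally (frame._name, i).

    def increment(x):
        return x + 1


    class Upstream:
        # hypothetical expression whose output keys are aliased, i.e. they are
        # NOT simply (self._name, i)
        _name = "upstream"

        def __dask_keys__(self):
            return [("alias", 0), ("alias", 1)]


    class Downstream:
        _name = "downstream"

        def __init__(self, frame):
            self.frame = frame

        def _layer(self):
            # robust: ask the upstream expression for its actual output keys
            # instead of hard-coding (self.frame._name, i)
            return {
                (self._name, i): (increment, key)
                for i, key in enumerate(self.frame.__dask_keys__())
            }


    print(Downstream(Upstream())._layer())
    # {('downstream', 0): (increment, ('alias', 0)),
    #  ('downstream', 1): (increment, ('alias', 1))}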

@phofl (Collaborator) commented Feb 2, 2024

> Now, this can be fixed, but that has a rather big caveat: if I have to redefine the name of the expression, I actually also have to rewrite the graph! Many places in dask-expr are (imo incorrectly) assuming that the output key names of a dataframe layer/expression are universally built as (df._name, i) and are hard-coding this when implementing their own layer (instead of iterating over i, iterating over df.__dask_keys__() would maintain the abstraction). This rewrite effectively adds another layer of keys. In practice this is really ugly, since when computing something on top of a persisted dataframe, there will always be this dummy key in between.

This is a good point, we should fix this instead of relying on df._name and i

Member Author:

I will investigate whether it is possible to just use __dask_keys__ everywhere, but I'd prefer doing this in a follow-up.

Collaborator:

Yes totally agree, this should definitely be a follow up (also doesn't have to be you, I could pick this up as well)

Member Author:

I got started on this, see c3c01ed.

That commit replaces all occurrences that match \((\w+(_\w+)?)(?<!self)\._name, but something still appears to be missing. It is possible, and not as ugly as I thought it would be.

Comment on lines 54 to +60
     def _layer(self):
-        return dict(self.operand("layer"))
+        dsk = dict(self.operand("layer"))
+        # The name may not actually match the layer's name, therefore rewrite
+        # this using an alias
+        for part, k in enumerate(self.operand("keys")):
+            dsk[(self._name, part)] = k
+        return dsk
Member Author:

The intuitive fix for me would have been to override __dask_keys__ instead so that it points to the appropriate keys, but as I explained above, the key structure behind Expr.__dask_keys__ is hard-coded in many places.

@fjetter (Member Author) commented Feb 2, 2024

I opened #838, which points the dask/dask tests to dask/dask#10880 so that we can see whether CI is happy.

@fjetter (Member Author) commented Feb 2, 2024

With #838 pointing to the fixed dask/dask, CI is green.

@phofl phofl merged commit 8f99b0b into dask:main Feb 2, 2024
5 of 7 checks passed
@phofl (Collaborator) commented Feb 2, 2024

thx @fjetter

Issues that may be closed by this pull request: Parquet Dataset cache not reliable; [Bug] Optimization is now much slower in TPCh benchmarks