Expr as singleton #798
Conversation
Force-pushed from 9645a65 to cae4df1
In SymPy we switched to using …
dask_expr/_core.py (Outdated)
if self._required_attribute:
    dep = next(iter(self.dependencies()))._meta
    if not hasattr(dep, self._required_attribute):
_dependencies = [operand for operand in operands if isinstance(operand, Expr)]
Why not make this a cached property? I think that ideally we have as little computed state as possible.
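For illustration, the cached-property variant being suggested could look roughly like this (a sketch only, not the PR's code; it assumes operands is set during construction):

```python
import functools

class Expr:
    def __init__(self, *operands):
        self.operands = list(operands)

    @functools.cached_property
    def _dependencies(self):
        # Computed lazily on first access, then memoized on the instance
        return [op for op in self.operands if isinstance(op, Expr)]
```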
Agree that we shouldn't need to "attach" _dependencies to the Expr class here, but that information is being used to check _required_attribute below anyway.
I don't think we really need the _required_attribute check anymore if we update Expr.__getattr__ to avoid throwing an AttributeError when any metadata calculation (._meta, ._meta_chunk, etc.) fails. I'll submit a simple PR to strip out the _required_attribute stuff.
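One hedged sketch of what that could look like (not the actual follow-up PR): wrap metadata computations so an internal AttributeError is re-raised as a different exception type, which keeps Python's __getattr__ fallback from swallowing the real error and reporting a misleading missing-attribute message.

```python
import functools

def meta_property(func):
    """Hypothetical decorator: expose a metadata computation as a property
    without letting an internal AttributeError trigger __getattr__."""
    @functools.wraps(func)
    def wrapper(self):
        try:
            return func(self)
        except AttributeError as err:
            # Anything other than AttributeError bypasses the __getattr__ fallback
            raise RuntimeError(f"{func.__name__} failed: {err}") from err
    return property(wrapper)
```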
I was just lazy and don't really care what we do with _dependencies. Personally, I consider the cached property an indirection and a little more difficult to grasp, but I don't have strong opinions here. We have to compute it either way at this point, so it's just a matter of code organization.
Getting rid of the _required_attribute check would of course simplify this quite a bit.
This PR removes _required_attribute: #799
(not that removing that attribute solves anything here - just nice to remove unnecessary code)
Pushed a version of the caching as I proposed in #797 (comment), but I'm using an attribute instead of a list with the key. With this, the TPC Q1 optimization is down to 200ms. Edit: removed that again since the caching is not straightforward; see #798 (comment) for a discussion.
Force-pushed from cc1470c to 19bede2
Force-pushed from 19bede2 to cae4df1
Well, the singleton approach as suggested here will clash pretty hard with the way caching is implemented on the ReadParquet expression. In essence, the tokenization of the expression class does not reliably tie the expression to the dataset to be read (the path is not sufficient; a modified timestamp or ETag should be part of this as well). If we reuse instances, this will fail hard as soon as the dataset is mutated.
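A toy illustration of that point (hypothetical helper, not dask-expr code): the token needs to incorporate something like an mtime or ETag, which fs.checksum typically derives from the listing info, so that mutating the dataset changes the token.

```python
from dask.base import tokenize
from fsspec.core import get_fs_token_paths

def dataset_token(urlpath, **storage_options):
    """Token that changes when the files change, not just when the path does."""
    fs, _, paths = get_fs_token_paths(urlpath, storage_options=storage_options)
    # fs.checksum usually falls back to mtime/size or the ETag from the listing
    return tokenize(paths, [fs.checksum(p) for p in paths])
```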
Ah, interesting. The global caching component of … EDIT: I guess the global cache is only cleared if …
Force-pushed from b90badf to a24c6c9
    for file in files_for_checksum:
        # The checksum / file info is usually already cached by the fsspec
        # FileSystem dir_cache since this info was already asked for in
        # _collect_dataset_info
        checksum.append(fs.checksum(file))
    dataset_info["checksum"] = tokenize(checksum)
To deal with the cache consistency problem described in #800 I am calculating a checksum here. For S3 this falls back to using the ETag provided in the listdir response. This should not add any overhead since this information is already cached by fsspec.
We're either taking the checksum of the metadata file or of all files that we iterate over. At this point the listdir operation is already done, so the checksum identifies every dataset uniquely. Since this checksum is added to the dataset_info, it also guarantees that the cache for the plan is invalidated if the dataset changes.
dask_expr/io/parquet.py (Outdated)
@@ -432,6 +416,7 @@ class ReadParquet(PartitionsFiltered, BlockwiseIO):
         "kwargs": None,
         "_partitions": None,
         "_series": False,
+        "_dataset_info_cache": list,
The new concept here is that I am moving off a global cache. The dataset_info is always calculated whenever a user calls read_parquet(foo), so the user always receives an accurate representation of the dataset at the time of the call.
This dataset_info is cached in this parameter. I am choosing a list as a container, but this could be anything. I could also just set the operand and mutate the expression in place.
The benefit of using a parameter for this cache is that the cache will naturally propagate to all derived instances, e.g. whenever we rewrite the expression using Expr.substitute_parameters. This allows us to maintain the cache during optimization, and it ties the lifetime of the cache to the lifetime of the expression ancestry, removing any need for us to ever invalidate the cache.
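A self-contained toy version of that pattern (simplified names, not the actual ReadParquet code): because the mutable cache container is itself an operand, any derived expression constructed with the same operands, e.g. via substitute_parameters, shares the cache and skips the expensive step.

```python
class ReadSomething:
    """Toy stand-in for a ReadParquet-like expression (illustrative only)."""

    _parameters = ["path", "_dataset_info_cache"]

    def __init__(self, path, _dataset_info_cache=None):
        cache = [] if _dataset_info_cache is None else _dataset_info_cache
        self.operands = [path, cache]

    def operand(self, key):
        return self.operands[self._parameters.index(key)]

    @property
    def _dataset_info(self):
        cache = self.operand("_dataset_info_cache")
        if cache:
            return cache[0]  # already populated by an ancestor expression
        info = {"files": ["..."]}  # stand-in for the expensive dataset scan
        cache.append(info)  # mutate in place; derived instances see it too
        return info
```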
    def _name(self):
        return (
            funcname(type(self)).lower()
            + "-"
            + _tokenize_deterministic(self.checksum, *self.operands)
        )
This checksum is part of the _name, allowing us to differentiate expressions that point to modified states of the dataset. It also allows us to reuse already cached "plans / divisions" if the dataset did not change, which is the most common case.
dask_expr/io/parquet.py (Outdated)
    def checksum(self):
        return self._dataset_info["checksum"]

    @property
    def _dataset_info(self):
        if rv := self.operand("_dataset_info_cache"):
            return rv[0]
When the ReadParquet expression is first initialized during a read_parquet call, this cache is empty and we'll fetch the dataset_info essentially at construction time of the expression object.
Subsequent derived expressions that inherit the cache will just access it, making the __new__ call instantaneous.
Interesting, the dask/dask tests are running into tokenization issues while the dask-expr tests are all green. Will have to dig into this a little.
OK, this one requires two minor fixes in dask/dask, see dask/dask#10880.
Well, the performance is better but still not where I'd like it to be. We have a very consistent improvement with this PR, but some of the queries appear to still scratch the 1s mark. The reason for this is that the current branch uses a WeakValueDictionary to determine whether an object already exists, but it looks like the expr objects are going out of scope too quickly such that the instances are indeed being recreated frequently. Looking at Q1, the main branch creates about 20k new expression instances and with this branch I can halve this. However, over the lifetime of the optimization there are only about 500 unique expressions.
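For reference, a hedged sketch of the name-based deduplication being discussed (simplified; the real implementation differs in its details): a WeakValueDictionary keyed by the tokenized name returns an existing instance only while one is still alive, which is exactly why instances that fall out of scope get recreated.

```python
import weakref

from dask.base import tokenize

class Expr:
    _instances = weakref.WeakValueDictionary()

    def __new__(cls, *operands):
        name = f"{cls.__name__.lower()}-{tokenize(*operands)}"
        existing = cls._instances.get(name)
        if existing is not None:
            return existing  # reuse the live instance for this name
        inst = super().__new__(cls)
        inst.operands = list(operands)
        inst._name = name
        cls._instances[name] = inst
        return inst
```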
We should be able to remove the weak refs if we keep track of the "number of references" ourselves on top of this PR.
If we combine this approach with #831 we'll get down further. The one thing that is not budging is Q21. Looking into that, it appears that there is an insane amount of substitution happening during blockwise fusion.
Q21 is because the 3 filters blow up the number of expressions, similar to what assign did in merge in the past. I am almost done with the filter pushdown in merges, which will remove those problems entirely.
With both changes, Q1 optimize is about as fast as it can get at around 500ms. Further speedup is possible by improving tokenization, simplifying the attribute lookup for Expr, and speeding up convergence of the optimizer overall to require fewer steps. Even the meta computation is rather negligible at this point. I think 500ms is fine (for now) but I'd like to understand the substitute thing a little better.
There are around 700 unique expressions in that query during optimization. That's not that much (compared to ~500 for Q1).
OK, so I looked into the Q21 performance, which has nothing to do with caching but is an algorithmic problem in the recursion of substitute. I documented this in #835.
@@ -36,7 +36,7 @@ class FromGraph(IO):
     conversion from legacy dataframes.
     """

-    _parameters = ["layer", "_meta", "divisions", "_name"]
+    _parameters = ["layer", "_meta", "divisions", "keys", "name_prefix"]
The FromGraph thing got a little odd. For context, this thing is used in two ways:

- As a wrapper for futures when persist is called
- As a wrapper for legacy dataframes

The previous implementation accepted the _name as an input argument. For the persisted dataframe, this was the name of the original expression. For the wrapped one, it is the name of the legacy dataframe.

The problem is that the legacy dataframe does not have the same strict uniqueness guarantees as the expressions do, so setting duplicate names is very easy. In fact, our tests were doing just that! (and still are). This caused Expr.__new__ to deduplicate and effectively ignore the second dataframe... oops.

For persist it is also a little odd: if the FromGraph expression inherits the exact name of its ancestor, there now exist two expressions of a different type with the same name.

Effectively, with this chosen singleton implementation, setting the _name explicitly instead of calculating it using a hash is a cardinal sin and will cause all sorts of weird things to happen.

Now, this can be fixed but that has a rather big caveat. If I have to redefine the name of the expression, I actually also have to rewrite the graph! Many places in dask-expr are (imo incorrectly) assuming that output key names of a dataframe layer/expression are universally built as (df._name, i) and are hard coding this when implementing their own layer (instead of iterating over i, iterating over df.__dask_keys__() would maintain the abstraction). This rewrite adds effectively another layer of keys. In reality this is really ugly since when computing something on top of a persisted dataframe, there will always be this dummy key in between.

Alternatively, I could make the singleton deduplication type aware to give the FromGraph thing an excuse to overwrite the name. However, if we truly stick with singletons that are based on the name, I would prefer the name to actually be unique, which would require all implementations to stop hard coding keys of another expression/dataframe and to iterate properly over __dask_keys__.
> Now, this can be fixed but that has a rather big caveat. If I have to redefine the name of the expression, I actually also have to rewrite the graph! Many places in dask-expr are (imo incorrectly) assuming that output key names of a dataframe layer/expression are universally built as (df._name, i) and are hard coding this when implementing their own layer (instead of iterating over i, iterating over df.__dask_keys__() would maintain the abstraction). This rewrite adds effectively another layer of keys. In reality this is really ugly since when computing something on top of a persisted dataframe, there will always be this dummy key in between.

This is a good point, we should fix this instead of relying on df._name and i.
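A small before/after sketch of that fix in a layer implementation (hypothetical names; some_task stands for whatever the layer computes per partition):

```python
def _layer(self):
    # Fragile: hard-codes the child's key structure as (frame._name, i)
    # return {
    #     (self._name, i): (some_task, (self.frame._name, i))
    #     for i in range(self.frame.npartitions)
    # }

    # Robust: iterate over the keys the child expression actually exposes
    return {
        (self._name, i): (some_task, key)
        for i, key in enumerate(self.frame.__dask_keys__())
    }
```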
I will investigate if it is possible to just use __dask_keys__ everywhere but I'd prefer doing this in a follow up.
Yes totally agree, this should definitely be a follow up (also doesn't have to be you, I could pick this up as well)
I got started on this, see
That commit replaces all occurrences that match \((\w+(_\w+)?)(?<!self)\._name, but something appears to be still missing. It is possible and not as ugly as I thought it would be.
    def _layer(self):
-        return dict(self.operand("layer"))
+        dsk = dict(self.operand("layer"))
+        # The name may not actually match the layer's name, therefore rewrite
+        # this using an alias
+        for part, k in enumerate(self.operand("keys")):
+            dsk[(self._name, part)] = k
+        return dsk
The intuitive fix for me would've been to overwrite __dask_keys__ instead to point to the appropriate keys, but as I explained above, the implementation of Expr.__dask_keys__ is hard coded in many places.
I opened #838 which points the dask/dask tests to dask/dask#10880 such that we can see if CI is happy.
CI when pointing to dask/dask with the fixes (#838) is green.
thx @fjetter
Closes #796
Closes #800

- Expr.rewrite is expensive since it checks for an element in a list. This cae4df1 commit removes that check and simply ensures the two rewrites that are currently used exist. I think this would be a sensible change, but a proper PR should do this in a clean fashion. This removes all issues with the rewrite stack in my testing.
- Expr objects are in fact singletons by implementing a __new__ method. This ensures that there can only ever be one Expr with a given _name (that is based on tokenization).

A couple of observations