-
-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Some Dask-SQL tests are extremely slow with dask-expr #1055
Comments
100 depth is a pretty complex graph since we're talking about expressions here. This is much more than I would naively assume given the "simple" SQL statement above.
Without more investigation I'm -1 for introducing such a catch-all cache. Historically, most of these endless runtimes could be traced back to a minor bug or were dealt with by introducing more targeted caching. |
FWIW I cannot reproduce the above since there doesn't appear to be a valid dask-sql package for OSX ARM |
Yes, If I remember correctly, dask-sql does an excessive amount of column renaming when the SQL query is mapped onto the dask/dataframe API. There is no doubt that the same logic can be expressed in a much simpler expression graph, but I'm assuming it would be a lot of work to change that. Therefore, these are the kinds of expression graphs dask-sql needs to produce for now. @charlesbluca may have thoughts on this.
Right, the "hack" I described above is just meant as a demonstration that caching seems to mitigate whatever the underlying issue is. We are technically caching the "lowered" version of every expression in |
Update: @charlesbluca shared a dask-only reproducer, and I added it to the top-level description. I guess it makes sense that this pattern would create a bloated/repetitive graph. |
I submitted #1059 to add basic caching. It is pretty clear that down-stream libraries (like dask-sql) may produce "deep" expression graphs, and and "diamond-like" branches in this deep graph will essentially multiply the size of the graph in the absence of caching. For the python-only reproducer above, the depth of the expression graph is |
@charlesbluca shared a Dask-SQL test/snippet that seems to "hang" when the query-planning is enabled in
dask.dataframe
. It turns out the operation does eventually finish, but that graph materialization is extremely slow in dask-expr for this particular expression graph.It is certainly possible that Dask-SQL is producing an expression graph that is more complicated than necessary. However, it is definitely not complicated enough to warrant such an extreme slowdown.
Reproducer:
Original dask-sql reproducer
Other Notes:
ddf._depth()
takes about 60 s, and returns100
(so not a terribly complex graph).ddf.pprint()
also seems to "hang", so it's a bit hard to inspect the expression graph.Known "Remedies":
As far as I can tell, the graph-materialization hang mostly goes away if
Expr.lower_once
is cached. For example, everything is considerably faster when I hack in a simple caching pattern:cc @fjetter @phofl - Seems like it makes sense to cache lowering behavior. WDYT?
The text was updated successfully, but these errors were encountered: