Post-mortem: why an easy workflow was horribly non-performant, and what we could do to make it easier for users to write fast dask code #301
Comments
@crusaderky this is great. Maybe a good blogpost?
@crusaderky this is an excellent and very valuable writeup. Thanks for taking the time to do this.
We've also talked about doing column pruning automatically. This is one of the optimizations we hope would be enabled by high level expressions / revising high level graphs dask/dask#7933 cc @rjzamora. Might be worth adding that to the list.
Updated post.
We have mentioned this before, but dask-awkward has a nice example of following the metadata through many layers to be able to prune column loads. It's enabled by the rich awkward "typetracer" (a no-data array) that they implemented specifically for this purpose, and it just works through many operations where fake data or a zero-length array might not.
I'm curious about a couple of human / API design aspects of this that I think are also worth looking into. Two things we see a lot of, that don't need to be there:

There are 4 unnecessary recomputations for […]. The […] So how could we make it easier to recognize how expensive this is?

Another thought: […]

Finally, […]
I agree that using […]
This is a really nice write-up! What do people here think about caching? I agree that when computations happen can be surprising to an end user. At a minimum, workflows could avoid repeating a computation unnecessarily. We frequently point users (even experienced developers) to graphchain. Perhaps Dask would benefit from baking this in. cc @lsorber (in case you have thoughts on any of this)
It'd also make a pretty fascinating short talk for one of the dask demo days |
About the pure-python code: @gjoseph92 I wonder if it would be feasible to offer a simple flag in `coiled.Cluster`, e.g. […]
Executive summary
Today, the experience of a typical novice-to-intermediate dask.dataframe user can be very poor: a workflow that is supposedly very straightforward can result in an extremely non-performant implementation with a tendency to randomly kill off workers. At the end of this post you'll find 13 remedial actions, 10 of which can be sensibly achieved in a few weeks, which would drastically improve the user experience.
Introduction
I recently went through a demo notebook, written by a data scientist, whose purpose is to showcase `dask.dataframe` to new dask users through a real-life use case. The notebook does what I would call light data preprocessing on a 40 GiB parquet dataset of NYC taxis, with the purpose of later feeding it into a machine learning algorithm.

The first time I ran it, the notebook took 25 minutes and required hosts mounting a bare minimum of 32 GiB RAM each. After a long day of tweaking it, I brought it down to a 2 minute runtime and 8 GiB per-host RAM requirements.
The problem is this: the workflow implemented by the notebook is not rocket science. Getting to a performant implementation is something you would expect to be a stress-free exercise for an average data scientist; instead, it took a day's worth of anger-debugging from a dask maintainer to make it work properly.

This thread is a post-mortem of my experience with it, detailing all the roadblocks that both the original coder and I hit, with proposals on how to make it smooth sailing in the future.
The algorithm
What's implemented is a standard machine learning pre-processing flow:
Implementation
Data loading and column manipulation
This first part loads up the dataframe, generates a few extra columns as a function of other columns, and drops unnecessary columns.
The dataset is publicly accessible - you may reproduce this on your own.
Original code
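A minimal sketch of the kind of code in question - the dataset path and column names are illustrative, not the notebook's:

```python
import dask.dataframe as dd

ddf = dd.read_parquet("s3://nyc-tlc/trip data/yellow_tripdata_*.parquet")
print(f"loaded {len(ddf)} rows")  # looks harmless; see below

# Derive new columns from existing ones, then drop what's not needed
ddf["trip_minutes"] = (
    ddf.tpep_dropoff_datetime - ddf.tpep_pickup_datetime
).dt.total_seconds() / 60
ddf = ddf.drop(columns=["store_and_fwd_flag", "congestion_surcharge"])
```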
To an intermediate user's eye, this looks OK. But it is very, very bad:

- The `print` statement on the second line reads the whole dataset into memory and then discards it.
- The partitions are oversized. If you `repartition()` them into smaller chunks, the whole thing becomes a lot more manageable, even if the initial load still requires computing everything at once.

Revised code
Before starting the cluster:
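A sketch, assuming the current config key for PyArrow-backed strings:

```python
import dask

# Read text columns as PyArrow-backed strings instead of
# pure-python object strings (key name as of dask >= 2023.3).
dask.config.set({"dataframe.convert-string": True})
```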
(this is Coiled-specific. Other clusters will require you to manually set the config on all workers).
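Then, roughly (same illustrative names as above):

```python
import dask.dataframe as dd

# Only read the columns that are actually used downstream.
ddf = dd.read_parquet(
    "s3://nyc-tlc/trip data/yellow_tripdata_*.parquet",
    columns=[
        "tpep_pickup_datetime", "tpep_dropoff_datetime",
        "passenger_count", "trip_distance",
        "PULocationID", "DOLocationID",
    ],
)
ddf["trip_minutes"] = (
    ddf.tpep_dropoff_datetime - ddf.tpep_pickup_datetime
).dt.total_seconds() / 60

# Split the oversized partitions produced by the parquet files.
ddf = ddf.repartition(partition_size="100MB")
```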
Note that the call to `repartition` reads the whole thing in memory and then discards it. This takes a substantial amount of time, but it's the best I could do. It is wholly avoidable though, and I also feel that a novice user should not be bothered with having to deal with oversized chunks themselves:

When `repartition()` is unavoidable, because it's in the middle of a computation, it could avoid computing everything:
As for PyArrow strings: I am strongly convinced they should be the default.
I understand that setting PyArrow strings on by default would cause dask to deviate from pandas. I think it's a case where the deviation is worthwhile - pandas doesn't need to cope with the GIL and serialization!
Being forced to think ahead with column slicing is another interesting discussion. It could be avoided if each column were a separate dask key. For the record, this is exactly what `xarray.Dataset` does. Alternatively, high level expressions (dask#7933) would allow rewriting the dask graph on the fly. With either of these changes, the columns you don't need would never leave the disk (as long as you set chunk_size=1). It would also mean that `len(ddf)` would have to load only a single column (seconds) instead of the whole thing (minutes).

I appreciate that introducing splitting by column in `dask.dataframe` would be a very major effort - but I think it's very likely worth the price.
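In the meantime, a workaround is to count rows through a single small column, since dask can often push the column selection down into the parquet read:

```python
# Hypothetical workaround: loads (at most) one column rather than all of them.
n_rows = len(ddf["PULocationID"])
```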
A much cheaper fix to `len()`: `DataFrame.__len__` / `Series.__len__` should raise a warning if it triggers a compute (dask#9857).

Drop rows
After column manipulation, we move to row filtering:
Original code
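Reconstructed shape of the code (the actual filter conditions are made up):

```python
print(len(ddf.index))                          # full computation no. 1
ddf = ddf[ddf.passenger_count > 0]
print(len(ddf.index))                          # full computation no. 2
ddf = ddf[ddf.trip_distance > 0]
print(len(ddf.index))                          # full computation no. 3
ddf = ddf.repartition(partition_size="100MB")  # full computation no. 4
ddf = ddf.persist()                            # full computation no. 5
print(len(ddf.index))                          # cheap: right after persist()
```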
This snippet recomputes everything so far (load from s3 AND column preprocessing), 😶 FIVE 😶 TIMES 😶:

- Three times, in the calls to `len(ddf.index)`. Again, if the graph were split by columns or rewritten on the fly by high level expressions, these three would be much less of a problem.
- `repartition(partition_size=...)` under the hood calls `compute()` and then discards everything.
- `persist()` recomputes everything from the beginning one more time.

The one thing that is inexpensive is the final call to `len(ddf.index)`, because it comes immediately after a `persist()`.

Revised code
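Something along these lines (same made-up filters):

```python
ddf = ddf[(ddf.passenger_count > 0) & (ddf.trip_distance > 0)]
ddf = ddf.persist()  # the one and only computation of this step
```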
The `repartition()` call is no longer there, since I already called it once and the partitions are now guaranteed to be smaller than or equal to before.
Like before, I've outright removed the `print()` statements. If I had to retain them, I would push them further down, immediately after a call to `persist()`, so that the computation is only done once.
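A sketch of that arrangement:

```python
ddf = ddf.persist()
print(len(ddf.index))  # cheap: counts rows of the already-persisted partitions
```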
Joins
Download the "Taxi Zone Lookup Table (CSV)" from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page and save it to `data/taxi+_zone_lookup.csv`.

Original code
Again, this sneaks more pure-python strings into the dataframe. This would be solved by force-casting them to `string[pyarrow]` in `from_pandas()` (which is called under the hood by `merge()`):

It also makes the typical mistake of using pure-python code for the sake of simplicity instead of going through a pandas join. While this is already non-performant in pandas, when you move to dask you have the added problem that it hogs the GIL.
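In sketch form (the mapping is invented for illustration):

```python
import pandas as pd

zones = pd.read_csv("data/taxi+_zone_lookup.csv")  # object-dtype strings

# merge() wraps the pandas frame via from_pandas() under the hood, so the
# pure-python string columns leak into the dask dataframe:
ddf = ddf.merge(zones, left_on="PULocationID", right_on="LocationID")

# Row-wise pure-python lookup instead of a join: already slow in pandas,
# and on dask workers it additionally hogs the GIL:
SUPERBOROUGH = {"Manhattan": "Superborough 1", "Brooklyn": "Superborough 2"}
ddf["PUSuperborough"] = ddf.Borough.apply(
    SUPERBOROUGH.get, meta=("PUSuperborough", "object")
)
```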
The sensible approach to limiting this problem would be to show the user a plot of how much time they spent on each task:
In addition to the above ticket, we need documentation & evangelism to teach users that, to debug a non-performant workflow, they can look at Prometheus - and which metrics to look at first. At the moment, Prometheus is a single, hard-to-notice page in the dask docs.
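For what it's worth, a per-task timing breakdown can already be captured today with `performance_report`:

```python
from dask.distributed import performance_report, wait

with performance_report(filename="dask-report.html"):
    ddf = ddf.persist()
    wait(ddf)  # block until finished so the report captures the work
```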
The second join is performed on a much bigger dataset than necessary. There's no real solution to this - this is an abstract algorithmic optimization that the developer could have noticed themselves.
Finally, the PUSuperborough and DOSuperborough columns could be dropped. I know they're unnecessary only from reading the next section, about categorization. Again, nothing we (dask devs) can do here.
Revised code
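Roughly (column names as in the lookup CSV; the rest illustrative):

```python
import pandas as pd

zones = pd.read_csv("data/taxi+_zone_lookup.csv").astype(
    {"Borough": "string[pyarrow]", "Zone": "string[pyarrow]",
     "service_zone": "string[pyarrow]"}
)
# Plain joins, no row-wise python code; PUSuperborough / DOSuperborough
# are simply never created, since nothing downstream uses them.
ddf = ddf.merge(zones, left_on="PULocationID", right_on="LocationID")
ddf = ddf.merge(
    zones, left_on="DOLocationID", right_on="LocationID",
    suffixes=("_PU", "_DO"),
)
```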
Categorization
This dataset is going to be fed into a machine learning engine, so everything that can be converted into a domain, should:
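A reconstruction of the code in question (column names illustrative):

```python
ddf = ddf.repartition(partition_size="100MB")
ddf = ddf.astype({"Borough_PU": "category", "Borough_DO": "category"})
ddf = ddf.persist()
ddf = ddf.categorize(columns=["Borough_PU", "Borough_DO"])
```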
The first call to `repartition()` computes everything since the previous call to `persist()` - this includes the pure-python joins - and then discards it. Then, `persist()` computes it all over again.

The call to `astype("category")` is unnecessary. `categorize()`, while OK in this case since it's (accidentally?) just after a `persist()`, is a major pain point on its own:

Revised code
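Sketch of the revision:

```python
ddf = ddf.persist()
ddf = ddf.categorize(columns=["Borough_PU", "Borough_DO"])
```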
Disk write
Original code
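Reconstructed shape (destination path made up):

```python
ddf = ddf.repartition(partition_size="100MB")  # recomputes, then discards
ddf.to_parquet("s3://my-bucket/nyc-taxi-preprocessed/")
```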
The final call to `repartition()` recomputes everything since `persist()` and then discards it - in this case, not that much.

Revised code
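Without the redundant `repartition()`:

```python
ddf.to_parquet("s3://my-bucket/nyc-taxi-preprocessed/")
```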
Action points recap
Low effort
- `DataFrame.__len__` / `Series.__len__` should raise a warning if it triggers a compute (dask#9857)

Intermediate effort
High effort
Very high effort
- Split `dask.DataFrame` by column