Memory issues despite lazy operations #1193
Comments
For possible reference: dask/distributed#2602
Thanks @zklaus, that sounds exactly like the problem here. I found a (relatively) easy workaround for this: by moving the preprocessor step […]. Is this something that might be interesting for us? This could be implemented as an optional […].
Nice find @schlunma - unfortunately we need to perform the concatenation preprocessor step at the beginning, since the data is fragmented most of the time. Unless I am misinterpreting your workaround?
Ahh! Yes, now that I think about it, most of the time-related preprocessors would fail. In my case it worked just fine (I only used […]). I will do some more tests tomorrow with a dataset that is not spread among different files and see how that performs. My guess is that it works well, similar to the test I did with the raw […].
Sounds quite similar to the behaviour I've observed in #968 (comment), and that conversation in the dask repo is interesting! Not sure if what you mention here ([…]) is actually the backpressure that they're talking about. The way I understand it, "backpressure" would be an analogy to a channel flow (or any kind of flow, really) where a bottleneck at one point causes a reduction of the inflow from upstream; this feature is actually desirable. What you describe sounds more like "over-pressure" (see the reference they pointed to here), i.e. too much data flowing in from upstream and overloading the bottleneck, which is undesirable.
Yes @Peter9192, you are 100% correct!
Looks like the linked dask issue finally got fixed, so it would be interesting to see if our problem also disappears with the next release of dask.
Yes, though note that that specific issue is in distributed. If I recall correctly, we might be using different dask schedulers, so the implications might be weaker.
Hey @bouweandela, I have run the recipe above with 4 cores and 8 GB memory using #1714 and it completes without errors!
That's great to hear. Note that there have been recent, massive improvements in dask/distributed on these things; cf., e.g., dask/distributed#7128 and related issues and PRs in the dask and distributed changelogs. When we get serious about it, I would suggest pinning at least […].
The recipe runs now with the current version of the tool. I could also get a significant speed up by using a distributed scheduler with #2049. Closing this now, feel free to re-open if necessary.
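For reference, running under a distributed scheduler roughly amounts to the following. This is a minimal hand-written sketch using plain `dask.distributed`; the worker count and memory limit are picked purely for illustration, and #2049 wires this up through ESMValCore configuration rather than user code.

```python
# Minimal sketch of starting a local dask distributed scheduler by hand.
# Worker count and per-worker memory limit are illustrative values only;
# #2049 provides this via ESMValCore configuration instead of user code.
from dask.distributed import Client

client = Client(n_workers=4, threads_per_worker=2, memory_limit="8GiB")
print(client.dashboard_link)  # dashboard for watching memory use per worker
```

The dashboard is handy for spotting exactly the "ramp up and crash" memory pattern described below.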
Right now, I'm trying to analyze 25 years of monthly ERA5 data (variable `ta`). Since this dataset is 4D (time, height, lat, lon), the file sizes are pretty large. After concatenation, the cube's shape is `(300, 37, 721, 1440)`, which corresponds to about 86 GiB assuming `float64`.

When using the following recipe to extract the zonal means […] on a node with 64 GiB memory, the process hangs indefinitely after reaching […].

I'm fairly sure this is due to memory issues. As the following memory samples (taken every second) show, the memory slowly ramps up to the maximum and then suddenly drops. I guess the OS somehow shuts down the process after the entire RAM fills up.
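As a quick sanity check on that figure (a back-of-the-envelope calculation, not part of the original report), the array size follows directly from the shape and dtype:

```python
# Back-of-the-envelope size of a float64 array with shape (300, 37, 721, 1440).
import math

shape = (300, 37, 721, 1440)
nbytes = math.prod(shape) * 8  # 8 bytes per float64 value
print(nbytes / 2**30)          # ~85.9 GiB, i.e. the ~86 GiB quoted above
```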
The `zonal_statistics` preprocessor that is used here should be 100% lazy (see `esmvalcore/preprocessor/_area.py`, line 142 at commit f24575b).
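For illustration, here is a minimal sketch of what "100% lazy" means for such a cube operation. This is not the actual `zonal_statistics` code, and the file name is hypothetical:

```python
# Sketch: verify that collapsing over longitude keeps the iris cube lazy.
# "ta.nc" is a hypothetical file name, not one from this issue.
import iris
import iris.analysis

cube = iris.load_cube("ta.nc")
print(cube.has_lazy_data())         # True: nothing loaded into memory yet
zonal_mean = cube.collapsed("longitude", iris.analysis.MEAN)
print(zonal_mean.has_lazy_data())   # still True: MEAN supports lazy aggregation
```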
So I'm really not sure what's going on here. I would have expected that a node with 64 GiB of memory should be able to handle 86 GiB of lazy data. I think this is somehow related to `dask`, as adding a `cube.lazy_data().mean(axis=3).compute()` to the line above triggers the exact same behavior. At this moment, the (lazy) array looks like this, which seems reasonable to me: […]

For comparison, setting up a random array with the same sizes in `dask` and doing the same operation works perfectly well: […]
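That comparison looks roughly like the sketch below; the chunk layout is an assumption, since the actual chunks of the ERA5 cube are not shown here:

```python
# Sketch of the comparison: a random dask array with the same shape as the cube,
# reduced over the last (longitude) axis. The chunk layout here is assumed.
import dask.array as da

a = da.random.random((300, 37, 721, 1440), chunks=(1, 37, 721, 1440))
result = a.mean(axis=3).compute()  # reported to complete without memory trouble
print(result.shape)                # (300, 37, 721)
```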
@ESMValGroup/esmvaltool-coreteam Has anyone had similar experiences? I know that this is not exactly an issue of ESMValTool, but maybe there is something we can configure in `dask` that helps here? Apart from using a larger node, there is no way to evaluate this dataset at the moment.