Smart automatic dask wrapping #530

Open

aulemahal opened this issue Nov 4, 2021 · 6 comments
Labels: enhancement

Comments


aulemahal commented Nov 4, 2021

Is your feature request related to a problem? Please describe.
When a dask array is created from a SparseArray without explicit chunks, dask automatically chunks the data based on its shape and its config (dask's array.chunk-size). The point of sparse arrays is that they can be enormous while holding only a few values, but dask doesn't see that. As a result, a single array can be split into many chunks, which multiplies the number of tasks in dask's graph and thus hurts performance.

Evidently, when working directly with sparse and dask, one can give the chunks of the requested dask array explicitly, but this is not possible when the wrapping happens under the hood.

My use case is an xarray.DataArray backed by a sparse.COO array, passed to xarray.apply_ufunc together with a dask-backed DataArray. In that case, xarray sends all inputs to dask.array.apply_gufunc, and the wrapping into dask happens in dask.array.core.asarray. Our only option is thus to pre-wrap the sparse array into a dask array before the computation, as sketched below. I think it would be interesting if this were done implicitly.
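
For reference, the explicit pre-wrapping looks roughly like this (a minimal sketch; the toy arrays and np.multiply stand in for the real inputs and function):

import numpy as np
import sparse as sp
import xarray as xr
import dask.array as da

# Toy inputs standing in for the real use case
A = sp.COO([[0, 5000, 10000], [0, 5000, 10000]], [1, 2, 3])
sparse_da = xr.DataArray(A, dims=("x", "y"))
dask_da = xr.DataArray(da.ones((10001, 10001), chunks=5000), dims=("x", "y"))

# Pre-wrap: chunks=-1 keeps the sparse array in a single chunk instead of
# letting dask split it according to its (dense) shape.
sparse_wrapped = sparse_da.copy(data=da.from_array(sparse_da.data, chunks=-1))

out = xr.apply_ufunc(np.multiply, sparse_wrapped, dask_da, dask="allowed")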

Describe the solution you'd like
The cleanest option I see is to implement SparseArray.to_dask_array, which dask detects and uses automatically. There we could wrap into a dask array while taking into account that the real size of the array comes from .nnz, not .shape. Optionally, we could read dask's config to respect array.chunk-size. A rough sketch follows.
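
A hypothetical sketch of what such a method could look like (none of this exists in sparse yet; parse_bytes and dask.config.get are existing dask utilities):

import dask
import dask.array as da
from dask.utils import parse_bytes

# Hypothetical method on sparse's SparseArray base class:
def to_dask_array(self, chunks=None):
    if chunks is None:
        # .nbytes counts only the stored values (coords + data), so an array
        # that is huge in .shape but tiny in .nnz can stay a single chunk,
        # as long as it fits dask's configured chunk-size.
        limit = parse_bytes(dask.config.get("array.chunk-size"))
        chunks = -1 if self.nbytes <= limit else "auto"
    return da.from_array(self, chunks=chunks)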

Describe alternatives you've considered
Alternatives are:

  1. Handling this in our function explicitly.
  2. Handling this in xarray.
  3. Handling this in dask (we might be able to cover scipy sparse arrays as well?).

But I felt this was the best place.

Additional context
Raised by issue pangeo-data/xESMF#127.

Example

import sparse as sp
import dask.array as da

# A 10001 x 10001 array holding only three non-zero values
A = sp.COO([[0, 5000, 10000], [0, 5000, 10000]], [1, 2, 3])

da.from_array(A)
# Auto-chunking looks only at .shape, so this nearly empty array is split into many chunks:
# dask.array<array, shape=(10001, 10001), dtype=int64, chunksize=(4096, 4096), chunktype=sparse.COO>

da.from_array(A, chunks={})
# An empty chunks dict leaves every axis at full extent, i.e. a single chunk:
# dask.array<array, shape=(10001, 10001), dtype=int64, chunksize=(10001, 10001), chunktype=sparse.COO>
aulemahal added the enhancement label on Nov 4, 2021

huard commented Nov 10, 2021

cc @dcherian : are there other options we should consider?

@hameerabbasi (Collaborator)

Is there a discussion ongoing with the Dask devs? I think it should consider nbytes, which I believe was added back in the day by @mrocklin for this specific purpose.
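
For instance, on the toy array from the example above (exact byte counts depend on the coords dtype):

import sparse as sp

A = sp.COO([[0, 5000, 10000], [0, 5000, 10000]], [1, 2, 3])
A.shape   # (10001, 10001) -- roughly 800 MB if it were dense int64
A.nbytes  # a few dozen bytes: only coords + data are stored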


huard commented Nov 10, 2021

Not from our end at least.

@hameerabbasi (Collaborator)

I'd be willing to review, accept, and guide a PR for this (with the proposed solution), but I don't know enough about load-balancing to create a good enough solution myself.

mrocklin (Collaborator) commented Nov 11, 2021 via email

mrocklin (Collaborator) commented Nov 11, 2021 via email
