A minimal task scheduling abstraction and parallel arrays.
- dask is a specification to describe task dependency graphs.
- dask.array is a drop-in NumPy replacement (for a subset of NumPy) that encodes blocked algorithms in dask dependency graphs.
- dask.async is a shared-memory asynchronous scheduler that efficiently executes dask dependency graphs on multiple cores.
See full dask documentation at http://dask.readthedocs.org
Install with:
python setup.py install
Consider the following simple program:
def inc(i):
    return i + 1

def add(a, b):
    return a + b

x = 1
y = inc(x)
z = add(y, 10)
We encode this as a dictionary in the following way:
d = {'x': 1,
     'y': (inc, 'x'),
     'z': (add, 'y', 10)}
While less aesthetically pleasing, this dictionary may now be analyzed, optimized, and computed on by other Python code, not just the Python interpreter.
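For example, a few lines of Python are enough to walk such a dictionary and compute any key on demand. The sketch below is purely illustrative (it is not dask's scheduler) and assumes only the tuple-as-task convention shown above:

def get(dsk, key):
    # Illustrative recursive evaluator, not dask's scheduler.
    value = dsk[key]
    if isinstance(value, tuple) and callable(value[0]):
        func, args = value[0], value[1:]
        # Arguments may be other keys in the graph or literal values.
        return func(*(get(dsk, a) if (isinstance(a, str) and a in dsk) else a
                      for a in args))
    return value

get(d, 'z')  # 12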
The dask.array module creates these graphs from NumPy-like operations:
>>> import dask.array as da
>>> x = da.random.random((4, 4), blockshape=(2, 2))
>>> x.T[0, 3].dask
{('x', 0, 0): (np.random.random, (2, 2)),
('x', 0, 1): (np.random.random, (2, 2)),
('x', 1, 0): (np.random.random, (2, 2)),
('x', 1, 1): (np.random.random, (2, 2)),
('y', 0, 0): (np.transpose, ('x', 0, 0)),
('y', 0, 1): (np.transpose, ('x', 1, 0)),
('y', 1, 0): (np.transpose, ('x', 0, 1)),
('y', 1, 1): (np.transpose, ('x', 1, 1)),
('z',): (getitem, ('y', 0, 1), (0, 1))}
Finally, a scheduler executes these graphs to achieve the intended result. The dask.async module contains a shared-memory scheduler that efficiently leverages multiple cores.
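For instance, the graph d defined above can be evaluated with a get-style call. This is a hedged sketch using the simple single-threaded reference scheduler in dask.core; the dask.async schedulers accept the same graph-and-keys arguments:

>>> from dask.core import get
>>> get(d, 'z')
12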
dask.core supports Python 2.6+ and Python 3.3+ with a common codebase. It is pure Python and requires no dependencies beyond the standard library, making it a lightweight dependency.
dask.array depends on numpy. dask.bag depends on toolz and dill.
New BSD. See the LICENSE file.
One might ask why we didn't use one of these other fine libraries:
- Luigi
- Joblib
- mrjob
- Any of the fine schedulers in numeric analysis (DAGue, ...)
- Any of the fine high-throughput schedulers (Condor, Pegasus, Swiftlang, ...)
The answer is because we wanted all of the following:
- Fine-ish grained parallelism (latencies around 1ms)
- In-memory communication of intermediate results
- Dependency structures more complex than map
- Good support for numeric data
- First class Python support
- Trivial installation
Most task schedulers in the Python ecosystem target long-running batch jobs, often for processing large amounts of text, and aren't appropriate for executing multi-core numerics.
There are many "Big NumPy Array" or general distributed array solutions, all with fine characteristics. Some projects in the Python ecosystem include the following:
There is a rich history of distributed array computing. An incomplete sampling includes the following projects: