Website • Docs • Installation • 10-minute tour of Daft • Community and Support
Daft is a fast, Pythonic and scalable open-source dataframe library built for Python and Machine Learning workloads.
Daft is currently in its beta release phase - please expect bugs and rapid improvements to the project. We welcome user feedback and feature requests in our Discussions forums.
The Daft dataframe is a table of data with rows and columns. Columns can contain any Python objects, which allows Daft to support rich complex data types such as images, audio, video and more.
- Any Data: Beyond the usual strings/numbers/dates, Daft columns can also hold complex multimodal data such as Images, Embeddings and Python objects. Ingestion and basic transformations of complex data are easy and performant in Daft.
- Notebook Computing: Daft is built for the interactive developer experience in a notebook - intelligent caching and query optimizations accelerate your experimentation and data exploration.
- Distributed Computing: Rich complex formats such as images can quickly outgrow your local laptop's computational resources - Daft integrates natively with Ray for running dataframes on large clusters of machines with thousands of CPUs/GPUs.
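Switching from the default local runner to the Ray runner can be sketched as a one-line configuration change (a sketch, assuming `daft.context.set_runner_ray`; the `address` argument and its format are assumptions to verify against your Daft version):

```python
import daft

# Select the distributed Ray runner (the default is the local Python runner).
# With no address, a local Ray instance is started; pass address="..." to
# connect to an existing Ray cluster instead.
daft.context.set_runner_ray()
```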
Install Daft with `pip install getdaft`.
For more advanced installations (e.g. installing from source or with extra dependencies such as Ray and AWS utilities), please see our Installation Guide.
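For example, extra dependencies can be pulled in at install time (the extra names `ray` and `aws` are assumptions - check the Installation Guide for the exact set):

```shell
# Daft with the Ray runner and AWS integrations (assumed extras)
pip install "getdaft[ray,aws]"
```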
Check out our 10-minute quickstart!
In this example, we load images from an AWS S3 bucket by their URLs and resize each image in the dataframe:
```python
import daft

# Load a dataframe from filepaths in an S3 bucket
df = daft.from_glob_path("s3://daft-public-data/laion-sample-images/*")

# 1. Download the column of image URLs as a column of bytes
# 2. Decode the column of bytes into a column of images
df = df.with_column("image", df["path"].url.download().image.decode())

# Resize each image to 32x32
df = df.with_column("resized", df["image"].image.resize(32, 32))

df.show(3)
```
To see the full benchmarks, detailed setup, and logs, check out our benchmarking page.
- 10-minute tour of Daft - learn more about Daft's full range of capabilities including dataloading from URLs, joins, user-defined functions (UDF), groupby, aggregations and more.
- User Guide - take a deep-dive into each topic within Daft
- API Reference - API documentation for Daft's public classes and functions
To start contributing to Daft, please read CONTRIBUTING.md.
To help improve Daft, we collect non-identifiable data.
To disable this behavior, set the following environment variable: DAFT_ANALYTICS_ENABLED=0
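For example, in a POSIX shell:

```shell
# Opt out of Daft analytics for the current shell session
export DAFT_ANALYTICS_ENABLED=0
```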
The data that we collect is:
- Non-identifiable: events are keyed by a session ID which is generated on import of Daft
- Metadata-only: we do not collect any of our users’ proprietary code or data
- For development only: we do not buy or sell any user data
Please see our documentation for more details.
| Dataframe | Query Optimizer | Complex Types | Distributed | Arrow Backed | Vectorized Execution Engine | Out-of-core |
| --- | --- | --- | --- | --- | --- | --- |
| Daft | Yes | Yes | Yes | Yes | Yes | Yes |
| Pandas | No | Python object | No | Optional (>= 2.0) | Some (NumPy) | No |
| Polars | Yes | Python object | No | Yes | Yes | Yes |
| Modin | Eager | Python object | Yes | No | Some (Pandas) | Yes |
| PySpark | Yes | No | Yes | Pandas UDF/IO | Pandas UDF | Yes |
| Dask DF | No | Python object | Yes | No | Some (Pandas) | Yes |
Check out our dataframe comparison page for more details!
Daft is released under the Apache 2.0 license - please see the LICENSE file for details.