This is a research project to evaluate performing distributed SQL queries from Python, using Ray and DataFusion.
- Demonstrate how easily new systems can be built on top of DataFusion. See the design documentation to understand how RaySQL works.
- Drive requirements for DataFusion's Python bindings.
- Create content for an interesting blog post or conference talk.
- Non-goal: building and supporting a production system.
Run the following example live in your browser using a Google Colab notebook.
```python
import ray
from raysql.context import RaySqlContext
from raysql.worker import Worker

# Start a local Ray cluster
ray.init()

# Create some remote workers
workers = [Worker.remote() for _ in range(2)]

# Create a remote context and register a table
# (the final argument indicates that the CSV file has a header row)
ctx = RaySqlContext.remote(workers)
ray.get(ctx.register_csv.remote('tips', 'tips.csv', True))

# Parquet is also supported
# ctx.register_parquet('tips', 'tips.parquet')

result_set = ray.get(ctx.sql.remote(
    'select sex, smoker, avg(tip/total_bill) as tip_pct from tips group by sex, smoker'))
print(result_set)
```
- RaySQL can run all queries in the TPC-H benchmark
- Mature SQL support (CTEs, joins, subqueries, etc.) thanks to DataFusion (see the sketch after this list)
- Support for CSV and Parquet files
- Currently requires a shared file system, since shuffle data is exchanged via disk
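As a hedged illustration of the SQL support inherited from DataFusion (this exact query is hypothetical, not taken from the project's tests), a query using a CTE can be run through the same `ctx` as in the example above:

```python
# Reuses the `ctx` and the `tips` table registered in the example above.
sql = """
with tip_stats as (
    select sex, smoker, avg(tip / total_bill) as tip_pct
    from tips
    group by sex, smoker
)
select * from tip_stats where tip_pct > 0.15
"""
result_set = ray.get(ctx.sql.remote(sql))
print(result_set)
```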
This chart shows the performance of RaySQL compared to Apache Spark for SQLBench-H on a very small data set (10 GB), running on a desktop (Threadripper with 24 physical cores). Both RaySQL and Spark are configured with 24 executors.
RaySQL is ~1.9x faster overall for this scale factor and environment with disk-based shuffle.
Spark is much faster on some queries, likely due to broadcast exchanges, which RaySQL hasn't implemented yet.
I'm planning on experimenting with the following changes to improve performance:
- Make better use of Ray futures to run more tasks in parallel
- Use the Ray object store for shuffle data transfer to reduce disk I/O cost (see the sketch after this list)
- Keep upgrading to newer versions of DataFusion to pick up the latest optimizations
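The object-store idea can be illustrated with a minimal sketch in plain Ray (this is not RaySQL's actual shuffle implementation): map tasks put partition outputs into the object store and return `ObjectRef`s, and reduce tasks fetch them by reference, so no shared file system or disk round trip is needed.

```python
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def map_task(partition):
    # Write this task's shuffle output to the object store and hand
    # back an ObjectRef instead of a path on a shared file system.
    return ray.put([x * 2 for x in partition])

@ray.remote
def reduce_task(shuffle_refs):
    # Fetch all shuffle inputs directly from the object store.
    batches = [ray.get(ref) for ref in shuffle_refs]
    return sum(sum(batch) for batch in batches)

# Two map partitions feeding a single reduce task.
refs = ray.get([map_task.remote(p) for p in ([1, 2], [3, 4])])
print(ray.get(reduce_task.remote(refs)))  # 2 + 4 + 6 + 8 = 20
```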
```bash
# prepare development environment (used to build wheel / install in development)
python3 -m venv venv

# activate the venv
source venv/bin/activate

# update pip itself if necessary
python -m pip install -U pip

# install dependencies (for Python 3.8+)
python -m pip install -r requirements-in.txt
```
Whenever the Rust code changes (your own changes or after a `git pull`):
```bash
# make sure you activate the venv using "source venv/bin/activate" first
maturin develop
python -m pytest
```
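As a hedged sketch (not a test from this repository), a pytest case exercising the API shown earlier might look like the following; it assumes `register_csv` takes a has-header flag as its final argument, as in the example above.

```python
import ray
from raysql.context import RaySqlContext
from raysql.worker import Worker

def test_group_by_aggregate(tmp_path):
    # Write a tiny CSV fixture with the same columns as the tips example.
    csv_path = tmp_path / "tips.csv"
    csv_path.write_text(
        "sex,smoker,tip,total_bill\n"
        "F,No,1.0,10.0\n"
        "M,Yes,2.0,10.0\n"
    )

    ray.init(ignore_reinit_error=True)
    workers = [Worker.remote() for _ in range(2)]
    ctx = RaySqlContext.remote(workers)
    ray.get(ctx.register_csv.remote('tips', str(csv_path), True))

    result_set = ray.get(ctx.sql.remote(
        'select sex, avg(tip/total_bill) as tip_pct from tips group by sex'))
    assert result_set is not None
```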
Create a release build when running benchmarks, then use pip to install the wheel.

```bash
maturin develop --release
```
To change test dependencies, edit `requirements.in` and run:
```bash
# install pip-tools (this only needs to be done once); consider running it inside the venv
python -m pip install pip-tools
python -m piptools compile --generate-hashes -o requirements-310.txt
```
To update dependencies, run with `-U`:

```bash
python -m piptools compile -U --generate-hashes -o requirements-310.txt
```
More details are available in the pip-tools documentation.