This project was created by Winder.ai, an ML consultancy, and funded by Protocol Labs, the creators of Bacalhau, a Compute over Data framework for IPFS & Filecoin
This repository contains a detailed analysis of the current state of general-purpose computation frameworks and a series of sample demos and benchmarks.
We've intentionally reviewed a wide range of heterogeneous frameworks to provide a broad overview and capture nuances between frameworks. Thus, next to big data tools like Apache Hadoop and Apache Spark, we've reviewed databases such as Postgres and Snowflake. Note the latter is a proprietary SaaS warehouse product, despite the limited insights on its internal processing we've included it for its blazoned performance. We couldn't not wink to the pythonic data analysis world, so we've also covered top-rated analysis tools such as Pandas and its parallel/distributed brother Dask.
In this repository you find:
- A collection of working code examples to demonstrate a variety of use cases for different computing frameworks, the table below illustrates the coverage.
- A set of scripts and instructions to benchmark running time and resource utilization of different computing frameworks. Please see the instructions below.
We provide accompanying slides that summarize this work and report on the benchmarks results - [link to slides].
We provide working examples for embarrassingly parallel workloads that can be computed next to data.
Take a look at the sample-code/
folder for viewing the demos (no installation needed!).
If you want to run them live, please follow the instructions below.
hadoop | spark | pandas | dask | postgres | snowflake | |
---|---|---|---|---|---|---|
Word count | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Average house price | ✅ | ✅ | ✅ | ✅ | ✅ | |
Derivative dataset (i.e. head -n 10 <file.txt> ) |
✅ | ✅ | ✅ |
- Clone this repo
git clone https://github.com/winderai/bacalhau-landscape-analysis-benchmarks.git
on your machine. - Create the AWS resources to host your cluster. In this step you shall spin up one single EC2 instance.
- Install the single-node setup shipped with this repo.
- Install
jupyter lab
in your base environment. - Launch
cd sample-code && jupyter lab --ip=0.0.0.0
to run the sample notebooks insample-code/
.
The benchmarks consist of a timed Word count job running on all frameworks mentioned above. Each run is launched one at a time and requires some manual preliminary work (e.g. spin-up cluster). During the creation of AWS resources you can select the number of nodes you'd like to spawn, can do single or multi-node setups. The difference is the latter installation is way more cumbersome so take your time.
To facilitate the logging of running time, job parameters and various environment variables, the launch scripts use MLflow. Differently, resource usage (i.e., cpu, memory, disk) is logged via AWS CloudWatch. This means CloudWatch metrics can be fetched from their dashboard starting from 5-10 minutes after the experiment has been completed. This is to allow the metrics to flow into AWS sink.
Feel free to explore the benchmark/
directory to familiarize yourself with this setup, or follow the instructions below to spawn a cluster and run the benchmarks.
- Create the AWS resources to host your cluster. In this step, you'll set the number of EC2 instances you're going to spin up.
- Clone this repo
git clone https://github.com/winderai/bacalhau-landscape-analysis-benchmarks.git
on the main cluster node. - Install the various computation frameworks on your hosts by following the instructions, we provide Single-node and Multi-node setups, depending on the cluster size you'd like to deploy. You need
ssh
to access to the hosts. - [Optional] Install an AWS CloudWatch agent on each host.
- Run the benchmark scripts.
- [Optional] Fetch CloudWatch metrics through AWS console.
For further details on this repository please reach out to Enrico ([email protected]) or Phil ([email protected]).