Skip to content

begelundmuller/benchmark-clickhouse-duckdb

Repository files navigation

Clickhouse vs. DuckDB benchmarks on local

This project benchmarks two simple queries against Clickhouse and DuckDB. It imports ~1.2GB of Parquet data into each datastore and runs the benchmarks as native queries. It runs all benchmark iterations in the same session and does not reset caches.

Warning: This is far from a rigorous benchmark.

Results

Benchmarks:

duckdb:groupby:         avg=1.390s min=1.301s max=1.543s (10 runs)
clickhouse:groupby:     avg=0.805s min=0.705s max=1.059s (10 runs)
duckdb:self-join:       avg=0.635s min=0.616s max=0.743s (10 runs)
clickhouse:self-join:   avg=0.701s min=0.660s max=0.743s (10 runs)

Executed on a Macbook Pro (2018) with 2.2 GHz 6-Core Intel Core i7 and 16 GB memory.

Disk usage:

Parquet:    1.2G
Clickhouse: 2.2G
DuckDB:     7.1G

Executable binary size:

Clickhouse: 363M
DuckDB:     37M

Instructions

  1. Clone this repo and cd into it

  2. Download data (~1.2GB of NYC taxi data)

mkdir -p data
cd data
curl https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv -o taxi_zone_lookup.csv
curl -O https://nyc-tlc.s3.amazonaws.com/trip+data/fhvhv_tripdata_2022-01.parquet
curl -O https://nyc-tlc.s3.amazonaws.com/trip+data/fhvhv_tripdata_2022-02.parquet
curl -O https://nyc-tlc.s3.amazonaws.com/trip+data/fhvhv_tripdata_2022-03.parquet
cd ..
  1. Install and run Clickhouse (source):
mkdir -p clickhouse
cd clickhouse
curl https://clickhouse.com/ | sh
./clickhouse server
  1. In a new terminal, import data into Clickhouse:
./clickhouse/clickhouse client --queries-file clickhouse_create_trips.sql
./clickhouse/clickhouse client --max_memory_usage 0 --query='INSERT INTO trips FORMAT Parquet' < data/fhvhv_tripdata_2022-01.parquet
./clickhouse/clickhouse client --max_memory_usage 0 --query='INSERT INTO trips FORMAT Parquet' < data/fhvhv_tripdata_2022-02.parquet
./clickhouse/clickhouse client --max_memory_usage 0 --query='INSERT INTO trips FORMAT Parquet' < data/fhvhv_tripdata_2022-03.parquet
  1. Install DuckDB:
mkdir -p duckdb
cd duckdb
# Works on Intel Macs (for ARM and Linux, see: https://duckdb.org/docs/installation/index)
curl -O -L https://github.com/duckdb/duckdb/releases/download/v0.3.2/duckdb_cli-osx-amd64.zip
unzip duckdb_cli-osx-amd64.zip
cd ..
  1. Insert data into DuckDB:
./duckdb/duckdb ./duckdb/db.duckdb -c "CREATE TABLE trips AS SELECT * FROM read_parquet('data/fhvhv_tripdata_2022-01.parquet')"
./duckdb/duckdb ./duckdb/db.duckdb -c "INSERT INTO trips SELECT * FROM read_parquet('data/fhvhv_tripdata_2022-02.parquet')"
./duckdb/duckdb ./duckdb/db.duckdb -c "INSERT INTO trips SELECT * FROM read_parquet('data/fhvhv_tripdata_2022-03.parquet')"
  1. Run the benchmark (requires Poetry):
poetry install
poetry run python benchmark.py
  1. Compute data size on disk:
echo "Parquet:" && du -hs data
echo "Clickhouse:" && du -hs clickhouse/store
echo "DuckDB:" && du -hs duckdb/db.duckdb
  1. Compute executable size:
echo "Clickhouse:" && du -hs clickhouse/clickhouse
echo "DuckDB:" && du -hs duckdb/duckdb

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages