base repository: delta-io/delta-rs
base: 2ec2ee3b731a3d77784385106d8c64bce2831d48
..
head repository: delta-io/delta-rs
compare: f8c344d49647613e341866ca350016fe2731afd0
4 changes: 2 additions & 2 deletions README.md
@@ -6,7 +6,7 @@
<p align="center">
A native Rust library for Delta Lake, with bindings to Python
<br>
-<a href="https://delta-io.github.io/delta-rs/python/">Python docs</a>
+<a href="https://delta-io.github.io/delta-rs/">Python docs</a>
·
<a href="https://docs.rs/deltalake/latest/deltalake/">Rust docs</a>
·
@@ -48,7 +48,7 @@ API that lets you query, inspect, and operate your Delta Lake with ease.

[pypi]: https://pypi.org/project/deltalake/
[pypi-dl]: https://img.shields.io/pypi/dm/deltalake?style=flat-square&color=00ADD4
-[py-docs]: https://delta-io.github.io/delta-rs/python/
+[py-docs]: https://delta-io.github.io/delta-rs/
[rs-docs]: https://docs.rs/deltalake/latest/deltalake/
[crates]: https://crates.io/crates/deltalake
[crates-dl]: https://img.shields.io/crates/d/deltalake?color=F75101
46 changes: 46 additions & 0 deletions crates/benchmarks/Cargo.toml
@@ -0,0 +1,46 @@
[package]
name = "delta-benchmarks"
version = "0.0.1"
authors = ["David Blajda <db@davidblajda.com>"]
homepage = "https://github.com/delta-io/delta.rs"
license = "Apache-2.0"
keywords = ["deltalake", "delta", "datalake"]
description = "Delta-rs Benchmarks"
edition = "2021"

[dependencies]
clap = { version = "4", features = [ "derive" ] }
chrono = { version = "0.4.31", default-features = false, features = ["clock"] }
tokio = { version = "1", features = ["fs", "macros", "rt", "io-util"] }
env_logger = "0"

# arrow
arrow = { workspace = true }
arrow-array = { workspace = true }
arrow-buffer = { workspace = true }
arrow-cast = { workspace = true }
arrow-ord = { workspace = true }
arrow-row = { workspace = true }
arrow-schema = { workspace = true, features = ["serde"] }
arrow-select = { workspace = true }
parquet = { workspace = true, features = [
"async",
"object_store",
] }

# serde
serde = { workspace = true, features = ["derive"] }
serde_json = { workspace = true }

# datafusion
datafusion = { workspace = true }
datafusion-expr = { workspace = true }
datafusion-common = { workspace = true }
datafusion-proto = { workspace = true }
datafusion-sql = { workspace = true }
datafusion-physical-expr = { workspace = true }

[dependencies.deltalake-core]
path = "../deltalake-core"
version = "0"
features = ["datafusion"]
55 changes: 55 additions & 0 deletions crates/benchmarks/README.md
@@ -0,0 +1,55 @@
# Merge
The merge benchmarks are similar to the ones used by [Delta Spark](https://github.com/delta-io/delta/pull/1835).


## Dataset

Databricks maintains a public S3 bucket of the TPC-DS dataset at various scale factors. It is a requester-pays bucket, so the requester pays to download the dataset. Below is an example of how to list the 1 GB scale factor:

```
aws s3api list-objects --bucket devrel-delta-datasets --request-payer requester --prefix tpcds-2.13/tpcds_sf1_parquet/web_returns/
```

You can also generate the TPC-DS dataset yourself by downloading and compiling [the generator](https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp).
You may need to add `-fcommon` to the CFLAGS to compile it with newer versions of GCC.

## Commands
These commands can be executed from the root of the benchmark crate. Some commands depend on the TPC-DS dataset being present.

### Convert
Converts a TPC-DS web_returns CSV file into a Delta table.
Assumes the dataset is pipe-delimited and that records do not have a trailing delimiter.

```
cargo run --release --bin merge -- convert data/tpcds/web_returns.dat data/web_returns
```
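
The input assumption above (pipe-delimited, no trailing delimiter) can be illustrated with a minimal sketch. `split_record` is a hypothetical helper written for this README, not a function from the benchmark crate, and it uses only the standard library:

```rust
/// Hypothetical helper (not part of the benchmark crate): splits one
/// pipe-delimited record into its fields.
///
/// With no trailing delimiter, "a|b|c" yields exactly three fields.
/// Empty fields such as the middle one in "a||c" are kept as empty strings.
fn split_record(line: &str) -> Vec<&str> {
    line
        // Drop a trailing line ending, if any, before splitting.
        .trim_end_matches(|c| c == '\r' || c == '\n')
        .split('|')
        .collect()
}
```

If a record did end with a trailing `|`, this split would produce one extra empty field at the end, which is why the assumption matters for the column count.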

### Standard
Execute the standard merge bench suite.
Results can be saved to a Delta table for further analysis.
This table has the following schema:

- `group_id`: Groups all tests that were executed as part of this call. The default value is the timestamp of execution.
- `name`: The name of the benchmark that was executed.
- `sample`: The iteration number for a given benchmark name.
- `duration_ms`: How long the benchmark took, in milliseconds.
- `data`: Free-form field to pack any additional data.

```
cargo run --release --bin merge -- standard data/web_returns 1 data/merge_results
```
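
As a rough picture of one row in the results table, the schema fields listed above could be modelled as a plain Rust struct. This is only a sketch: the field names come from the schema, but the Rust types, the `BenchResult` name, and the `from_timing` constructor are assumptions made for illustration and are not taken from the benchmark crate.

```rust
use std::time::Duration;

/// One row of the results table. Field names follow the schema above;
/// the Rust types are assumptions made for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct BenchResult {
    /// Shared by all benchmarks executed in one call.
    group_id: String,
    /// Name of the benchmark that was executed.
    name: String,
    /// Iteration number for this benchmark name.
    sample: u32,
    /// Wall-clock duration in milliseconds.
    duration_ms: u64,
    /// Free-form additional data (left empty here).
    data: String,
}

impl BenchResult {
    /// Hypothetical constructor: builds a row from a measured duration.
    fn from_timing(group_id: &str, name: &str, sample: u32, elapsed: Duration) -> Self {
        BenchResult {
            group_id: group_id.to_string(),
            name: name.to_string(),
            sample,
            duration_ms: elapsed.as_millis() as u64,
            data: String::new(),
        }
    }
}
```

A caller would typically time the benchmark with `std::time::Instant` and pass `start.elapsed()` as the `elapsed` argument.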

### Compare
Compare the results of two different runs.
Provide the Delta table path and the `group_id` of each run to obtain the speedup for each test case.

```
cargo run --release --bin merge -- compare data/benchmarks/ 1698636172801 data/benchmarks/ 1699759539902
```
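
How the speedup is computed is not spelled out here. One plausible reading, shown below as a sketch, is the mean `duration_ms` of the first run divided by the mean `duration_ms` of the second run, per benchmark name. The functions `mean_by_name` and `speedups` are hypothetical and work on in-memory `(name, duration_ms)` pairs rather than on Delta tables.

```rust
use std::collections::BTreeMap;

/// Averages `duration_ms` per benchmark name for one run.
/// Each input pair is (benchmark name, duration in ms) for one sample.
fn mean_by_name(rows: &[(&str, u64)]) -> BTreeMap<String, f64> {
    let mut acc: BTreeMap<String, (u64, u32)> = BTreeMap::new();
    for (name, ms) in rows {
        let entry = acc.entry(name.to_string()).or_insert((0, 0));
        entry.0 += *ms; // running total of durations
        entry.1 += 1; // number of samples seen
    }
    acc.into_iter()
        .map(|(name, (total, n))| (name, total as f64 / n as f64))
        .collect()
}

/// Assumed definition of speedup: mean(before) / mean(after).
/// Values above 1.0 mean the second run was faster.
/// Benchmarks that appear in only one of the two runs are skipped.
fn speedups(before: &[(&str, u64)], after: &[(&str, u64)]) -> BTreeMap<String, f64> {
    let before = mean_by_name(before);
    let after = mean_by_name(after);
    before
        .into_iter()
        .filter_map(|(name, b)| after.get(&name).map(|a| (name, b / *a)))
        .collect()
}
```

Under this definition, a benchmark that averaged 300 ms in the first run and 150 ms in the second has a speedup of 2.0.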

### Show
Show all benchmark results from a Delta table.

```
cargo run --release --bin merge -- show data/benchmark
```