
Commit 33b7583

Merge branch 'main' into fix-write-to-fuse
2 parents 152c405 + accf6e0 commit 33b7583


69 files changed: +4897 −2404 lines

.github/workflows/python_build.yml

+1 −1

@@ -118,7 +118,7 @@ jobs:
      - name: Run tests
        run: |
          source venv/bin/activate
-         python -m pytest -m '((s3 or azure) and integration) or not integration and not benchmark'
+         python -m pytest -m '((s3 or azure) and integration) or not integration and not benchmark' --doctest-modules

      - name: Test without pandas
        run: |
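For context, `--doctest-modules` makes pytest also collect and run the examples embedded in module docstrings, so snippets like the hypothetical one below get exercised in CI alongside the regular tests:

```python
# Hypothetical helper, only to illustrate what --doctest-modules picks up:
# pytest imports the module and runs the >>> example as a test.
def qualify(table_name: str, schema: str = "main") -> str:
    """Return a schema-qualified table name.

    >>> qualify("events")
    'main.events'
    """
    return f"{schema}.{table_name}"
```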

CONTRIBUTING.md

+75
@@ -11,3 +11,78 @@ If you want to contribute something more substantial, see our "Projects seeking
## Claiming an issue

If you want to claim an issue to work on, you can write the word `take` as a comment in it and you will be automatically assigned.

## Quick start

- Install Rust, e.g. as described [here](https://doc.rust-lang.org/cargo/getting-started/installation.html)
- Have a compatible Python version installed (check `python/pyproject.toml` for the current requirement)
- Create a Python virtual environment (required for development builds), e.g. as described [here](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/)
- Build the project for development (this requires an active virtual environment and will also install `deltalake` into that virtual environment); a quick sanity check of the build is sketched after this list
```
cd python
make develop
```
- Run some Python code, e.g. to run a specific test
```
python -m pytest tests/test_writer.py -s -k "test_with_deltalake_schema"
```
- Run some Rust code, e.g. run an example
```
cd crates/deltalake
cargo run --example basic_operations
```
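Once `make develop` has run, a quick sanity check of the development build is to round-trip a tiny table through the Python bindings. This is a minimal sketch, assuming `pyarrow` is available in the virtual environment (it normally comes in as a dependency of `deltalake`); the temporary path is just an example:

```python
# Smoke test: write a small Delta table and read it back.
import tempfile

import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

with tempfile.TemporaryDirectory() as path:
    write_deltalake(path, pa.table({"id": [1, 2, 3]}))   # create table version 0
    dt = DeltaTable(path)                                # reopen it from disk
    print(dt.version(), dt.to_pyarrow_table().num_rows)  # expected: 0 3
```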

## Run the docs locally
*This serves your local contents of the docs via a web browser, handy for checking what they look like if you are making changes to the docs or docstrings*
```
(cd python; make develop)
pip install -r docs/requirements.txt
mkdocs serve
```

## To make a pull request (PR)
- Make sure all the following steps run/pass locally before submitting a PR
```
cargo fmt -- --check
cd python
make check-rust
make check-python
make develop
make unit-test
make build-docs
```

## Developing in VSCode

*These are just some basic steps/components to get you started; there are many other very useful extensions for VSCode*

- For a better Rust development experience, install the [rust extension](https://marketplace.visualstudio.com/items?itemName=1YiB.rust-bundle)
- For debugging Rust code, install [CodeLLDB](https://marketplace.visualstudio.com/items?itemName=vadimcn.vscode-lldb). The extension should even create Debug launch configurations for the project if you allow it, which is an easy way to get started. Just set a breakpoint and run the relevant configuration.
- For debugging from Python into Rust, follow this procedure:
1. Add this to `.vscode/launch.json`
```
{
    "type": "lldb",
    "request": "attach",
    "name": "LLDB Attach to Python",
    "program": "${command:python.interpreterPath}",
    "pid": "${command:pickMyProcess}",
    "args": [],
    "stopOnEntry": false,
    "environment": [],
    "externalConsole": true,
    "MIMode": "lldb",
    "cwd": "${workspaceFolder}"
}
```
2. Add a `breakpoint()` statement somewhere in your Python code (in the main function or at any point you know will be executed when you run it) - a minimal example is sketched after this list
3. Add a breakpoint in the Rust code in the VSCode editor where you want to drop into the debugger
4. Run the relevant Python code/function in your terminal; execution should drop into the Python debugger, showing the `PDB` prompt
5. Run the following in that prompt to get the Python process ID: `import os; os.getpid()`
6. Run the `LLDB Attach to Python` configuration from the `Run and Debug` panel of VSCode. This will prompt you for a process ID to attach to; enter the Python process ID obtained earlier (it will also be in the dropdown, but that dropdown will list many process IDs)
7. LLDB may take a couple of seconds to attach to the process
8. When the debugger is attached to the process (you will notice the debugger panels get filled with extra info), enter `c` + Enter at the `PDB` prompt in your terminal; execution should continue until the breakpoint in the Rust code is hit. From this point on it's a standard debugging process.
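For steps 2-5, the Python side can be as small as the sketch below; the table path is a placeholder, and any call that crosses into the Rust extension will do:

```python
# Sketch of a driver script for attaching LLDB to a running Python process.
import os

from deltalake import DeltaTable

print("attach LLDB to pid:", os.getpid())  # the process ID to enter in step 6
breakpoint()  # drops into pdb; type `c` + Enter once LLDB is attached (step 8)

# This call crosses into the Rust extension, so Rust breakpoints set in
# VSCode (step 3) can be hit here. The path is a placeholder.
dt = DeltaTable("path/to/some/table")
print(dt.version())
```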

README.md

+2 −2

@@ -6,7 +6,7 @@
 <p align="center">
 A native Rust library for Delta Lake, with bindings to Python
 <br>
-<a href="https://delta-io.github.io/delta-rs/python/">Python docs</a>
+<a href="https://delta-io.github.io/delta-rs/">Python docs</a>
 ·
 <a href="https://docs.rs/deltalake/latest/deltalake/">Rust docs</a>
 ·

@@ -48,7 +48,7 @@ API that lets you query, inspect, and operate your Delta Lake with ease.

 [pypi]: https://pypi.org/project/deltalake/
 [pypi-dl]: https://img.shields.io/pypi/dm/deltalake?style=flat-square&color=00ADD4
-[py-docs]: https://delta-io.github.io/delta-rs/python/
+[py-docs]: https://delta-io.github.io/delta-rs/
 [rs-docs]: https://docs.rs/deltalake/latest/deltalake/
 [crates]: https://crates.io/crates/deltalake
 [crates-dl]: https://img.shields.io/crates/d/deltalake?color=F75101

crates/benchmarks/Cargo.toml

+46
@@ -0,0 +1,46 @@
[package]
name = "delta-benchmarks"
version = "0.0.1"
authors = ["David Blajda <[email protected]>"]
homepage = "https://github.com/delta-io/delta.rs"
license = "Apache-2.0"
keywords = ["deltalake", "delta", "datalake"]
description = "Delta-rs Benchmarks"
edition = "2021"

[dependencies]
clap = { version = "4", features = [ "derive" ] }
chrono = { version = "0.4.31", default-features = false, features = ["clock"] }
tokio = { version = "1", features = ["fs", "macros", "rt", "io-util"] }
env_logger = "0"

# arrow
arrow = { workspace = true }
arrow-array = { workspace = true }
arrow-buffer = { workspace = true }
arrow-cast = { workspace = true }
arrow-ord = { workspace = true }
arrow-row = { workspace = true }
arrow-schema = { workspace = true, features = ["serde"] }
arrow-select = { workspace = true }
parquet = { workspace = true, features = [
    "async",
    "object_store",
] }

# serde
serde = { workspace = true, features = ["derive"] }
serde_json = { workspace = true }

# datafusion
datafusion = { workspace = true }
datafusion-expr = { workspace = true }
datafusion-common = { workspace = true }
datafusion-proto = { workspace = true }
datafusion-sql = { workspace = true }
datafusion-physical-expr = { workspace = true }

[dependencies.deltalake-core]
path = "../deltalake-core"
version = "0"
features = ["datafusion"]

crates/benchmarks/README.md

+55
@@ -0,0 +1,55 @@
# Merge
The merge benchmarks are similar to the ones used by [Delta Spark](https://github.com/delta-io/delta/pull/1835).

## Dataset

Databricks maintains a public S3 bucket of the TPC-DS dataset at various scale factors, where requesters must pay to download the data. Below is an example of how to list the 1 GB scale factor:

```
aws s3api list-objects --bucket devrel-delta-datasets --request-payer requester --prefix tpcds-2.13/tpcds_sf1_parquet/web_returns/
```

You can generate the TPC-DS dataset yourself by downloading and compiling [the generator](https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp).
You may need to update the CFLAGS to include `-fcommon` to compile on newer versions of GCC.

## Commands
These commands can be executed from the root of the benchmark crate. Some commands depend on the TPC-DS dataset already existing.

### Convert
Converts a TPC-DS web_returns CSV into a Delta table.
Assumes the dataset is pipe-delimited and records do not have a trailing delimiter.

```
cargo run --release --bin merge -- convert data/tpcds/web_returns.dat data/web_returns
```

### Standard
Execute the standard merge benchmark suite.
Results can be saved to a Delta table for further analysis.
This table has the following schema:

- `group_id`: Used to group all tests that executed as a part of this call. The default value is the timestamp of execution
- `name`: The benchmark name that was executed
- `sample`: The iteration number for a given benchmark name
- `duration_ms`: How long the benchmark took, in milliseconds
- `data`: Free field to pack any additional data

```
cargo run --release --bin merge -- standard data/web_returns 1 data/merge_results
```
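If results were saved to a Delta table (the last argument in the command above), one way to inspect them is via the Python bindings; a sketch, assuming the Python `deltalake` and `pandas` packages are installed and the path matches the command above:

```python
# Summarize saved benchmark results: mean duration per benchmark, per run.
from deltalake import DeltaTable

df = DeltaTable("data/merge_results").to_pandas()
print(df.groupby(["group_id", "name"])["duration_ms"].mean())
```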

### Compare
Compare the results of two different runs.
Provide the Delta table path and the `group_id` of each run to obtain the speedup for each test case.

```
cargo run --release --bin merge -- compare data/benchmarks/ 1698636172801 data/benchmarks/ 1699759539902
```
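The same speedup can also be computed directly from the results table with the Python bindings; a sketch, assuming results were written to `data/benchmarks/` and using the two `group_id` values from the command above:

```python
# Per-benchmark speedup between two result groups (baseline / candidate).
from deltalake import DeltaTable

df = DeltaTable("data/benchmarks/").to_pandas()
df["group_id"] = df["group_id"].astype(str)  # tolerate string or integer storage
base = df[df["group_id"] == "1698636172801"].groupby("name")["duration_ms"].mean()
new = df[df["group_id"] == "1699759539902"].groupby("name")["duration_ms"].mean()
print((base / new).rename("speedup"))
```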

### Show
Show all benchmark results from a Delta table.

```
cargo run --release --bin merge -- show data/benchmark
```
