CONTRIBUTING.md
@@ -11,3 +11,78 @@ If you want to contribute something more substantial, see our "Projects seeking
## Claiming an issue

If you want to claim an issue to work on, you can write the word `take` as a comment in it and you will be automatically assigned.

## Quick start

- Install Rust, e.g. as described [here](https://doc.rust-lang.org/cargo/getting-started/installation.html)
- Have a compatible Python version installed (check `python/pyproject.toml` for current requirement)
- Create a Python virtual environment (required for development builds), e.g. as described [here](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/)
- Build the project for development (this requires an active virtual environment and will also install `deltalake` in that virtual environment)
```
cd python
make develop
```

- Run some Python code, e.g. to run a specific test
*This serves your local contents of the docs via a web browser, which is handy for checking what they look like if you are making changes to docs or docstrings*
```
(cd python; make develop)
pip install -r docs/requirements.txt
mkdocs serve
```

## To make a pull request (PR)
- Make sure all the following steps run/pass locally before submitting a PR
```
cargo fmt -- --check
cd python
make check-rust
make check-python
make develop
make unit-test
make build-docs
```

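If you prefer to run the checklist in one go, the steps above can be wrapped in a small helper. This is only a sketch: the `STEPS` table and the `run_steps` function are hypothetical names, not part of the repository, and it assumes you start from the repository root with your virtual environment active.

```python
import subprocess
from pathlib import Path

# Each step is (working directory relative to the repository root, command).
# The commands mirror the checklist above; the directory layout is an assumption.
STEPS = [
    (".", ["cargo", "fmt", "--", "--check"]),
    ("python", ["make", "check-rust"]),
    ("python", ["make", "check-python"]),
    ("python", ["make", "develop"]),
    ("python", ["make", "unit-test"]),
    ("python", ["make", "build-docs"]),
]


def run_steps(steps, repo_root="."):
    """Run each step in order and stop at the first failure.

    Returns the failing command, or None if every step passed.
    """
    root = Path(repo_root)
    for workdir, command in steps:
        print("running:", " ".join(command), f"(in {workdir})")
        result = subprocess.run(command, cwd=root / workdir)
        if result.returncode != 0:
            return command
    return None
```

Calling `run_steps(STEPS)` from the repository root returns `None` when everything passes, otherwise the command that failed.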
## Developing in VSCode

*These are just some basic steps/components to get you started; there are many other very useful extensions for VSCode*

- For a better Rust development experience, install the [Rust extension](https://marketplace.visualstudio.com/items?itemName=1YiB.rust-bundle)
- For debugging Rust code, install [CodeLLDB](https://marketplace.visualstudio.com/items?itemName=vadimcn.vscode-lldb). The extension should even create Debug launch configurations for the project if you allow it, which is an easy way to get started. Just set a breakpoint and run the relevant configuration.
- For debugging from Python into Rust, follow this procedure:
1. Add this to `.vscode/launch.json`
```
{
    "type": "lldb",
    "request": "attach",
    "name": "LLDB Attach to Python",
    "program": "${command:python.interpreterPath}",
    "pid": "${command:pickMyProcess}",
    "args": [],
    "stopOnEntry": false,
    "environment": [],
    "externalConsole": true,
    "MIMode": "lldb",
    "cwd": "${workspaceFolder}"
}
```
2. Add a `breakpoint()` statement somewhere in your Python code (in the main function or at any point in the Python code you know will be executed when you run it)
3. Add a breakpoint in the Rust code in the VSCode editor where you want to drop into the debugger
4. Run the relevant Python code in your terminal; execution should drop into the Python debugger, showing the `PDB` prompt
5. Run the following in that prompt to get the Python process ID: `import os; os.getpid()`
6. Run `LLDB Attach to Python` from the `Run and Debug` panel of VSCode. This will prompt you for a process ID to attach to; enter the Python process ID obtained earlier (it will also be in the dropdown, but that dropdown will have many process IDs)
7. LLDB may take a couple of seconds to attach to the process
8. When the debugger is attached to the process (you will notice the debugger panels get filled with extra info), enter `c`+Enter in the `PDB` prompt in your terminal. Execution should continue until the breakpoint in the Rust code is hit. From this point on it's a standard debugging process.
The merge benchmarks are similar to the ones used by [Delta Spark](https://github.com/delta-io/delta/pull/1835).
## Dataset

Databricks maintains a public S3 bucket of the TPC-DS dataset at various scale factors; requesters must pay to download this dataset. Below is an example of how to list the 1 GB scale factor

You can generate the TPC-DS dataset yourself by downloading and compiling [the generator](https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp).
You may need to update the CFLAGS to include `-fcommon` to compile on newer versions of GCC.

## Commands
These commands can be executed from the root of the benchmark crate. Some commands depend on the TPC-DS dataset being present.

### Convert
Converts a TPC-DS web_returns CSV into a Delta table.
Assumes the dataset is pipe-delimited and records do not have a trailing delimiter.

```
cargo run --release --bin merge -- convert data/tpcds/web_returns.dat data/web_returns
```

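To check that a file matches the expected input format before converting it, something like the following can help. This is only a sketch: `read_pipe_delimited` is a hypothetical helper and not how the benchmark itself reads the file, and treating quote characters as plain data is an assumption.

```python
import csv
import io


def read_pipe_delimited(lines, expected_columns=None):
    """Split pipe-delimited records into lists of fields.

    A trailing delimiter shows up as one extra empty field, so passing
    expected_columns makes such records fail loudly.
    """
    rows = []
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    for number, row in enumerate(reader, start=1):
        if expected_columns is not None and len(row) != expected_columns:
            raise ValueError(
                f"line {number}: expected {expected_columns} fields, got {len(row)}"
            )
        rows.append(row)
    return rows


# Made-up sample records, not real TPC-DS data.
sample = io.StringIO("1|2|3.50\n4||6.00\n")
rows = read_pipe_delimited(sample, expected_columns=3)
```

Pass an open file object instead of `io.StringIO` to check a real file.
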
### Standard
Execute the standard merge bench suite.
Results can be saved to a Delta table for further analysis.
This table has the following schema:

- group_id: Used to group all tests that executed as a part of this call. Default value is the timestamp of execution
- name: The benchmark name that was executed
- sample: The iteration number for a given benchmark name
- duration_ms: How long the benchmark took in ms
- data: Free field to pack any additional data

```
cargo run --release --bin merge -- standard data/web_returns 1 data/merge_results
```

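For further analysis, the schema above could be mirrored as a small record type once the rows are loaded into Python. This is only a sketch: the field types, the `BenchResult` class, and `mean_duration_by_name` are assumptions for illustration, and loading the rows from the Delta table is left out.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class BenchResult:
    group_id: str      # groups all tests executed in one call
    name: str          # benchmark name
    sample: int        # iteration number for this benchmark name
    duration_ms: int   # how long the benchmark took, in ms
    data: str = ""     # free field for additional data


def mean_duration_by_name(results, group_id):
    """Average duration_ms per benchmark name within one group_id."""
    durations = {}
    for result in results:
        if result.group_id == group_id:
            durations.setdefault(result.name, []).append(result.duration_ms)
    return {name: mean(values) for name, values in durations.items()}
```
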
### Compare
Compare the results of two different runs.
Provide the Delta table paths and the `group_id` of each run to obtain the speedup for each test case
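
The idea of a per-test speedup can be sketched as follows. This is only an illustration: the `speedups` function is hypothetical, and defining speedup as the "before" duration divided by the "after" duration is an assumption, so the tool's own calculation may differ.

```python
def speedups(before_ms, after_ms):
    """Return before/after duration ratios for test cases present in both runs.

    Both arguments map a benchmark name to its mean duration in ms.
    A value above 1.0 means the "after" run was faster.
    """
    return {
        name: before_ms[name] / after_ms[name]
        for name in before_ms
        if name in after_ms and after_ms[name] > 0
    }
```

Test cases that appear in only one of the two runs are skipped, since no ratio can be computed for them.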