Remove expression.rs file and use Arrow DataFusion Python version #1139

Draft: wants to merge 63 commits into base: main
fa18499
Merge remote-tracking branch 'upstream/main'
jdye64 Apr 13, 2023
d6b470c
Merge remote-tracking branch 'upstream/main'
jdye64 May 2, 2023
856f7c0
Merge remote-tracking branch 'upstream/main'
jdye64 May 8, 2023
0e8e96c
Remove local expression.rs file and refactor codebase to use datafusi…
jdye64 May 8, 2023
f8b52f8
Merge branch 'main' into drop_expr
jdye64 May 9, 2023
2cff5e0
Fix issues from previous merge introduced
jdye64 May 9, 2023
29c48b6
Merge branch 'main' into drop_expr
jdye64 May 9, 2023
065471e
Uncomment section for time types
jdye64 May 9, 2023
7a9906f
Update to use Option<Arc<str>> and us unsafe Arc::from_raw() method
jdye64 May 9, 2023
3a7754d
Use into() instead of unsafe Arc::from_raw()
jdye64 May 9, 2023
435db42
partial refactoring checkpoint
jdye64 May 9, 2023
daa0dcd
bump arrow datafusion python version
jdye64 May 10, 2023
adaf7c1
merge upstream/main
jdye64 May 10, 2023
98938d9
Checkpoint, aggregations working
jdye64 May 11, 2023
48fbc58
Checkpoint, show and sort working
jdye64 May 11, 2023
34ee9e3
Merge branch 'main' into drop_expr
jdye64 May 11, 2023
1464bf6
Checkpoint, distributeby working and partial join
jdye64 May 11, 2023
1e94c01
Checkpoint, window logic and pytest passing
jdye64 May 11, 2023
25bc867
checkpoint, test_schema.py tests all passing
jdye64 May 11, 2023
24350b5
Merge remote-tracking branch 'upstream/main' into drop_expr
jdye64 May 25, 2023
4b86547
Merge remote-tracking branch 'upstream/main'
jdye64 May 30, 2023
28fce59
Merge remote-tracking branch 'upstream/main'
jdye64 Jun 14, 2023
6637b84
Bump ADP -> 26.0.0
jdye64 Jun 14, 2023
c59cdbd
warn on optimization failure instead of erroring and exiting
jdye64 Jun 14, 2023
79a1f7c
Merge branch 'main' into adp_26
ayushdg Jun 20, 2023
ba585a5
Merge branch 'main' into adp_26
ayushdg Jun 26, 2023
23af11d
Merge remote-tracking branch 'origin/main' into adp_26
charlesbluca Jun 30, 2023
5c02c5a
Resolve initial build errors
charlesbluca Jun 30, 2023
7c36bf5
Merge remote-tracking branch 'origin/main' into adp_26
charlesbluca Jul 7, 2023
515dae6
Switch to crates release, add zlib to host/build deps
charlesbluca Jul 7, 2023
ef399e8
Add zlib to aarch build deps
charlesbluca Jul 7, 2023
8fbd1ff
Merge remote-tracking branch 'origin/main' into adp_26
charlesbluca Jul 7, 2023
bca9911
Merge branch 'main' into adp_26
jdye64 Jul 10, 2023
6858578
Bump to ADP 27 and introduce support for wildcard expressions, a wild…
jdye64 Jul 12, 2023
24e0f90
remove bit of logic that is no longer needed to manually check the wi…
jdye64 Jul 12, 2023
d776229
experiment with removing zlib, hoping that fixes os x build
jdye64 Jul 12, 2023
99ec801
Change expected_df result to 1.5 from 1. 3/2 is in fact 1.5 and not 1
jdye64 Jul 12, 2023
8997f7f
Fix cargo test
jdye64 Jul 12, 2023
45b5f42
merge with upstream ADP 27 bump PR
jdye64 Jul 13, 2023
ec1d2bf
add .cargo/config.toml in hopes of fixing linker build issues on osx
jdye64 Jul 13, 2023
379a978
add .cargo/config.toml in hopes of fixing linker build issues on osx
jdye64 Jul 13, 2023
e030bef
Remove extra config.toml
charlesbluca Jul 13, 2023
b2e85df
Try overriding runner-installed toolchain
charlesbluca Jul 14, 2023
d01088d
Revert "Try overriding runner-installed toolchain"
charlesbluca Jul 14, 2023
ca70f0f
Initial migration to maturin build system
charlesbluca Jul 17, 2023
d900f0e
Make some modifications to Rust package name
charlesbluca Jul 17, 2023
7d1be92
Adjust native library name from _.internal to dask_planner
jdye64 Jul 17, 2023
83fb5c3
Resolve initial conda build issues
charlesbluca Jul 17, 2023
c7bbbd7
Replace setuptools-rust with maturin in CI
charlesbluca Jul 17, 2023
6dc6347
Constrain maturin, remove setuptools-rust from CI envs
charlesbluca Jul 17, 2023
6dcf5e0
Update docs and Rust CI
charlesbluca Jul 17, 2023
b7c02c9
Remove more dask_planner appearances
charlesbluca Jul 17, 2023
a3e1a68
Bump pyarrow min version to resolve 3.8 conflicts
charlesbluca Jul 17, 2023
3ff8240
test commit seeing how CI will respond without cmd_loop import
jdye64 Jul 17, 2023
ce56a08
Merge branch 'adp_26' of github.com:jdye64/dask-sql into adp_26
charlesbluca Jul 17, 2023
ae7a3d6
Rename module to _datafusion_lib
charlesbluca Jul 17, 2023
0c2908c
Switch to maturin develop for CI installs
charlesbluca Jul 17, 2023
849dc42
Fix failing cargo tests, changed output, from datafusion version bump
jdye64 Jul 18, 2023
1f73b56
Fix cargo test syntax issue
jdye64 Jul 18, 2023
79b6eac
Fix failing Rust tests
Jul 18, 2023
405470f
Remove linux config.toml options
jdye64 Jul 18, 2023
ca5c093
Merge with upstream/adp_26 branch
jdye64 Jul 19, 2023
bdfad04
Checkpoint commit
jdye64 Jul 20, 2023
File renamed without changes.
5 changes: 4 additions & 1 deletion .github/CODEOWNERS
@@ -2,4 +2,7 @@
* @ayushdg @charlesbluca @galipremsagar

# rust codeowners
dask_planner/ @ayushdg @charlesbluca @galipremsagar @jdye64
.cargo/ @ayushdg @charlesbluca @galipremsagar @jdye64
src/ @ayushdg @charlesbluca @galipremsagar @jdye64
Cargo.toml @ayushdg @charlesbluca @galipremsagar @jdye64
Cargo.lock @ayushdg @charlesbluca @galipremsagar @jdye64
7 changes: 3 additions & 4 deletions .github/workflows/conda.yml
@@ -6,10 +6,9 @@ on:
pull_request:
paths:
- setup.py
- dask_planner/Cargo.toml
- dask_planner/Cargo.lock
- dask_planner/pyproject.toml
- dask_planner/rust-toolchain.toml
- Cargo.toml
- Cargo.lock
- pyproject.toml
- continuous_integration/recipe/**
- .github/workflows/conda.yml
schedule:
8 changes: 4 additions & 4 deletions .github/workflows/release.yml
@@ -60,21 +60,21 @@ jobs:
CARGO_NET_GIT_FETCH_WITH_CLI="true"
PATH="$HOME/.cargo/bin:$HOME/.local/bin:$PATH"
CIBW_ENVIRONMENT_WINDOWS: 'PATH="$UserProfile\.cargo\bin;$PATH"'
CIBW_BEFORE_BUILD: 'pip install -U setuptools-rust'
CIBW_BEFORE_BUILD: 'pip install -U "maturin>=0.15,<0.16"'
CIBW_BEFORE_BUILD_LINUX: >
ARCH=$([ $(uname -m) == x86_64 ] && echo x86_64 || echo aarch_64) &&
DOWNLOAD_URL=$(curl --retry 6 --retry-delay 10 -s https://api.github.com/repos/protocolbuffers/protobuf/releases/latest | grep -o '"browser_download_url": "[^"]*' | cut -d'"' -f4 | grep "\linux-${ARCH}.zip$") &&
curl --retry 6 --retry-delay 10 -LO $DOWNLOAD_URL &&
unzip protoc-*-linux-$ARCH.zip -d $HOME/.local &&
protoc --version &&
pip install -U setuptools-rust &&
pip install -U "maturin>=0.15,<0.16" &&
pip list &&
curl --retry 6 --retry-delay 10 https://sh.rustup.rs -sSf | sh -s -- --default-toolchain=stable --profile=minimal -y &&
rustup show
with:
package-dir: .
output-dir: dist
config-file: "dask_planner/pyproject.toml"
config-file: "pyproject.toml"
- name: Set up Python
uses: conda-incubator/[email protected]
with:
@@ -127,7 +127,7 @@ jobs:
channel-priority: strict
- name: Build source distribution
run: |
mamba install setuptools-rust twine
mamba install "maturin>=0.15,<0.16" twine

python setup.py sdist
- name: Check dist files
5 changes: 0 additions & 5 deletions .github/workflows/rust.yml
@@ -51,7 +51,6 @@ jobs:
- name: Optionally update upstream dependencies
if: needs.detect-ci-trigger.outputs.triggered == 'true'
run: |
cd dask_planner
bash update-dependencies.sh
- name: Install Protoc
uses: arduino/setup-protoc@v1
@@ -60,11 +59,9 @@
repo-token: ${{ secrets.GITHUB_TOKEN }}
- name: Check workspace in debug mode
run: |
cd dask_planner
cargo check
- name: Check workspace in release mode
run: |
cd dask_planner
cargo check --release

# test the crate
@@ -84,7 +81,6 @@
- name: Optionally update upstream dependencies
if: needs.detect-ci-trigger.outputs.triggered == 'true'
run: |
cd dask_planner
bash update-dependencies.sh
- name: Install Protoc
uses: arduino/setup-protoc@v1
@@ -93,5 +89,4 @@
repo-token: ${{ secrets.GITHUB_TOKEN }}
- name: Run tests
run: |
cd dask_planner
cargo test
5 changes: 1 addition & 4 deletions .github/workflows/test-upstream.yml
@@ -68,11 +68,10 @@ jobs:
- name: Optionally update upstream cargo dependencies
if: env.which_upstream == 'DataFusion'
run: |
cd dask_planner
bash update-dependencies.sh
- name: Build the Rust DataFusion bindings
run: |
python setup.py build install
maturin develop
- name: Install hive testing dependencies
if: matrix.os == 'ubuntu-latest'
run: |
@@ -122,11 +121,9 @@
env:
UPDATE_ALL_CARGO_DEPS: false
run: |
cd dask_planner
bash update-dependencies.sh
- name: Install dependencies and nothing else
run: |
mamba install setuptools-rust
pip install -e . -vv

which python
3 changes: 1 addition & 2 deletions .github/workflows/test.yml
@@ -72,7 +72,7 @@ jobs:
shared-key: test
- name: Build the Rust DataFusion bindings
run: |
python setup.py build install
maturin develop
- name: Install hive testing dependencies
if: matrix.os == 'ubuntu-latest'
run: |
@@ -116,7 +116,6 @@
repo-token: ${{ secrets.GITHUB_TOKEN }}
- name: Install dependencies and nothing else
run: |
mamba install "setuptools-rust>=1.5.2"
pip install -e . -vv

which python
10 changes: 1 addition & 9 deletions .gitignore
@@ -46,23 +46,15 @@ venv
# IDE
.idea
.vscode
planner/.classpath
planner/.project
planner/.settings/
planner/.idea
planner/*.iml
*.swp

# project specific
planner/dependency-reduced-pom.xml
planner/target/
dask_sql/jar
.next/
dask-worker-space/
node_modules/
docs/source/_build/
tests/unit/queries
tests/unit/data
target/*

# Ignore development specific local testing files
dev_tests
6 changes: 3 additions & 3 deletions .pre-commit-config.yaml
@@ -20,9 +20,9 @@ repos:
rev: v1.0
hooks:
- id: cargo-check
args: ['--manifest-path', './dask_planner/Cargo.toml', '--verbose', '--']
args: ['--manifest-path', './Cargo.toml', '--verbose', '--']
- id: clippy
args: ['--manifest-path', './dask_planner/Cargo.toml', '--verbose', '--', '-D', 'warnings']
args: ['--manifest-path', './Cargo.toml', '--verbose', '--', '-D', 'warnings']
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.2.0
hooks:
@@ -39,4 +39,4 @@ repos:
entry: cargo +nightly fmt
language: system
types: [rust]
args: ['--manifest-path', './dask_planner/Cargo.toml', '--verbose', '--']
args: ['--manifest-path', './Cargo.toml', '--verbose', '--']
20 changes: 10 additions & 10 deletions CONTRIBUTING.md
@@ -49,15 +49,15 @@ Note that while `setuptools-rust` is used by CI and should be used during your d
Building Dask-SQL is straightforward with Python. To build, run ```python setup.py install```. This builds both the Rust and Python codebases and installs them into your locally activated conda environment. While not required, if you have updated Rust dependencies you might prefer a clean build. To clean your setup, run ```python setup.py clean``` and then run ```python setup.py install```

#### DataFusion Modules
DataFusion is broken down into a few modules. We consume those modules in our [Cargo.toml](dask_planner/Cargo.toml). The modules that we use currently are
DataFusion is broken down into a few modules. We consume those modules in our [Cargo.toml](Cargo.toml). The modules that we use currently are

- `datafusion-common` - Datastructures and core logic
- `datafusion-expr` - Expression based logic and operators
- `datafusion-sql` - SQL components such as parsing and planning
- `datafusion-optimizer` - Optimization logic and datastructures for modifying current plans into more efficient ones.

#### Retrieving Upstream Dependencies
During development you might find yourself needing some upstream DataFusion changes not present in the project's current version. Luckily this can easily be achieved by updating [Cargo.toml](dask_planner/Cargo.toml) and changing the `rev` to the SHA of the version you need. Note that the same SHA should be used for all DataFusion modules.
During development you might find yourself needing some upstream DataFusion changes not present in the project's current version. Luckily this can easily be achieved by updating [Cargo.toml](Cargo.toml) and changing the `rev` to the SHA of the version you need. Note that the same SHA should be used for all DataFusion modules.

After updating the `Cargo.toml` file, the codebase can be rebuilt to reflect those changes by running `python setup.py install`
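For example, a `rev` pin might look like the following sketch. The module list matches the one above, but the repository URL and SHA shown here are illustrative placeholders, not values taken from this project's actual `Cargo.toml`:

```toml
# All four DataFusion modules pinned to the same (placeholder) commit SHA
[dependencies]
datafusion-common    = { git = "https://github.com/apache/arrow-datafusion", rev = "abc1234" }
datafusion-expr      = { git = "https://github.com/apache/arrow-datafusion", rev = "abc1234" }
datafusion-sql       = { git = "https://github.com/apache/arrow-datafusion", rev = "abc1234" }
datafusion-optimizer = { git = "https://github.com/apache/arrow-datafusion", rev = "abc1234" }
```

Keeping every `rev` identical avoids mixing incompatible DataFusion module versions in a single build.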

@@ -72,40 +72,40 @@ Sometimes when building against the latest GitHub commits for DataFusion you may
### Datastructures
While working in the Rust codebase there are a few datastructures that you should make yourself familiar with. This section does not aim to verbosely list out all of the datastructures within the project, but rather just the key datastructures that you are likely to encounter while working on almost any feature/issue. The aim is to give you a better overview of the codebase without having to manually dig through all the source code.

- [`PyLogicalPlan`](dask_planner/src/sql/logical.rs) -> [DataFusion LogicalPlan](https://docs.rs/datafusion/latest/datafusion/logical_plan/enum.LogicalPlan.html)
- [`PyLogicalPlan`](src/sql/logical.rs) -> [DataFusion LogicalPlan](https://docs.rs/datafusion/latest/datafusion/logical_plan/enum.LogicalPlan.html)
- Often encountered in Python code with variable name `rel`
- Python serializable umbrella representation of the entire LogicalPlan that was generated by DataFusion
- Provides access to `DaskTable` instances and type information for each table
- Access to individual nodes in the logical plan tree. Ex: `TableScan`
- [`DaskSQLContext`](dask_planner/src/sql.rs)
- [`DaskSQLContext`](src/sql.rs)
- Analogous to Python `Context`
- Contains metadata about the tables, schemas, functions, operators, and configurations that are present within the current execution context
- When adding custom functions/UDFs this is the location that you would register them
- Entry point for parsing SQL strings to sql node trees. This is the location Python will begin its interactions with Rust
- [`PyExpr`](dask_planner/src/expression.rs) -> [DataFusion Expr](https://docs.rs/datafusion/latest/datafusion/prelude/enum.Expr.html)
- [`PyExpr`](src/expression.rs) -> [DataFusion Expr](https://docs.rs/datafusion/latest/datafusion/prelude/enum.Expr.html)
- Arguably where most of your time will be spent
- Represents a single node in the SQL tree. Ex: `avg(age)` from `SELECT avg(age) FROM people`
- Is associated with a single `RexType`
- Can contain literal values or represent function calls, `avg()` for example
- The expression's "index" in the tree can be retrieved by calling `PyExpr.index()` on an instance. This is useful when mapping frontend column names in Dask code to backend Dataframe columns
- Certain `PyExpr`s contain operands. Ex: `2 + 2` consists of three `PyExpr` nodes: 1) a literal `PyExpr` instance with value 2, 2) another literal `PyExpr` instance with value 2, and 3) a `+` `PyExpr` representing the addition of the two literals.
- [`DaskSqlOptimizer`](dask_planner/src/sql/optimizer.rs)
- [`DaskSqlOptimizer`](src/sql/optimizer.rs)
- Registering location for all Dask-SQL specific logical plan optimizations
- Optimizations, whether custom-written or reused from another source such as DataFusion, are registered here in the order they should be executed
- Represents functions that modify/convert an original `PyLogicalPlan` into another `PyLogicalPlan` that would be more efficient when running in the underlying Dask framework
- [`RelDataType`](dask_planner/src/sql/types/rel_data_type.rs)
- [`RelDataType`](src/sql/types/rel_data_type.rs)
- The name is not ideal; it was chosen to match existing Calcite logic
- Represents a "row" in a table
- Contains a list of "columns" that are present in that row
- [RelDataTypeField](dask_planner/src/sql/types/rel_data_type_field.rs)
- [RelDataTypeField](dask_planner/src/sql/types/rel_data_type_field.rs)
- [RelDataTypeField](src/sql/types/rel_data_type_field.rs)
- [RelDataTypeField](src/sql/types/rel_data_type_field.rs)
- Represents an individual column in a table
- Contains:
- `qualifier` - schema the field belongs to
- `name` - name of the column/field
- `data_type` - `DaskTypeMap` instance containing information about the SQL type and underlying Arrow DataType
- `index` - location of the field in the LogicalPlan
- [DaskTypeMap](dask_planner/src/sql/types.rs)
- [DaskTypeMap](src/sql/types.rs)
- Maps a conventional SQL type to an underlying Arrow DataType
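The operand/tree structure described for `PyExpr` above can be sketched with a toy Python stand-in (the class and field names here are hypothetical illustrations, not the real `PyExpr` API):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Expr:
    """Toy stand-in for PyExpr: a single node in an expression tree."""
    rex_type: str                      # e.g. "Literal" or "Call"
    value: Optional[int] = None        # literal value, if any
    op: Optional[str] = None           # operator name for call nodes, e.g. "+"
    operands: List["Expr"] = field(default_factory=list)

# `2 + 2` as described above: two literal leaves under one `+` call node
two_plus_two = Expr(
    rex_type="Call",
    op="+",
    operands=[Expr("Literal", value=2), Expr("Literal", value=2)],
)

def evaluate(e: Expr) -> int:
    # Walk the tree: literals return their value, call nodes combine operands
    if e.rex_type == "Literal":
        return e.value
    if e.op == "+":
        return sum(evaluate(o) for o in e.operands)
    raise ValueError(f"unsupported node: {e.op}")

print(evaluate(two_plus_two))  # 4
```

Evaluating the toy tree walks two literal leaves under a single `+` call node, mirroring the three-node breakdown of `2 + 2` described above.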

