Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implementation for replaceWhere #1996

Merged
merged 5 commits into from
Jan 31, 2024
Merged

Conversation

r3stl355
Copy link
Contributor

@r3stl355 r3stl355 commented Dec 28, 2023

Description

First/naive implementation of replaceWhere for write. Code compiles and there is a test to verify the outcome. I would appreciate any feedback on improving the structure/implementation. For example, I copied the part of code from delete operation because there is no way to call that code in delete directly from write - should I look into extracting that code from delete to somewhere central?

Seems to also works with partitions columns.

Related Issue(s)

#1957

Documentation

Added a section in docs

@github-actions github-actions bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate crate/core labels Dec 28, 2023
Copy link

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@r3stl355 r3stl355 changed the title WIP: draft implementation for replaceWhere feat: draft implementation for replaceWhere Dec 28, 2023
@r3stl355 r3stl355 changed the title feat: draft implementation for replaceWhere feat: implementation for replaceWhere Dec 28, 2023
@r3stl355
Copy link
Contributor Author

r3stl355 commented Dec 28, 2023

It is passing the CI on my box and with added docs I'm removing the draft once CI is passing here (one of the Python tests is flaky, need to be fixed)

@r3stl355
Copy link
Contributor Author

looks like few extra checks were introduced recently so lint is picking up changes which are not mine, e.g. crates/deltalake-core/src/delta_datafusion/mod.rs:149:37

@ion-elgreco
Copy link
Collaborator

looks like few extra checks were introduced recently so lint is picking up changes which are not mine, e.g. crates/deltalake-core/src/delta_datafusion/mod.rs:149:37

Could you create a separate PR to address that so main is fixed?

@r3stl355
Copy link
Contributor Author

Could you create a separate PR to address that so main is fixed?

Ahh, fixed and pushed here before seeing your response. Can do another PR if they still break but leaving that for another day, done for today

@r3stl355
Copy link
Contributor Author

Looks like I am close to removing the draft, only need to fix this flakytests/test_writer.py::test_replace_where_overwrite - sometimes it fails, sometimes not even on my machine

@Blajda
Copy link
Collaborator

Blajda commented Dec 30, 2023

I copied the part of code from delete operation because there is no way to call that code in delete directly from write - should I look into extracting that code from delete to somewhere central?

I think this should be done. Delete is one of the things we should keep DRY which will help when deletion vectors are implemented. All operations will benefit instead of needing to update code in different location.

My 2 cents is that we shouldn't modify the writer to implement this operation but rather should have a brand new operation structured in a manner similar to merge, delete, and update.

Some builder interface like below

DeltaOps(table)
    .replace_where(predicate, source_data)
    .with_check_constraint(false)
    .await?

@r3stl355
Copy link
Contributor Author

r3stl355 commented Dec 30, 2023

My 2 cents is that we shouldn't modify the writer to implement this operation but rather should have a brand new operation structured in a manner similar to merge, delete, and update.

@Blajda I understand that you suggesting a standalone operator just for replaceWhere, correct? Which means:

  1. Removing a predicate from write, which is fine as it wasn't implemented anyway
  2. Exposing a function fromdelete which behaves similar to writer::write_execution_plan by returning a set of actions
  3. [MAYBE] Possibly exposing a version of write_execution_plan function that takes a predicate from write

If this is what you were suggesting I can give it a try

@Blajda
Copy link
Collaborator

Blajda commented Dec 30, 2023

@r3stl355 Yes I agree with point 1. For 2. one approach is to factor out a function that returns a tuple of files to be removed and an optional stream of record batches. I'm not sure on what 3. would provide.

@r3stl355
Copy link
Contributor Author

@r3stl355 Yes I agree with point 1. For 2. one approach is to factor out a function that returns a tuple of files to be removed and an optional stream of record batches. I'm not sure on what 3. would provide.

The 3. would allow to check if the new data conforms to the replaceWhere predicate. Current implementation of the publicwrite_execution_plan does not do that and if I don't exposed a public function checking that the new data conforms to a predicate (e.g. write_execution_plan_with_predicate) then I would have to implement it within the replaceWhere op effectively copying the code

@ion-elgreco
Copy link
Collaborator

I'm not sure if I agree with taking this out of write. It's still a write operation

@r3stl355 r3stl355 marked this pull request as ready for review January 2, 2024 21:48
@MrPowers
Copy link
Contributor

MrPowers commented Jan 2, 2024

What's the user-facing public interface for Python?

@r3stl355
Copy link
Contributor Author

r3stl355 commented Jan 2, 2024

@MrPowers Good old write_deltalake with predicate parameter. It was already there before this PR but was not implemented. All should be working with this PR

I also added docs and examples in docs for both Python and Rust (https://github.com/delta-io/delta-rs/pull/1996/files#diff-681ad64170174eef8ddc2c987ca2e99af32d90a79720b50b3f575a4a00ec4a50R55)

@MrPowers
Copy link
Contributor

MrPowers commented Jan 2, 2024

@r3stl355 - ah, got it. Here's the delta-spark syntax:

(
    df.write.format("delta")
    .option("replaceWhere", "number > 2")
    .mode("overwrite")
    .save("tmp/my_data")
)

Looks like the proposed syntax is as follows:

write_deltalake(
    table_path,
    data,
    mode="overwrite",
    predicate="id = '1'",
    engine="rust",
)

I almost feel like the mode should be replace_where. Thoughts?

@ion-elgreco
Copy link
Collaborator

@MrPowers I don't think it makes sense to introduce another mode because replace where without a predicate wouldnt be possible

Copy link
Collaborator

@ion-elgreco ion-elgreco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add one more test cases to match the pyarrow partition filter overwrite test cases? :)

A test case where the data is not meeting the constraint. Also I wonder if you could capture that constraint error and give a better error msg

crates/deltalake-core/src/operations/write.rs Outdated Show resolved Hide resolved
crates/deltalake-core/src/operations/write.rs Outdated Show resolved Hide resolved
docs/usage/writing-delta-tables.md Outdated Show resolved Hide resolved
Copy link
Collaborator

@roeap roeap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry it took so long, but finally got around to do some reviewing.

I am a bit less happy with my code now after snapshot was made an Option

This is actually also one of my biggest concerns annoyances right now, and kind of boild down to what purpose each of these structs serve in our codebase. Previously, DeltaTableState took no opinion on if Metadata and Protocol exist. However if we cannot get these, there really is not anything meaningful we can do with a delta table. However DeltaTable has some operations that work on empty locations etc. As a result, we were raising errors all throughout our operations if the back then optionsal metadata / protocol actions were not available. While this requires some more refactoring, all operations (except the ones that can cretate a table) should always require a valid snapshot.

btw. DeltaTableState is by now also just a thin wrapper around a snapshot, which I also hope to remove soon :).

crates/deltalake-core/src/operations/write.rs Outdated Show resolved Hide resolved
crates/deltalake-core/src/operations/write.rs Outdated Show resolved Hide resolved
crates/deltalake-core/src/operations/write.rs Outdated Show resolved Hide resolved
crates/deltalake-core/src/operations/write.rs Outdated Show resolved Hide resolved
crates/deltalake-core/src/operations/write.rs Outdated Show resolved Hide resolved
crates/deltalake-core/src/operations/write.rs Outdated Show resolved Hide resolved
crates/deltalake-core/src/operations/write.rs Outdated Show resolved Hide resolved
@r3stl355
Copy link
Contributor Author

Thank you for the review @roeap

roeap
roeap previously approved these changes Jan 28, 2024
Copy link
Collaborator

@roeap roeap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your patience with this PR @r3stl355, and great work!

I'll leave it open for a bit, since @MrPowers had some questions around the APIs, but form my end passing the predicate to write_deltalake seems reasonable. While this function already takes a lot of parameters, I feel this PR does not make it worse :).

@ion-elgreco
Copy link
Collaborator

@roeap once we drop the Pyarrow writer the write_deltalake will be much leaner

@r3stl355
Copy link
Contributor Author

this PR does not make it worse :).

minimum standard achieved 😁 . I'll create couple of related issues

@r3stl355
Copy link
Contributor Author

@MrPowers do you still have those questions around API? If not then this should be ready to merge (and I squashed all my commits into one as agreed)

@roeap
Copy link
Collaborator

roeap commented Jan 29, 2024

@r3stl355 - it seems we have seen a bigger rename of files before I could get that in :( - sorry oyu have to resolve once more.

Since we haven't heard back, I would just merge once we are green again ...

Signed-off-by: Nikolay Ulmasov <[email protected]>
@r3stl355
Copy link
Contributor Author

Thanks @roeap , should be good now

@roeap roeap enabled auto-merge (squash) January 30, 2024 09:36
ion-elgreco
ion-elgreco previously approved these changes Jan 30, 2024
@r3stl355
Copy link
Contributor Author

It's been back and forth few times, something that I understand and don't mind but unfortunately I don't have time to handle this in next few weeks so parking it

@r3stl355 r3stl355 disabled auto-merge January 30, 2024 22:04
@r3stl355 r3stl355 marked this pull request as draft January 30, 2024 22:06
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jan 31, 2024
@ion-elgreco ion-elgreco marked this pull request as ready for review January 31, 2024 07:51
Copy link
Collaborator

@ion-elgreco ion-elgreco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Niko! Great feature :) We are one step closer to deprecating the PyArrow writer now!

@ion-elgreco ion-elgreco enabled auto-merge (squash) January 31, 2024 07:52
@ion-elgreco ion-elgreco merged commit b107407 into delta-io:main Jan 31, 2024
23 checks passed
RobinLin666 pushed a commit to RobinLin666/delta-rs that referenced this pull request Feb 2, 2024
# Description
First/naive implementation of `replaceWhere` for `write`. Code compiles
and there is a test to verify the outcome. I would appreciate any
feedback on improving the structure/implementation. For example, I
copied the part of code from `delete` operation because there is no way
to call that code in `delete` directly from `write` - should I look into
extracting that code from `delete` to somewhere central?

Seems to also works with partitions columns.

# Related Issue(s)
delta-io#1957

# Documentation
Added a section in docs

---------

Signed-off-by: Nikolay Ulmasov <[email protected]>
Co-authored-by: Ion Koutsouris <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package binding/rust Issues for the Rust crate documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants