Change Data Feed in Delta #2095

Closed
hntd187 opened this issue Jan 21, 2024 · 2 comments · Fixed by #2486
Assignees
Labels
binding/python Issues for the Python package binding/rust Issues for the Rust crate enhancement New feature or request

Comments

@hntd187
Collaborator

hntd187 commented Jan 21, 2024

Description

Finally making an issue about this for discussion. I already have a draft PR for reading CDF in #2048, which still needs work on its reader side, but here I mostly want to discuss the writer side, since it will take much more refactoring. I also want to ask how we want to approach this. The reader can be mostly encapsulated as its own thing, whereas the writer will touch all write operations in Delta Lake. Do we want to roll these into the same PR, or make subsequent PRs? I think subsequent PRs would be better, but that's just my opinion.

So for the reader, its first stages are in flight with #2048; I just need to figure out how I want to validate its correctness. @MrPowers maybe you can help me figure that out?

For writers, there is a bit more to do. CDC actions have to be added to the commit log along with the accompanying add/remove actions, generating additional change data files in the _change_data directory of a Delta table. Currently we encapsulate the builders of these operations in such a way that the builder builds and commits all the actions itself, without giving features any ability to influence the actions of the commit before it's written. So, in order to make CDF work, we would be required to update every operation and add its CDF-aware functionality to it. I'd argue this would only exacerbate the current issue of builders owning the entire life cycle of the operation, and we should not do this.
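To make the writer-side requirement concrete, here is a minimal sketch of the actions an update that rewrites one data file might place in a single commit when CDF is enabled. The `Action` enum, `update_commit_actions`, and the file names are hypothetical and much simpler than the real delta-rs types; only the `_change_data` directory and the add/remove/cdc action kinds come from the discussion above.

```rust
// Hypothetical, simplified action model for illustration; not the delta-rs types.
#[derive(Debug, Clone, PartialEq)]
enum Action {
    Add { path: String, data_change: bool },
    Remove { path: String, data_change: bool },
    // Points at a file of row-level changes stored under `_change_data/`.
    Cdc { path: String },
}

/// Actions an update that rewrites one data file might emit with CDF enabled.
/// All three land in the same commit.
fn update_commit_actions(old_file: &str, new_file: &str, cdc_file: &str) -> Vec<Action> {
    vec![
        Action::Remove { path: old_file.to_string(), data_change: true },
        Action::Add { path: new_file.to_string(), data_change: true },
        Action::Cdc { path: format!("_change_data/{cdc_file}") },
    ]
}

fn main() {
    let actions =
        update_commit_actions("part-0000.parquet", "part-0001.parquet", "cdc-0000.parquet");
    for action in &actions {
        println!("{action:?}");
    }
}
```

The point of the sketch is that the cdc action sits next to the add/remove pair it describes, which is why an operation that commits its own actions leaves no room for CDF to add its part.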

I would instead suggest that we refactor the builders to only create and return a list of actions to commit and a snapshot to commit to, and then let a subsequent (maybe global) part of the code do the actual commit. This way you can compose operations in a more maintainable way. I spoke with @r3stl355 about this as well, because some of the work he did for replaceWhere would have been another good candidate to benefit from this type of rethinking. So for writers I am proposing we take this approach:

  1. Refactor builder internals to build sets of actions to commit, without performing the actual commit.
  2. Add a central place that actually performs the commit / write / update of the table.
  3. Implement the CDF compositions on top of the new operations.
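The three steps above could be sketched roughly as follows. Every name here (`PlannedCommit`, `CommitHook`, `ChangeDataFeed`, `plan_update`, `commit`) is hypothetical and chosen only for illustration; none of it is existing delta-rs API, and the CDF hook fabricates a placeholder cdc path instead of writing real change rows.

```rust
// Hypothetical types for illustration only; none of these names come from delta-rs.
#[derive(Debug, Clone, PartialEq)]
enum Action {
    Add(String),
    Remove(String),
    Cdc(String),
}

/// Step 1: an operation only plans actions against the snapshot version it read.
struct PlannedCommit {
    read_version: u64,
    actions: Vec<Action>,
}

/// Step 3: a feature augments a planned commit without the operation knowing about it.
trait CommitHook {
    fn augment(&self, plan: &mut PlannedCommit);
}

struct ChangeDataFeed;

impl CommitHook for ChangeDataFeed {
    fn augment(&self, plan: &mut PlannedCommit) {
        // Placeholder logic: one cdc entry per added file. A real implementation
        // would compute the changed rows and write them under `_change_data/`.
        let cdc: Vec<Action> = plan
            .actions
            .iter()
            .filter_map(|action| match action {
                Action::Add(path) => Some(Action::Cdc(format!("_change_data/cdc-{path}"))),
                _ => None,
            })
            .collect();
        plan.actions.extend(cdc);
    }
}

/// An update operation reduced to planning: no commit happens here.
fn plan_update(read_version: u64) -> PlannedCommit {
    PlannedCommit {
        read_version,
        actions: vec![
            Action::Remove("part-0000.parquet".to_string()),
            Action::Add("part-0001.parquet".to_string()),
        ],
    }
}

/// Step 2: the single central place that applies hooks and performs the "commit".
/// Here it just returns the new version and the final action list.
fn commit(mut plan: PlannedCommit, hooks: &[&dyn CommitHook]) -> (u64, Vec<Action>) {
    for hook in hooks {
        hook.augment(&mut plan);
    }
    (plan.read_version + 1, plan.actions)
}

fn main() {
    let (version, actions) = commit(plan_update(4), &[&ChangeDataFeed]);
    println!("version {version}: {actions:?}");
}
```

With this shape, `plan_update` stays unaware of CDF: passing an empty hook list yields the plain add/remove commit, and enabling the feature only changes what is handed to `commit`.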

This will benefit us in the sense that CDF's implementation will have no effect on what those other operations do; it only augments them. Generally, our implementations of these features will then be more resilient to mistakes as we add more features down the line. Row tracking comes to mind as a potential issue, since it has a specific clause for readers regarding CDF files. Additionally, checkpoints must go the opposite way and exclude CDF from their checkpointing. I linked these under the related issues, but hopefully that makes sense.
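As a small illustration of the checkpoint point, the sketch below drops cdc actions before a checkpoint is written. The `Action` enum and `checkpoint_actions` are hypothetical names, and the input is assumed to be an already reconciled action list; a real checkpoint writer does considerably more than this filter.

```rust
// Hypothetical, simplified action model; not the delta-rs types.
#[derive(Debug, Clone, PartialEq)]
enum Action {
    Add(String),
    Remove(String),
    Cdc(String),
}

/// Keep everything except cdc actions, which do not belong in a checkpoint.
fn checkpoint_actions(actions: &[Action]) -> Vec<Action> {
    actions
        .iter()
        .filter(|action| !matches!(action, Action::Cdc(_)))
        .cloned()
        .collect()
}

fn main() {
    let log = vec![
        Action::Remove("part-0000.parquet".to_string()),
        Action::Add("part-0001.parquet".to_string()),
        Action::Cdc("_change_data/cdc-0000.parquet".to_string()),
    ];
    println!("{:?}", checkpoint_actions(&log));
}
```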

Use Case
https://delta.io/blog/2023-07-14-delta-lake-change-data-feed-cdf/
https://docs.delta.io/latest/delta-change-data-feed.html
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#add-cdc-file

Related Issue(s)
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#reader-requirements-for-row-tracking
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#checkpoints-1

@hntd187 hntd187 added the enhancement New feature or request label Jan 21, 2024
@hntd187
Collaborator Author

hntd187 commented Jan 21, 2024

take

@ion-elgreco
Collaborator

ion-elgreco commented Jan 21, 2024

I think that ties in well with what @Blajda proposes in #2006, where he also points out that we should reuse components across the different operations. Maybe it's good to move to logical plans throughout the API first, before we start working on CDF.

@ion-elgreco ion-elgreco added binding/python Issues for the Python package binding/rust Issues for the Rust crate labels Jan 25, 2024
@rtyler rtyler added this to the Rust v0.18 milestone Feb 6, 2024
@rtyler rtyler self-assigned this Feb 26, 2024
rtyler added a commit to rtyler/delta-rs that referenced this issue May 7, 2024
This change introduces a `CDCTracker` which helps collect changes during
merges and update. This is admittedly rather inefficient, but my hope is
that this provides a place to start iterating and improving upon the
writer code

There is still additional work which needs to be done to handle table
features properly for other code paths (see the middleware discussion we
have had in Slack) but this produces CDC files for Update operations

Fixes delta-io#604
Fixes delta-io#2095
rtyler added a commit to rtyler/delta-rs that referenced this issue May 12, 2024
rtyler added a commit to rtyler/delta-rs that referenced this issue May 17, 2024
rtyler added a commit to rtyler/delta-rs that referenced this issue May 21, 2024
rtyler added a commit to rtyler/delta-rs that referenced this issue May 29, 2024
rtyler added a commit to rtyler/delta-rs that referenced this issue May 29, 2024
rtyler added a commit to rtyler/delta-rs that referenced this issue Jun 1, 2024
rtyler added a commit to rtyler/delta-rs that referenced this issue Jun 3, 2024
ion-elgreco pushed a commit to rtyler/delta-rs that referenced this issue Jun 4, 2024

3 participants