Proposal to create logical plans for operations #2006
Comments
Great writeup @Blajda - I am a great fan of going the logical plan route for all our operations. In another repo, I have been experimenting a bit with the end-to-end flow starting with parsing Delta-specific SQL, and will upstream some of these changes soon. Reading through your proposals, it seems the "find files" plan would essentially contain the logic from kernel and the concept of a

Essentially the plan is to make our state less eager and hopefully improve processing performance along the way. The logic is as follows.
I think this would pay into the proposed

As a "corollary" of this work we also have some progress on a much improved parquet reader that supports selective reading of leaf columns (also for nested structs), as well as more fine-granular casting of the schema - i.e. support for schema evolution. Good news is, recently we have been consolidating the used APIs from the current

Implicitly, your proposal also moves us to keep data much longer as RecordBatches and avoid creating actual

As you said, this will be a lot of work, but also generate a lot of impact!
Is this work available anywhere? To start I want to update

I'm aligned with the changes being made for
To clarify about operations needing to clone
So, part of what I was suggesting in #2095 was refactoring the builders to pass back their actions instead of doing everything in their own internal way. I'm in agreement that moving to this method would help a lot for generalizing execution, but at a slightly higher level, how do we think this would alter operation builders? Their public-facing API obviously would not change, but internally they would pass back actions and an execution plan? Or does the execution plan also take care of actions and commit log updates?
@hntd187 The execution plan would also take care of the actions and commit log updates at this point. In the above diagrams this would have been done in the Delta write relation; however, we can consider splitting write and commit into two different relations if it makes it easier.
Well, I suppose the logical plans don't need to care about the details of what has to happen with actions, but the physical ones will need to be able to pass that context along. I don't have a strong opinion about where the write and commit happen, just that you can compose the actions of various logical operators into a single planned write and commit. Hopefully that makes sense.
# Description

Some of my first workings on David's proposal in #2006. This is also meant to push #2048 and general CDF forward by making the logical operations of delta tables more composable than they are today.

# Related Issue(s)

#2006 #2048

I think, and @Blajda correct me here, we can build upon this and eventually move towards a `DeltaPlanner`-esque enum for operations and their associated logical plan building.

# Still to do

- [ ] Implement a different path for partition columns that doesn't require scanning the file
- [ ] Plumbing into `DeltaScan` so delta scan can make use of this logical node
- [ ] General polish and cleanup; there are lots of unnecessary fields and ways things are built
- [ ] More tests; there is currently a large integration-style end-to-end test, but this can / should be broken down
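The `DeltaPlanner`-esque enum mentioned above might look roughly like the following. This is only a sketch of the shape of the idea; `DeltaOperation` and `root_node` are hypothetical names, not delta-rs APIs.

```rust
// Hypothetical sketch: one enum variant per table operation, each knowing
// which logical plan node it would build to. Not the delta-rs API.

#[derive(Debug, PartialEq)]
enum DeltaOperation {
    Delete { predicate: String },
    Update { predicate: String },
    Merge,
}

impl DeltaOperation {
    /// Name of the root logical node this operation would plan to.
    fn root_node(&self) -> &'static str {
        match self {
            DeltaOperation::Delete { .. } => "DeltaDelete",
            DeltaOperation::Update { .. } => "DeltaUpdate",
            DeltaOperation::Merge => "DeltaMerge",
        }
    }
}

fn main() {
    let op = DeltaOperation::Delete { predicate: "id = 1".to_string() };
    println!("{}", op.root_node());
}
```

A planner like this would let each operation builder return a logical plan instead of executing eagerly, which is the composability the PR is aiming at.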
Description
Propose further work that I'd like to perform regarding the creation of reusable logical relations. This also helps with identifying relations we would need with substrait.
Delta Find Files
Purpose: Identify files that contain records that satisfy a predicate.
This relation will generate a record batch stream with a single column called `path`. Each `path` will then map to an `Add` action in the Delta table. This relation will also maintain a list of files that satisfy the predicate, which can be passed sideways to relations downstream.
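The file-pruning idea behind find files can be sketched as follows, using per-file column statistics. All names here (`FileStats`, `find_files`, the `id` column) are illustrative stand-ins, not delta-rs types.

```rust
// Hypothetical sketch of the "find files" relation: given per-file column
// statistics, yield only the paths whose stats overlap the predicate.
// Each returned path would map to an `Add` action in the Delta log.

#[derive(Debug, Clone)]
struct FileStats {
    path: String,
    min_id: i64, // min/max stats for a hypothetical `id` column
    max_id: i64,
}

/// Return the paths of files that may contain rows with `id == target`.
fn find_files(files: &[FileStats], target: i64) -> Vec<String> {
    files
        .iter()
        .filter(|f| f.min_id <= target && target <= f.max_id)
        .map(|f| f.path.clone())
        .collect()
}

fn main() {
    let files = vec![
        FileStats { path: "part-0.parquet".into(), min_id: 0, max_id: 9 },
        FileStats { path: "part-1.parquet".into(), min_id: 10, max_id: 19 },
    ];
    // Only part-1 can contain id == 12, so only it is a candidate.
    println!("{:?}", find_files(&files, 12));
}
```

In the real relation the output would be a record batch stream rather than a `Vec`, so candidates can flow downstream as they are found.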
Delta Scan
Purpose: Scan the Delta Table
Update `DeltaScan` to take an optional input stream that contains paths of files to be scanned. This will enable `DeltaScan` to consume the output of `DeltaFindFile`. Currently, when using find files, we must wait for the entire operation to complete before we build the scan. This change enables Delta Scan to start as soon as the first candidate file is identified.
I think this will require some significant work, since it will involve refactoring the current `DeltaScan` implementation.
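The streaming hand-off can be pictured with a plain channel: find files emits candidate paths one at a time, and the scan consumes each path as it arrives instead of waiting for the full list. This is a minimal std-only sketch; `scan_stream` and the path names are hypothetical.

```rust
// Sketch of the proposed hand-off: the scan side starts work on each
// candidate path as soon as it arrives. Names are illustrative only.

use std::sync::mpsc::{channel, Receiver};
use std::thread;

/// Stand-in for the streaming DeltaScan: processes paths as they arrive.
fn scan_stream(paths: Receiver<String>) -> Vec<String> {
    paths.into_iter().map(|p| format!("scanned {p}")).collect()
}

fn main() {
    let (tx, rx) = channel();
    // Stand-in for the find-files relation: emits candidates one by one.
    let producer = thread::spawn(move || {
        for p in ["part-0.parquet", "part-3.parquet"] {
            tx.send(p.to_string()).unwrap();
        }
        // Dropping `tx` here ends the stream.
    });
    let scanned = scan_stream(rx);
    producer.join().unwrap();
    println!("{scanned:?}");
}
```

In delta-rs this would be a `RecordBatch` stream between physical operators rather than an mpsc channel, but the ordering property is the same: scanning overlaps with candidate discovery.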
Delta Write
Purpose: Write records to storage, conflict resolution, and commit creation
Takes a single input stream of data that matches the table's schema and creates `Add` actions for each new file. Information can be passed sideways to include additional Delta actions in the commit, e.g. `DeltaDelete` can provide a stream of `Remove` actions.

Delta Delete
Purpose: Delete records from the table.
Given a predicate, delete records from the Delta table.
Delta Delete can take an optional stream of records and will output the records that do NOT satisfy the predicate.
It will maintain a stream of `Remove` actions that can be passed sideways to other operations downstream. The input stream is optional since there are cases where delete can determine which files to remove without needing a scan. An optimization phase can help determine when this is the case.
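The two outputs of the delete relation can be sketched as a single pass over `(file, row)` pairs: retained rows flow downstream, while the set of touched files becomes `Remove` actions passed sideways. `delete_rows` and the row representation are hypothetical stand-ins, not delta-rs types.

```rust
// Illustrative split of the delete relation's two outputs.
// Rows are (source file path, value) pairs standing in for a record batch
// stream; the predicate marks rows to delete.

/// Returns (retained rows, files needing a `Remove` action).
/// A file appears in the sideways output if any of its rows matched
/// the predicate, since that file must be rewritten.
fn delete_rows(
    rows: &[(&str, i64)],
    pred: impl Fn(i64) -> bool,
) -> (Vec<i64>, Vec<String>) {
    let mut retained = Vec::new();
    let mut removed_files: Vec<String> = Vec::new();
    for &(file, v) in rows {
        if pred(v) {
            if !removed_files.iter().any(|f| f == file) {
                removed_files.push(file.to_string());
            }
        } else {
            retained.push(v);
        }
    }
    (retained, removed_files)
}

fn main() {
    let rows = [("part-0.parquet", 1), ("part-0.parquet", 5), ("part-1.parquet", 9)];
    let (retained, removes) = delete_rows(&rows, |v| v >= 5);
    println!("retained: {retained:?}, removes: {removes:?}");
}
```

The retained rows are exactly what a downstream write relation would turn into new `Add` actions, while the `Remove` list is the sideways stream described above.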
Diagram
High-level diagram of how these relations will connect.
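As a code-level sketch of the same composition, a replace-by-predicate pipeline (find files, delete, write, commit) can be written as plain functions standing in for the logical plan nodes. All names and types here are hypothetical, not delta-rs APIs.

```rust
// find-files -> delete -> write, folded into one commit: rewrite every file
// whose rows match the predicate, emitting a `Remove` for the old file and
// an `Add` for its replacement. Illustrative stand-ins only.

#[derive(Debug, PartialEq)]
enum Action {
    Add(String),
    Remove(String),
}

fn replace_where(
    files: &[(&str, Vec<i64>)],
    pred: impl Fn(i64) -> bool,
) -> Vec<Action> {
    let mut commit = Vec::new();
    for (path, rows) in files {
        if rows.iter().any(|&v| pred(v)) {
            // File contains matching rows: remove it and write a new file
            // holding only the retained rows.
            commit.push(Action::Remove(path.to_string()));
            commit.push(Action::Add(format!("rewritten-{path}")));
        }
        // Files with no matching rows are left untouched.
    }
    commit
}

fn main() {
    let files = [("part-0.parquet", vec![1, 2]), ("part-1.parquet", vec![10, 11])];
    println!("{:?}", replace_where(&files, |v| v >= 10));
}
```

The point of the composition is that the `Remove` and `Add` actions from different relations end up in one planned commit, matching the discussion above about a single planned write and commit.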
Converting the `ReplaceWhere` operation to a logical view can look something like this:

Use Case
Once we have logical plans for Update and Delete, we can expose new DataFusion SQL statements for them.
This may also help with reusing Delete & Update in other logical plans.
Related Issue(s)