Read transforms via expressions. Part 1: Just compute the expression and return it. #607

Open
wants to merge 16 commits into
base: main
Collaborator

@nicklan nicklan commented Dec 18, 2024

What changes are proposed in this pull request?

This is the initial part of moving to using expressions to express transformations when reading data. What this PR does is:

  • Compute a "static" transform: for each column, either an expression that passes the column through unchanged, or enough metadata for lower levels to fill in a "fix-up" expression
  • The static transform is passed into the iterator that parses each Add file
  • When parsing the Add file, if any fix-ups are needed (just partition columns today), the correct expression is created and inserted into a row-indexed map
  • This map is returned so the caller can find out for a given row what, if any, expression needs to be applied when reading the specified row
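
To make the flow above concrete, here is a minimal sketch of the idea. It is not the code in this PR: `ColumnTransform`, `Expr`, `transform_for_add`, and `transforms_for_batch` are hypothetical names, and the expression type is a toy stand-in rather than the kernel's real one. It assumes partition values arrive as strings keyed by column name.

```rust
use std::collections::HashMap;

// NOTE: every type and function here is hypothetical and exists only to
// illustrate the description above; none of it is the kernel's actual API.

/// One entry of the "static" transform, computed once per scan.
#[derive(Debug, Clone, PartialEq)]
enum ColumnTransform {
    /// The column is read from the data file and passed through unchanged.
    PassThrough(String),
    /// The column is a partition column; its value is filled in per Add file.
    Partition(String),
}

/// A toy stand-in for an expression type.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    /// `None` models a missing (null) partition value.
    Literal(Option<String>),
    Struct(Vec<Expr>),
}

/// Build the fix-up expression for one Add file, or `None` if no fix-up is needed.
fn transform_for_add(
    static_transform: &[ColumnTransform],
    partition_values: &HashMap<String, String>,
) -> Option<Expr> {
    // Without partition columns every column passes through, so there is nothing to fix up.
    let needs_fixup = static_transform
        .iter()
        .any(|t| matches!(t, ColumnTransform::Partition(_)));
    if !needs_fixup {
        return None;
    }
    let fields: Vec<Expr> = static_transform
        .iter()
        .map(|t| match t {
            ColumnTransform::PassThrough(name) => Expr::Column(name.clone()),
            ColumnTransform::Partition(name) => {
                Expr::Literal(partition_values.get(name).cloned())
            }
        })
        .collect();
    Some(Expr::Struct(fields))
}

/// Build the row-indexed map for a batch in which each row is one Add action.
fn transforms_for_batch(
    static_transform: &[ColumnTransform],
    adds: &[HashMap<String, String>],
) -> HashMap<usize, Expr> {
    adds.iter()
        .enumerate()
        .filter_map(|(row, pv)| transform_for_add(static_transform, pv).map(|e| (row, e)))
        .collect()
}
```

A caller would then look up a row index in the returned map; a missing entry means the row needs no transformation.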

Follow-up PRs will:

  1. Propagate this information through when using visit_scan_files
  2. Actually use the data to do transformation
  3. Remove transform_to_logical entirely and clean up associated code

Each of those is more invasive and ends up touching significant code, so I'm staging this as much as possible to make reviews easier.

How was this change tested?

Unit tests, and inspection of resultant expressions when run on tables

@nicklan nicklan requested review from zachschuermann, scovich and OussamaSaoudi-db and removed request for zachschuermann and scovich December 18, 2024 20:54
@github-actions github-actions bot added the breaking-change Change that will require a version bump label Dec 18, 2024
}
}

/// Given an iterator of (engine_data, bool) tuples and a predicate, returns an iterator of
/// `(engine_data, selection_vec)`. Each row that is selected in the returned `engine_data` _must_
/// be processed to complete the scan. Non-selected rows _must_ be ignored. The boolean flag
/// indicates whether the record batch is a log or checkpoint batch.
-pub fn scan_action_iter(
+pub(crate) fn scan_action_iter(
Collaborator Author

Note this is a significant change, as we no longer expose this function. In discussion so far we've agreed that it should never have been pub, and that I made a mistake in making it so. An engine should call scan_data, which mostly just proxies to this but doesn't expose internal details to the engine.

Open to discussion though.
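
For readers less familiar with the pattern, a rough sketch of "public method proxying to a crate-private function" follows. The signatures are invented and heavily simplified: only the names `scan_data` and `scan_action_iter` come from this PR, and the `Scan` struct and its `files` field here are hypothetical stand-ins.

```rust
mod scan {
    /// Hypothetical, heavily simplified scan over a list of file paths.
    pub struct Scan {
        files: Vec<String>,
    }

    impl Scan {
        pub fn new(files: Vec<String>) -> Self {
            Scan { files }
        }

        /// Public entry point for engines; it just forwards to the internal iterator.
        pub fn scan_data(&self) -> impl Iterator<Item = (String, bool)> + '_ {
            scan_action_iter(self.files.iter().cloned())
        }
    }

    /// Visible inside this crate only, so engines cannot depend on its signature.
    pub(crate) fn scan_action_iter(
        files: impl Iterator<Item = String>,
    ) -> impl Iterator<Item = (String, bool)> {
        // Toy behaviour: mark every file as selected. The real function does much more.
        files.map(|path| (path, true))
    }
}
```

With this shape, the internal function's arguments can change in follow-up PRs without breaking engines, as long as `scan_data` keeps its signature.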

Collaborator

pub(crate) SGTM!
