Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify SchemaTransform API #531

Merged
merged 2 commits into from
Nov 26, 2024
Merged

Conversation

scovich
Copy link
Collaborator

@scovich scovich commented Nov 22, 2024

What changes are proposed in this pull request?

Change SchemaTransform to have a trait-level named lifetime and for its methods to take &'a Foo instead of Cow<'a, Foo>. Not only was Cow unnecessary there, it also caused borrow checker problems (recursing on a Cow consumes it, which fails borrow checks if there are any references to the underlying reference -- we know it's a Cow::Borrowed but the compiler cannot prove that.

How was this change tested?

Existing tests.

Copy link

codecov bot commented Nov 23, 2024

Codecov Report

Attention: Patch coverage is 95.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 80.41%. Comparing base (4a0fad2) to head (6ff8f1a).

Files with missing lines Patch % Lines
kernel/src/schema.rs 96.49% 1 Missing and 1 partial ⚠️
kernel/src/scan/data_skipping.rs 66.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #531      +/-   ##
==========================================
- Coverage   80.43%   80.41%   -0.02%     
==========================================
  Files          62       62              
  Lines       13645    13641       -4     
  Branches    13645    13641       -4     
==========================================
- Hits        10975    10970       -5     
- Misses       2114     2116       +2     
+ Partials      556      555       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cow -> ref change LGTM but left a question about the trait-level lifetime

/// Called for each array element encountered during the schema traversal. Implementations can
/// call [`Self::transform`] if they wish to recursively transform the array element type.
fn transform_array_element(&mut self, etype: &'a DataType) -> Option<Cow<'a, DataType>> {
self.transform(etype)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious: why do we have these transform_array_element/map_key/map_val that just wrap transform?

Copy link
Collaborator Author

@scovich scovich Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're missing hook points this PR adds. The nested data skipping and column mapping PR will use them to improve its error messages (because column mapping transformation needs to recurse through maps and arrays, and if something goes wrong inside them, the error message should make it clear where the problem was.

(there's no recurse_into_xxx because no additional boilerplate is needed)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks :)

@@ -647,61 +647,73 @@ impl Display for DataType {
/// The provided `recurse_into_xxx` methods encapsulate the boilerplate work of recursing into the
/// child schema elements of each schema element. Implementations can call these as needed but will
/// generally not need to override them.
pub trait SchemaTransform {
pub trait SchemaTransform<'a> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does sharing the lifetime across methods make sense here? put another way: should the lifetimes of the types we are transforming all be the same? I suppose the answer is yes since each method is invokes as we are transforming some type and possibly recursing which would imply the same lifetime requirements?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was an observation by @nicklan that allows the data skipping PR to not clone string field names it stores references to -- everything reachable through the entry point must survive at least as long as the entry point, so we can define a single lifetime for the entire traversal that visitors can take advantage of.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay yep makes sense! thank you!

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! one very minor nit. Thanks!

&mut self,
_ptype: Cow<'a, PrimitiveType>,
_ptype: &'a PrimitiveType,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor nit:

Suggested change
_ptype: &'a PrimitiveType,
_: &'a PrimitiveType,

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was normal to keep the names of unused params in public method definitions?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually not sure if there's "normal" on this one. I don't find it adds much here so suggested removing it, but I don't feel strongly either way

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick regexp search over the kernel sources shows seven places that use _: Foo as a function argument and ~100 that use _foo: Foo. I think I'll leave this one as-is.

@scovich scovich merged commit 6eb2dc4 into delta-io:main Nov 26, 2024
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants