
apply a schema to fix column names #331

Merged: 54 commits into delta-io:main on Oct 21, 2024

Conversation

@nicklan (Collaborator) commented on Sep 10, 2024:

Enforce/apply the given schema when we evaluate an expression.

Since we want to move to expression-based fix-up and let the final schema dictate the output, we will need this.

This code fixes up all levels of the output, which is messy in arrow since schemas are embedded all over the place. The schema is only applied if the output of the expression doesn't exactly match the passed schema.
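A minimal sketch of the idea, assuming arrow-rs types and handling only the top level (the real fix-up has to recurse into nested structs, lists, and maps, since field names are embedded in each column's DataType as well); the function name is illustrative, not the PR's API:

use std::sync::Arc;

use arrow::datatypes::Schema;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn apply_schema(batch: RecordBatch, target: Arc<Schema>) -> Result<RecordBatch, ArrowError> {
    // Only apply the schema if the expression output doesn't already match it exactly.
    if batch.schema() == target {
        return Ok(batch);
    }
    // Rebuild the batch under the target schema, reusing the column data as-is.
    // `try_new` re-validates, so genuinely mismatched data types still error out.
    RecordBatch::try_new(target, batch.columns().to_vec())
}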

codecov bot commented on Sep 10, 2024:

Codecov Report

Attention: Patch coverage is 88.73786% with 58 lines in your changes missing coverage. Please review.

Project coverage is 78.85%. Comparing base (2b1c46f) to head (929a187).
Report is 1 commit behind head on main.

Files with missing lines                  Patch %    Lines
kernel/src/engine/arrow_expression.rs     73.07%     16 missing, 19 partials ⚠️
kernel/src/engine/ensure_data_types.rs    94.04%     17 missing, 3 partials ⚠️
kernel/src/engine/arrow_utils.rs          87.50%     0 missing, 1 partial ⚠️
kernel/src/error.rs                        0.00%     1 missing ⚠️
kernel/src/scan/mod.rs                     0.00%     0 missing, 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #331      +/-   ##
==========================================
+ Coverage   78.28%   78.85%   +0.56%     
==========================================
  Files          50       51       +1     
  Lines       10292    10640     +348     
  Branches    10292    10640     +348     
==========================================
+ Hits         8057     8390     +333     
  Misses       1781     1781              
- Partials      454      469      +15     


@scovich (Collaborator) left a comment:

Initial pass. Interested to understand better how this fits into the broader column mapping situation, etc.

@nicklan marked this pull request as ready for review on October 8, 2024 at 23:33.
@nicklan force-pushed the fix-cm-name-take1 branch from 57f3c96 to 29b09f7 on October 8, 2024 at 23:34.
@nicklan requested a review from @scovich on October 8, 2024 at 23:35.
Comment on lines 381 to 382
// make column `col` with type `arrow_type` look like `kernel_type`. For now this only handles name
// transforms. if the actual data types don't match, this will return an error

Collaborator:
What about nullability? IIRC we have to read parquet with everything nullable, because parquet can't express the concept of a non-nullable field nesting inside a nullable field.

Or did we handle that already by just making everything nullable in our action schema?

Collaborator Author:
We made all the top level structs nullable in the action schema (which is correct). But we do have non-nullable things inside. So I suspect maybe the final things don't match on nullability if the parquet schema really says everything is nullable. I will test and if we get mismatch we can fix that here too

Collaborator:
If we do decide to allow non-nullable fields nested inside nullable fields, we'll have to verify that our default engine's row visitor handles it gracefully?

Collaborator Author:
We do indeed have nullability mismatches in that the arrow schema says things can be null and our schema says they cannot.

I feel like this needs a little thought to fix, as it'll slow down metadata parsing if we have to go through and fix up the schema with nullability changes every time, even though that schema is never exposed to the connector.

There's also the issue of if we should adjust the metadata of each schema element, although that's less semantically important.

Thoughts?

Collaborator Author:
The code now really makes the schema match the target, both for nullability and for metadata. This isn't cheap, so we probably need to discuss it a bit.

Collaborator:
nit: can we add to this comment and give the `Result<Option<...>>` semantics? That is (AFAICT), we return `Some(new array)` for any complex type and `None` if we validated that the primitives match up?
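
A hypothetical illustration of that contract (the function name, error type, and arrow-only types here are stand-ins, not the PR's actual kernel API):

use std::sync::Arc;

use arrow::array::{Array, ArrayRef};
use arrow::datatypes::DataType as ArrowDataType;

fn apply_schema_to(col: &ArrayRef, target_type: &ArrowDataType) -> Result<Option<ArrayRef>, String> {
    match (col.data_type(), target_type) {
        // Complex types: return a fixed-up copy, recursing into children.
        (ArrowDataType::Struct(_) | ArrowDataType::List(_) | ArrowDataType::Map(_, _), _) => {
            Ok(Some(Arc::clone(col))) // the real code would construct a new array here
        }
        // Primitives that already match: nothing to do, the caller keeps using `col`.
        (actual, expected) if actual == expected => Ok(None),
        // Anything else is a genuine type mismatch.
        (actual, expected) => Err(format!("type mismatch: {actual:?} vs {expected:?}")),
    }
}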

Collaborator:
See my other comment -- I think we can simplify the method contract

Comment on lines 412 to 414
fn make_data_type_physical(
    logical_dt: &DataType,
    column_mapping_mode: ColumnMappingMode,

Collaborator:
FYI: Normally, the field's field id and physical name are stored in the field's metadata... but Iceberg requires field ids even for the internal columns used by Map and Array, and there's no way to associate metadata with those. So, when IcebergCompatV2 table feature is enabled, we have to remember the most recently-seen field, as well as the column path we descended through since then, so we can fetch the field ids out of that parent field's metadata.

Collaborator:
Actually, I don't think Delta cares about those iceberg field ids -- even in column mapping field mode -- so maybe we can ignore all of this on the read path.

Collaborator Author:
I'm going to close my eyes and hope you're right ;)

@nicklan (Collaborator Author) left a comment:

Addressed some comments, not all. If I haven't replied to a comment, I'm still getting to it.

@nicklan (Collaborator Author) left a comment:

Thanks for the feedback!

arrow_field.is_nullable()
)));
}
if &kernel_field.metadata_as_string() != arrow_field.metadata() {

Collaborator Author:
Sadly both kernel and arrow metadata are just HashMap and we can't impl PartialEq<HashMap<X,Y>> for HashMap<X,Z> because HashMap isn't in our crate.

So I've done an impl to compare MetadataValue to a String, and written a helper method to compare the two hashmaps. Not as general but still pretty clean.
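
A minimal sketch of that approach; the `MetadataValue` variants here are illustrative rather than the kernel's exact enum, and the helper folds in the length early-exit and `is_some_and` form discussed later in this thread:

use std::collections::HashMap;

enum MetadataValue {
    String(String),
    Number(i64),
}

// Compare a kernel metadata value against arrow's string-valued metadata entry.
impl PartialEq<String> for MetadataValue {
    fn eq(&self, other: &String) -> bool {
        match self {
            MetadataValue::String(s) => s == other,
            MetadataValue::Number(n) => n.to_string() == *other,
        }
    }
}

// Helper comparing the two maps, since `impl PartialEq<HashMap<String, String>> for
// HashMap<String, MetadataValue>` is ruled out by the orphan rule.
fn metadata_eq(
    kernel_metadata: &HashMap<String, MetadataValue>,
    arrow_metadata: &HashMap<String, String>,
) -> bool {
    let kernel_len = kernel_metadata.len();
    if kernel_len != arrow_metadata.len() {
        return false;
    }
    if kernel_len == 0 {
        return true; // both empty: nothing to compare
    }
    kernel_metadata
        .iter()
        .all(|(key, value)| arrow_metadata.get(key).is_some_and(|v| *value == *v))
}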

@nicklan requested a review from @scovich on October 18, 2024 at 21:59.

@scovich (Collaborator) left a comment:

Getting so close!

Sorry for all the review passes and suggestions, but there's so much (and such complex) code that it's important to simplify it as much as possible if we want to stay sane...

}
kernel_metadata
.iter()
.all(|(key, value)| arrow_metadata.get(key).map_or(false, |v| *value == *v))

Collaborator:
Does this work?

Suggested change:
-    .all(|(key, value)| arrow_metadata.get(key).map_or(false, |v| *value == *v))
+    .all(|(key, value)| arrow_metadata.get(key).is_some_and(|v| *value == *v))

Collaborator Author:
hah, yep. You should let the rust guys know: https://doc.rust-lang.org/src/std/collections/hash/map.rs.html#1289

Collaborator:
Rescuing #331 (comment):

We're still only at 9% patch coverage. We might need to take @zachschuermann's advice and make a schema just for this?

Maybe you could reuse this one from the schema depth checker test?
https://github.com/delta-incubator/delta-kernel-rs/blob/main/kernel/src/schema.rs#L1005
It has a wide variety of nested structures, at least.

Down to 7% now...

Collaborator Author:
Yep, I've added some more tests and it's up now

),
])
.map(|(arrow_field, (target_type, nullable))| {
StructField::new(arrow_field.name(), target_type.clone(), nullable)

Collaborator:
Just double checking -- no metadata here, because this is a hidden/internal struct that isn't part of the user-visible schema?

Collaborator Author:
Yep

@nicklan (Collaborator Author) left a comment:

Thanks! Appreciate all the feedback. Adding tests actually uncovered that we weren't properly validating the nullability of array/map elements.

@nicklan requested a review from @scovich on October 18, 2024 at 23:50.

@scovich (Collaborator) left a comment:

One last round of simplification and I think we're done!

Comment on lines 448 to 453
let transformed_field = new_field_with_metadata(
    field.name(),
    transformed_values.data_type(),
    target_inner_type.contains_null,
    None,
);

Collaborator:
Heh... after all that, I just realized that this one is simply

Suggested change:
-    let transformed_field = new_field_with_metadata(
-        field.name(),
-        transformed_values.data_type(),
-        target_inner_type.contains_null,
-        None,
-    );
+    let transformed_field = ArrowField::new(
+        field.name(),
+        transformed_values.data_type().clone(),
+        target_inner_type.contains_null,
+    );

... which leaves new_field_with_metadata with only a single call site.

Collaborator:
That said, we probably should keep the helper function, and update our impl TryFrom<&StructField> for ArrowField to use it.

Collaborator Author:
Agree on updating that. Right now those things are protected by different feature flags.

I've filed #411 to sort this out and make the above suggested change.

Comment on lines -149 to -155
(
    DataType::Primitive(PrimitiveType::Decimal(kernel_prec, kernel_scale)),
    ArrowDataType::Decimal128(arrow_prec, arrow_scale),
) if arrow_prec == kernel_prec && *arrow_scale == *kernel_scale as i8 => {
    // decimal isn't primitive in arrow. cast above is okay as we limit range
    Ok(DataTypeCompat::Identical)
}

Collaborator:
Was this match arm deleted on purpose?
Does the cast-compat path cover it now, perhaps?

Collaborator Author:
Yes, exactly. Should have noted that in a comment here.

Comment on lines 894 to 895
) -> bool {
if kernel_metadata.len() != arrow_metadata.len() {

Collaborator:
It's not rare for fields to have no metadata, so perhaps worth doing:

Suggested change:
 ) -> bool {
+    if kernel_metadata.is_empty() && arrow_metadata.is_empty() {
+        return true;
+    }
     if kernel_metadata.len() != arrow_metadata.len() {

But on the other hand, an empty iterator should be very cheap to traverse, so meh?

Collaborator Author:
`is_empty()` just calls `self.len() == 0`. So since we're optimizing here, I went with saving the kernel len and then exiting early if the lens are equal and the kernel len is 0.

@@ -126,15 +127,46 @@ fn can_upcast_to_decimal(
&& target_precision - source_precision >= (target_scale - source_scale) as u8
}

/// check if two fields have the same nullability and metadata
fn test_nullability_and_metadata(

Collaborator:
Naming? This sounds like a unit test at first glance... maybe something like verify_nullability_and_metadata_eq would be clearer?

Collaborator:
Or maybe ensure_nullability_and_metadata, following the same convention as ensure_data_types?

Collaborator:
Also: Suggest to split out ensure_nullability as a separate method, e.g.

fn ensure_nullability<S: Into<String>>(
    kernel_nullable: bool,
    arrow_nullable: bool,
    context: impl FnOnce() -> S,
) -> DeltaResult<()> {
    if kernel_nullable == arrow_nullable {
        Ok(())
    } else {
        Err(Error::Generic(format!(
            "{} {kernel_nullable} in kernel and {arrow_nullable} in arrow",
            context().into(),
        )))
    }
}

and then all three call sites can use it:

ensure_nullability(
    kernel_field.nullable, 
    arrow_field.is_nullable(), 
    || format!("Field {} has nullability", field.name),
)?

and e.g.

            if check_nullability_and_metadata {
                ensure_nullability(
                    inner_type.contains_null, 
                    arrow_list_field.is_nullable(),
                    || "List has contains_null",
                )?;
            }

Collaborator:
We could go a step further and wrap all this up in a helper struct:

// Convenient public entry point
pub(crate) fn ensure_data_types(...) -> DeltaResult<DataTypeCompat> {
    let check = EnsureDataTypes { check_nullability_and_metadata };
    check.ensure_data_types(kernel_type, arrow_type)
}

struct EnsureDataTypes {
  check_nullability_and_metadata: bool,
}
impl EnsureDataTypes {
    fn ensure_data_types(&self, ...) -> DeltaResult<DataTypeCompat> {
       ... today's match statement, but simpler ...
    }
    
    fn ensure_nullability(&self, ...) -> DeltaResult<()> {
        if self.check_nullability_and_metadata && kernel_nullable != arrow_nullable {
            Err(...)
        } else {
            Ok(())
        }
    }

    fn ensure_nullability_and_metadata(&self, ...) -> DeltaResult<()> {
        self.ensure_nullability(...)?;
        match self.check_nullability_and_metadata {
            true if !metadata_eq(&kernel_field.metadata, arrow_field.metadata()) => Err(...),
            _ => Ok(()),
        }
    }
}

Then, the implementation simplifies to things like:

        (DataType::Array(inner_type), ArrowDataType::List(arrow_list_field)) => {
            self.ensure_nullability(
                inner_type.contains_null, 
                arrow_list_field.is_nullable(),
                || "List has contains_null",
            )?;
            self.ensure_data_types(
                &inner_type.element_type,
                arrow_list_field.data_type(),
            )
        }

Collaborator Author:
Yep this is nicer.

I ended up also moving all of the ensure_x stuff to its own module as the utils mod was getting real big.

Comment on lines 182 to 183
if check_nullability_and_metadata
&& inner_type.contains_null != arrow_list_field.is_nullable()

Collaborator:
Why not just use the helper method? There should never be any metadata on either field, so that part will be a no-op?

Collaborator:
Oh -- because we're not comparing two fields here!

(similar story for maps below)

Collaborator Author:
Yeah. I split out that method thinking I would call it in multiple places, then realized I wouldn't, but this method is big enough that I thought having the nullability check pulled out still made sense.

@@ -5,26 +5,32 @@
#[cfg(feature = "arrow-conversion")]
pub(crate) mod arrow_conversion;

#[cfg(feature = "arrow-expression")]
#[cfg(all(

Collaborator Author:
This was broken before because arrow_expression uses stuff in arrow_utils so enabling only arrow_expression would not have compiled.

pub mod arrow_expression;

#[cfg(any(feature = "default-engine", feature = "sync-engine"))]

Collaborator Author:
Moved this to a macro to avoid the repeated checking of the two feature flags. Hopefully reviewers agree it's better :)
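
A hypothetical sketch of what such a macro could look like (the PR's actual macro may differ); the feature names come from the cfg shown above, and the module list is illustrative:

// Gate each listed module behind the same pair of feature flags, so the
// feature list lives in one place instead of being repeated per module.
macro_rules! engine_mod {
    ( $($vis:vis mod $name:ident;)* ) => {
        $(
            #[cfg(any(feature = "default-engine", feature = "sync-engine"))]
            $vis mod $name;
        )*
    };
}

engine_mod! {
    pub mod arrow_expression;
    pub(crate) mod arrow_utils;
}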

@nicklan requested a review from @scovich on October 21, 2024 at 20:32.
}

/// Capture the compatibility between two data-types, as passed to [`ensure_data_types`]
pub(crate) enum DataTypeCompat {

Collaborator Author:
note this all moved to ensure_data_types.rs

@scovich (Collaborator) left a comment:

Nice!

Just a few suggestions to consider before merging.

)
}
(DataType::Map(kernel_map_type), ArrowDataType::Map(arrow_map_type, _)) => {
if let ArrowDataType::Struct(fields) = arrow_map_type.data_type() {

Collaborator:
Good place for the let-else idiom?

let ArrowDataType::Struct(fields) = arrow_map_type.data_type() else {
    return Err(make_arrow_error("Arrow map type wasn't a struct."));
};

Comment on lines 83 to 84
let mut fields = fields.iter();
if let Some(key_type) = fields.next() {

Collaborator:
Unlike kernel, arrow fields Deref to slice. So we should be able to do:

let [key_type, value_type] = fields else {
  return Err(...);
};
self.ensure_data_types(... key ...);
self.ensure_data_types(... value ...);
self.ensure_nullability(... value ...);

@github-actions bot added the "breaking-change (Change that will require a version bump)" label on Oct 21, 2024.
@nicklan removed the "breaking-change" label on Oct 21, 2024.
@nicklan merged commit cd53bc1 into delta-io:main on Oct 21, 2024.
17 checks passed