apply a schema to fix column names #331
Codecov Report — Attention: Patch coverage is

Additional details and impacted files:

    @@            Coverage Diff             @@
    ##             main     #331      +/-   ##
    ==========================================
    + Coverage   78.28%   78.85%    +0.56%
    ==========================================
      Files          50       51        +1
      Lines       10292    10640      +348
      Branches    10292    10640      +348
    ==========================================
    + Hits         8057     8390      +333
      Misses       1781     1781
    - Partials      454      469       +15

View full report in Codecov by Sentry.
Initial pass. Interested to understand better how this fits into the broader column mapping situation etc.
Updated from 57f3c96 to 29b09f7.
    // make column `col` with type `arrow_type` look like `kernel_type`. For now this only handles name
    // transforms. if the actual data types don't match, this will return an error
What about nullability? IIRC we have to read parquet with everything nullable, because parquet can't express the concept of a non-nullable field nesting inside a nullable field.
Or did we handle that already by just making everything nullable in our action schema?
We made all the top level structs nullable in the action schema (which is correct). But we do have non-nullable things inside. So I suspect maybe the final things don't match on nullability if the parquet schema really says everything is nullable. I will test and if we get mismatch we can fix that here too
If we do decide to allow non-nullable fields nested inside nullable fields, we'll have to verify that our default engine's row visitor handles it gracefully?
We do indeed have nullability mismatches: the arrow schema says things can be null and our schema says they cannot.
I feel like this is something that might need a little thought to fix, as it'll slow down metadata parsing if we have to go through and fix up the schema with nullability changes every time, although that will never be exposed to the connector.
There's also the question of whether we should adjust the metadata of each schema element, although that's less semantically important.
Thoughts?
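For reference, the compatibility rule being discussed can be sketched with a hypothetical helper (not the kernel's actual API, just the shape of the check):

```rust
// Hypothetical sketch of the nullability-compatibility rule: parquet has to
// be read with everything nullable, so an arrow field claiming nullable where
// the kernel schema says non-nullable is exactly the mismatch case to fix up.
fn nullability_compatible(kernel_nullable: bool, arrow_nullable: bool) -> bool {
    // Fine when the kernel allows nulls, or when both agree on non-nullable.
    kernel_nullable || !arrow_nullable
}

fn main() {
    assert!(nullability_compatible(true, true));
    assert!(nullability_compatible(false, false));
    // A stricter arrow field under a relaxed kernel field is also safe.
    assert!(nullability_compatible(true, false));
    // The problem case from this thread: arrow nullable, kernel non-nullable.
    assert!(!nullability_compatible(false, true));
}
```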
The code now really makes the schema match the target, both for nullability and for metadata. This isn't cheap, so we probably need to discuss it a bit.
nit: can we add to this comment and give the `Result<Option<...>>` semantics? That is (AFAICT) we return `Some(new array)` for any complex types and `None` if we validated that the primitives match up?
See my other comment -- I think we can simplify the method contract
kernel/src/scan/mod.rs
Outdated
    fn make_data_type_physical(
        logical_dt: &DataType,
        column_mapping_mode: ColumnMappingMode,
FYI: Normally, the field's field id and physical name are stored in the field's metadata... but Iceberg requires field ids even for the internal columns used by Map and Array, and there's no way to associate metadata with those. So, when IcebergCompatV2 table feature is enabled, we have to remember the most recently-seen field, as well as the column path we descended through since then, so we can fetch the field ids out of that parent field's metadata.
Actually, I don't think Delta cares about those iceberg field ids -- even in column mapping field mode -- so maybe we can ignore all of this on the read path.
I'm going to close my eyes and hope you're right ;)
Addressed some comments, not all. If I haven't replied to a comment, I'm still getting to it.
thanks for the feedback!
kernel/src/engine/arrow_utils.rs
Outdated
        arrow_field.is_nullable()
    )));
    }
    if &kernel_field.metadata_as_string() != arrow_field.metadata() {
Sadly both kernel and arrow metadata are just `HashMap`s, and we can't `impl PartialEq<HashMap<X,Y>> for HashMap<X,Z>` because `HashMap` isn't in our crate. So I've done an impl to compare `MetadataValue` to a `String`, and written a helper method to compare the two hashmaps. Not as general but still pretty clean.
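A minimal sketch of that orphan-rule workaround — the `MetadataValue` variants and map shapes here are assumptions for illustration, not the kernel's exact definitions:

```rust
use std::collections::HashMap;

// Assumed stand-in for the kernel's MetadataValue; the real enum may differ.
enum MetadataValue {
    String(String),
    Number(i64),
}

// We can impl PartialEq against String because MetadataValue is our own type,
// even though the orphan rule forbids impl'ing PartialEq between two foreign
// HashMap types.
impl PartialEq<String> for MetadataValue {
    fn eq(&self, other: &String) -> bool {
        match self {
            MetadataValue::String(s) => s == other,
            MetadataValue::Number(n) => n.to_string() == *other,
        }
    }
}

// Helper comparing the two differently-typed metadata maps key by key.
fn metadata_eq(
    kernel: &HashMap<String, MetadataValue>,
    arrow: &HashMap<String, String>,
) -> bool {
    kernel.len() == arrow.len()
        && kernel
            .iter()
            .all(|(k, v)| arrow.get(k).is_some_and(|a| *v == *a))
}

fn main() {
    let kernel = HashMap::from([("id".to_string(), MetadataValue::Number(7))]);
    let arrow = HashMap::from([("id".to_string(), "7".to_string())]);
    assert!(metadata_eq(&kernel, &arrow));
}
```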
Getting so close!
Sorry for all the review passes and suggestions, but there's so much (and such complex) code that it's important to simplify it as much as possible if we want to stay sane...
kernel/src/engine/arrow_utils.rs
Outdated
    }
    kernel_metadata
        .iter()
        .all(|(key, value)| arrow_metadata.get(key).map_or(false, |v| *value == *v))
Does this work?

    -    .all(|(key, value)| arrow_metadata.get(key).map_or(false, |v| *value == *v))
    +    .all(|(key, value)| arrow_metadata.get(key).is_some_and(|v| *value == *v))
hah, yep. You should let the rust guys know: https://doc.rust-lang.org/src/std/collections/hash/map.rs.html#1289
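For anyone following along, the two forms really are equivalent; `Option::is_some_and` (stable since Rust 1.70) just reads better than `map_or(false, ...)`:

```rust
use std::collections::HashMap;

fn main() {
    let m: HashMap<&str, i32> = HashMap::from([("coverage", 78)]);
    // map_or(false, f) and is_some_and(f) agree for present keys...
    assert_eq!(
        m.get("coverage").map_or(false, |v| *v > 50),
        m.get("coverage").is_some_and(|v| *v > 50),
    );
    // ...and for absent keys, where both return false.
    assert_eq!(
        m.get("missing").map_or(false, |v| *v > 50),
        m.get("missing").is_some_and(|v| *v > 50),
    );
}
```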
Rescuing #331 (comment):
We're still only at 9% patch coverage. We might need to take @zachschuermann's advice and make a schema just for this?
Maybe you could reuse this one from the schema depth checker test?
https://github.com/delta-incubator/delta-kernel-rs/blob/main/kernel/src/schema.rs#L1005
It has a wide variety of nested structures, at least.
Down to 7% now...
Yep, I've added some more tests and it's up now
    ),
    ])
    .map(|(arrow_field, (target_type, nullable))| {
        StructField::new(arrow_field.name(), target_type.clone(), nullable)
Just double checking -- no metadata here, because this is a hidden/internal struct that isn't part of the user-visible schema?
Yep
Thanks! Appreciate all the feedback. Adding tests actually uncovered that we weren't properly validating the nullability of array/map elements.
One last round of simplification and I think we're done!
    let transformed_field = new_field_with_metadata(
        field.name(),
        transformed_values.data_type(),
        target_inner_type.contains_null,
        None,
    );
Heh... after all that, I just realized that this one is simply

    -    let transformed_field = new_field_with_metadata(
    -        field.name(),
    -        transformed_values.data_type(),
    -        target_inner_type.contains_null,
    -        None,
    -    );
    +    let transformed_field = ArrowField::new(
    +        field.name(),
    +        transformed_values.data_type().clone(),
    +        target_inner_type.contains_null,
    +    );

... which leaves `new_field_with_metadata` with only a single call site.
That said, we probably should keep the helper function, and update our impl TryFrom<&StructField> for ArrowField to use it.
Agree on updating that. Right now those things are protected by different feature flags.
I've filed #411 to sort this out and make the above suggested change.
    (
        DataType::Primitive(PrimitiveType::Decimal(kernel_prec, kernel_scale)),
        ArrowDataType::Decimal128(arrow_prec, arrow_scale),
    ) if arrow_prec == kernel_prec && *arrow_scale == *kernel_scale as i8 => {
        // decimal isn't primitive in arrow. cast above is okay as we limit range
        Ok(DataTypeCompat::Identical)
    }
Was this match arm deleted on purpose?
Does the cast-compat path cover it now, perhaps?
Yes, exactly. Should have noted that in a comment here.
kernel/src/engine/arrow_utils.rs
Outdated
    ) -> bool {
        if kernel_metadata.len() != arrow_metadata.len() {
It's not rare for fields to have no metadata, so perhaps worth doing:

         ) -> bool {
    +        if kernel_metadata.is_empty() && arrow_metadata.is_empty() {
    +            return true;
    +        }
             if kernel_metadata.len() != arrow_metadata.len() {

But on the other hand, an empty iterator should be very cheap to traverse, so meh?
`is_empty()` just calls `self.len() == 0`. So since we're optimizing here, I just went with saving the kernel len and then early-exiting if the lens are equal and the kernel len is 0.
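The early-exit shape described above might look like this (a sketch using plain `String` values on both sides; the real signature compares kernel and arrow metadata types):

```rust
use std::collections::HashMap;

// Sketch: save the kernel len once, then return early when both maps are
// empty -- the common case for fields that carry no metadata at all.
fn metadata_eq(
    kernel: &HashMap<String, String>,
    arrow: &HashMap<String, String>,
) -> bool {
    let kernel_len = kernel.len();
    if kernel_len != arrow.len() {
        return false;
    }
    if kernel_len == 0 {
        return true; // lens are equal and kernel is empty: both empty
    }
    kernel.iter().all(|(k, v)| arrow.get(k).is_some_and(|a| a == v))
}

fn main() {
    assert!(metadata_eq(&HashMap::new(), &HashMap::new()));
    let a = HashMap::from([("x".to_string(), "1".to_string())]);
    assert!(!metadata_eq(&a, &HashMap::new()));
}
```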
kernel/src/engine/arrow_utils.rs
Outdated
    @@ -126,15 +127,46 @@ fn can_upcast_to_decimal(
        && target_precision - source_precision >= (target_scale - source_scale) as u8
    }

    /// check if two fields have the same nullability and metadata
    fn test_nullability_and_metadata(
Naming? This sounds like a unit test at first glance... maybe something like `verify_nullability_and_metadata_eq` would be clearer?
Or maybe `ensure_nullability_and_metadata`, following the same convention as `ensure_data_types`?
Also: Suggest to split out `ensure_nullability` as a separate method, e.g.

    fn ensure_nullability(
        kernel_nullable: bool,
        arrow_nullable: bool,
        context: impl FnOnce() -> String,
    ) -> DeltaResult<()> {
        if kernel_nullable == arrow_nullable {
            Ok(())
        } else {
            Err(Error::Generic(format!(
                "{} {kernel_nullable} in kernel and {arrow_nullable} in arrow",
                context(),
            )))
        }
    }

and then all three call sites can use it:

    ensure_nullability(
        kernel_field.nullable,
        arrow_field.is_nullable(),
        || format!("Field {} has nullability", field.name),
    )?

and e.g.

    if check_nullability_and_metadata {
        ensure_nullability(
            inner_type.contains_null,
            arrow_list_field.is_nullable(),
            || "List has contains_null".to_string(),
        )?;
    }
We could go a step further and wrap all this up in a helper struct:

    // Convenient public entry point
    pub(crate) fn ensure_data_types(...) -> DeltaResult<DataTypeCompat> {
        let check = EnsureDataTypes { check_nullability_and_metadata };
        check.ensure_data_types(kernel_type, arrow_type)
    }

    struct EnsureDataTypes {
        check_nullability_and_metadata: bool,
    }

    impl EnsureDataTypes {
        fn ensure_data_types(&self, ...) -> DeltaResult<DataTypeCompat> {
            ... today's match statement, but simpler ...
        }
        fn ensure_nullability(&self, ...) -> DeltaResult<()> {
            if self.check_nullability_and_metadata && kernel_nullable != arrow_nullable {
                Err(...)
            } else {
                Ok(())
            }
        }
        fn ensure_nullability_and_metadata(&self, ...) -> DeltaResult<()> {
            self.ensure_nullability(...)?;
            if self.check_nullability_and_metadata
                && !metadata_eq(&kernel_field.metadata, arrow_field.metadata())
            {
                Err(...)
            } else {
                Ok(())
            }
        }
    }
Then, the implementation simplifies to things like:

    (DataType::Array(inner_type), ArrowDataType::List(arrow_list_field)) => {
        self.ensure_nullability(
            inner_type.contains_null,
            arrow_list_field.is_nullable(),
            || "List has contains_null",
        )?;
        self.ensure_data_types(
            &inner_type.element_type,
            arrow_list_field.data_type(),
        )
    }
Yep this is nicer.
I ended up also moving all of the `ensure_x` stuff to its own module, as the `utils` mod was getting real big.
kernel/src/engine/arrow_utils.rs
Outdated
    if check_nullability_and_metadata
        && inner_type.contains_null != arrow_list_field.is_nullable()
Why not just use the helper method? There should never be any metadata on either field, so that part will be a no-op?
Oh -- because we're not comparing two fields here!
(similar story for maps below)
Yeah. I split out that method thinking I would call it in multiple places, then realized I wouldn't, but this method is big enough that I thought having the nullability check pulled out still made sense.
    @@ -5,26 +5,32 @@
    #[cfg(feature = "arrow-conversion")]
    pub(crate) mod arrow_conversion;

    #[cfg(feature = "arrow-expression")]
    #[cfg(all(
This was broken before because `arrow_expression` uses stuff in `arrow_utils`, so enabling only `arrow_expression` would not have compiled.
    pub mod arrow_expression;

    #[cfg(any(feature = "default-engine", feature = "sync-engine"))]
Moved this to a macro to avoid the repeated checking of the two feature flags. Hopefully reviewers agree it's better :)
    }

    /// Capture the compatibility between two data-types, as passed to [`ensure_data_types`]
    pub(crate) enum DataTypeCompat {
note this all moved to ensure_data_types.rs
Nice!
Just a few suggestions to consider before merging.
    )
    }
    (DataType::Map(kernel_map_type), ArrowDataType::Map(arrow_map_type, _)) => {
        if let ArrowDataType::Struct(fields) = arrow_map_type.data_type() {
Good place for the let-else idiom?

    let ArrowDataType::Struct(fields) = arrow_map_type.data_type() else {
        return Err(make_arrow_error("Arrow map type wasn't a struct."));
    };
    let mut fields = fields.iter();
    if let Some(key_type) = fields.next() {
Unlike kernel, arrow fields Deref to slice. So we should be able to do:
let [key_type, value_type] = fields else {
return Err(...);
};
self.ensure_data_types(... key ...);
self.ensure_data_types(... value ...);
self.ensure_nullability(... value ...);
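The slice-pattern trick works on anything that derefs to a slice. A standalone sketch using plain slices (rather than arrow's `Fields`, so the function name and error shape here are illustrative only):

```rust
// Destructure exactly two entries out of a slice with let-else, the same
// shape suggested above for the map's key/value fields.
fn key_value<'a>(fields: &[&'a str]) -> Result<(&'a str, &'a str), String> {
    let [key, value] = fields else {
        // Wrong arity: not a well-formed map entry struct.
        return Err(format!("expected 2 map fields, got {}", fields.len()));
    };
    Ok((*key, *value))
}

fn main() {
    assert_eq!(key_value(&["key", "value"]), Ok(("key", "value")));
    assert!(key_value(&["only-one"]).is_err());
    assert!(key_value(&[]).is_err());
}
```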
Co-authored-by: Ryan Johnson <[email protected]>
Enforce/apply the schema given when we evaluate an expression.
Given that we want to move to expression-based fixup and allow the final schema to dictate the output, we will need to do this.
This code will fix up all levels of the output, which is messy in arrow since schemas are embedded all over the place. The schema is only applied if the output of the expression doesn't exactly match the passed schema.