Implement schema compatibility check #554
base: main
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #554      +/-   ##
==========================================
+ Coverage   80.75%   81.01%   +0.26%
==========================================
  Files          67       68       +1
  Lines       14080    14545     +465
  Branches    14080    14545     +465
==========================================
+ Hits        11370    11784     +414
- Misses       2134     2187      +53
+ Partials      576      574       -2
☔ View full report in Codecov by Sentry.
            );
            name_equal && nullability_equal && data_type_equal
        }
        None => read_field.is_nullable(),
The `None` case is a point at which I differ from the delta implementation. I'm not convinced by the code there. If we don't find the read field in the existing schema, then we just ignore it. I think this should only pass if the new field in the read schema is nullable.
I may be missing something tho 🤔
Nullability is... complicated. But I think what you say makes sense -- technically it could be ok for the read field to not be nullable, if the parent is nullable and the parent is null for all rows where the child is null. But if the parent is hard-wired null then we shouldn't be recursing to its children in the first place.
    // == read_nullable || !existing_nullable
    read_nullable || !existing_nullable
}

fn is_struct_read_compatible(existing: &StructType, read_type: &StructType) -> bool {
I was considering using a `DeltaResult` instead of a bool so we can return better errors about how a schema differs. Thoughts?
I think that makes sense. Something similar to ValidateColumnMappings in #543, which returns an `Err` with the offending column name path?
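Roughly like this, maybe? Only a sketch, not this PR's code: it assumes an `Error::Schema`-style variant (or whatever constructor we settle on) and direct access to the `fields` IndexMap, and it elides the name/data-type arms.

```rust
use crate::schema::StructType;
use crate::{DeltaResult, Error};

// Sketch: same walk as the PR, but returning DeltaResult<()> so a failure
// can name the offending field. Error::Schema stands in for whichever
// error variant the crate actually uses.
fn check_struct_read_compatible(existing: &StructType, read_type: &StructType) -> DeltaResult<()> {
    for read_field in read_type.fields() {
        match existing.fields.get(read_field.name()) {
            Some(existing_field) => {
                if !is_nullability_compatible(
                    existing_field.is_nullable(),
                    read_field.is_nullable(),
                ) {
                    return Err(Error::Schema(format!(
                        "cannot read nullable column '{}' as non-nullable",
                        read_field.name()
                    )));
                }
                // ... same pattern for the name and data-type checks ...
            }
            // A field missing from the existing schema must be nullable in
            // the read schema, per the discussion above.
            None if !read_field.is_nullable() => {
                return Err(Error::Schema(format!(
                    "new column '{}' in read schema is not nullable",
                    read_field.name()
                )));
            }
            None => {}
        }
    }
    Ok(())
}
```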
        Some(existing_field) => {
            let name_equal = existing_field.name() == read_field.name();

            let nullability_equal =
Suggested change:
-            let nullability_equal =
+            let nullability_compatible =
            is_datatype_read_compatible(a.element_type(), b.element_type())
                && is_nullability_compatible(a.contains_null(), b.contains_null())
tiny nit: consider swapping these lines to structurally match the code below?
use crate::schema::{DataType, Schema, StructField, StructType};

fn is_nullability_compatible(existing_nullable: bool, read_nullable: bool) -> bool {
    // The case to avoid is when the read_schema is non-nullable and the existing one is nullable.
"avoid" as in "it's illegal to attempt reading a nullable underlying as non-nullable"? (maybe just say that?)
Also, this method takes two args of the same type, but it is not commutative. Subtly error-prone, and I don't know the best way to make it safe? The arg names are a good start, but rust doesn't allow named args at call sites. And the name of the function does not give any indication of the correct arg order.
Is it worth using a struct just to force named args? Seems clunky. Or maybe we can choose an asymmetric function name of some kind, that indicates which arg comes first?
(whatever solution we choose, we should probably apply it to `is_struct_read_compatible` as well)
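For the record, one shape the newtype idea could take (purely illustrative, names hypothetical):

```rust
// Newtype wrappers as "named args": swapping the two sides at a call site
// becomes a type error instead of a silent bug.
struct ExistingNullable(bool);
struct ReadNullable(bool);

fn is_nullability_compatible(existing: ExistingNullable, read: ReadNullable) -> bool {
    // It's illegal to read a nullable underlying field as non-nullable.
    read.0 || !existing.0
}

fn main() {
    // The call site now documents itself:
    assert!(is_nullability_compatible(ExistingNullable(false), ReadNullable(true)));
    assert!(!is_nullability_compatible(ExistingNullable(true), ReadNullable(false)));
}
```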
Another possibility: Add these as methods on `StructType` itself? Then callers would be encouraged to do things like:
`table_schema.can_read_as(read_schema)`
... but I don't know a good way to do that for the nullability compat check since it's a plain boolean and doesn't always apply to a struct field (can also be array element or map value).
We could define a helper trait for struct/map/array, but that just pushes the problem to the trait impl (and there is only one call site for each type right now).
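A rough sketch of that spelling, assuming the free function from this PR stays around as the implementation:

```rust
// Sketch only. The asymmetric method name pins down which schema plays
// which role: the receiver is the written/existing schema.
impl StructType {
    /// Can data written with `self` as its schema be read using `read_schema`?
    pub fn can_read_as(&self, read_schema: &StructType) -> bool {
        is_struct_read_compatible(self, read_schema)
    }
}
```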
    let existing_names: HashSet<String> = existing
        .fields()
        .map(|field| field.name().clone())
        .collect();
    let read_names: HashSet<String> = read_type
        .fields()
        .map(|field| field.name().clone())
        .collect();
    if !existing_names.is_subset(&read_names) {
        return false;
    }
    read_type
        .fields()
        .all(|read_field| match existing_fields.get(read_field.name()) {
It seems like we should only need to materialize a hash set for one side (build), and just stream the other side's fields past it (probe)?
Also: kernel's `StructType::fields` member is already an `IndexMap`, so you should have O(1) name lookups without building any additional hash sets.
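Roughly this shape (sketch only; `is_field_read_compatible` is a hypothetical stand-in for the per-field checks above, and direct access to the `fields` IndexMap is assumed):

```rust
// Build/probe shape: no extra HashSets, since the IndexMap member already
// gives O(1) lookups by name.
fn is_struct_read_compatible(existing: &StructType, read_type: &StructType) -> bool {
    // Every existing field must still be present in the read schema ...
    existing
        .fields()
        .all(|field| read_type.fields.contains_key(field.name()))
        // ... and every read field must match its existing counterpart,
        // or be nullable if it is brand new.
        && read_type.fields().all(|read_field| {
            match existing.fields.get(read_field.name()) {
                Some(existing_field) => is_field_read_compatible(existing_field, read_field),
                None => read_field.is_nullable(),
            }
        })
}
```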
What changes are proposed in this pull request?
This PR introduces a schema compatibility check that can be used to validate that schema updates in a CDF range are compatible with the final CDF schema.
Closes #523
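For illustration, one way a CDF scan could apply the check when replaying a commit range. This is not code from the PR: `validate_cdf_commit_schema` is a made-up name and `Error::generic` is assumed to exist as a constructor.

```rust
// Data written under each commit's schema in the range must be readable
// with the final (end) schema of the CDF query.
fn validate_cdf_commit_schema(
    end_schema: &StructType,
    commit_schema: &StructType,
) -> DeltaResult<()> {
    if is_struct_read_compatible(commit_schema, end_schema) {
        Ok(())
    } else {
        Err(Error::generic(
            "schema change in CDF range is incompatible with the final CDF schema",
        ))
    }
}
```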
How was this change tested?
Schema compatibility tests are added that check the following: