
Implement schema compatibility check #554

Draft · wants to merge 2 commits into base: main

Conversation

@OussamaSaoudi-db (Collaborator) commented Nov 29, 2024

What changes are proposed in this pull request?

This PR introduces a schema compatibility check that can be used to validate that schema updates in a CDF range are compatible with the final CDF schema.

Closes #523
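
For a concrete sense of what the check rejects, a minimal sketch (illustrative schemas, calling the `is_struct_read_compatible` helper added by this PR):

```rust
use crate::schema::{DataType, StructField, StructType};

fn nullability_tightening_example() {
    // Schema at an earlier commit in the CDF range: "id" is nullable.
    let existing = StructType::new(vec![StructField::new("id", DataType::LONG, true)]);

    // Final CDF schema declares "id" non-nullable. Earlier commits may hold
    // null ids, so reading them under the final schema would be unsound and
    // the check should reject the range.
    let read = StructType::new(vec![StructField::new("id", DataType::LONG, false)]);

    assert!(!is_struct_read_compatible(&existing, &read));
}
```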

How was this change tested?

Schema compatibility tests are added that check the following:

  • TODO

codecov bot commented Nov 29, 2024

Codecov Report

Attention: Patch coverage is 97.75281% with 6 lines in your changes missing coverage. Please review.

Project coverage is 81.01%. Comparing base (d24c76b) to head (ba3d0ad).
Report is 4 commits behind head on main.

Files with missing lines                    Patch %   Lines
kernel/src/table_changes/schema_compat.rs   97.75%    6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #554      +/-   ##
==========================================
+ Coverage   80.75%   81.01%   +0.26%     
==========================================
  Files          67       68       +1     
  Lines       14080    14545     +465     
  Branches    14080    14545     +465     
==========================================
+ Hits        11370    11784     +414     
- Misses       2134     2187      +53     
+ Partials      576      574       -2     


);
name_equal && nullability_equal && data_type_equal
}
None => read_field.is_nullable(),
Collaborator Author (@OussamaSaoudi-db):

The None case is a point at which I differ from the delta implementation. I'm not convinced by the code there. If we don't find the read field in the existing schema, then we just ignore it. I think this should only pass if the new field in the read schema is nullable.

I may be missing something tho 🤔
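
A concrete case for the argument above (illustrative schemas; the "score" field is hypothetical):

```rust
use crate::schema::{DataType, StructField, StructType};

fn new_field_nullability_example() {
    // "score" did not exist when this commit was written, so none of its
    // rows carry a value for it.
    let existing = StructType::new(vec![StructField::new("id", DataType::LONG, false)]);

    // Read schema adds "score" as non-nullable: there is nothing valid to
    // materialize for the old rows, so this should fail...
    let read = StructType::new(vec![
        StructField::new("id", DataType::LONG, false),
        StructField::new("score", DataType::DOUBLE, false),
    ]);
    assert!(!is_struct_read_compatible(&existing, &read));

    // ...whereas a nullable "score" can simply surface as null for old rows.
    let read = StructType::new(vec![
        StructField::new("id", DataType::LONG, false),
        StructField::new("score", DataType::DOUBLE, true),
    ]);
    assert!(is_struct_read_compatible(&existing, &read));
}
```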

Collaborator (reviewer):

Nullability is... complicated. But I think what you say makes sense -- technically it could be ok for the read field to not be nullable, if the parent is nullable and the parent is null for all rows where the child is null. But if the parent is hard-wired null then we shouldn't be recursing to its children in the first place.
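
Making the parent/child case concrete (illustrative schemas):

```rust
use crate::schema::{DataType, StructField, StructType};

fn nested_nullability_example() {
    // existing: nullable struct "info" with a nullable child "tag".
    let existing = StructType::new(vec![StructField::new(
        "info",
        StructType::new(vec![StructField::new("tag", DataType::STRING, true)]),
        true,
    )]);

    // read: "info.tag" declared non-nullable. This is only safe if "tag" is
    // null solely on rows where "info" itself is null -- a data property the
    // schema check cannot see, so the conservative answer is to reject.
    let read = StructType::new(vec![StructField::new(
        "info",
        StructType::new(vec![StructField::new("tag", DataType::STRING, false)]),
        true,
    )]);
    assert!(!is_struct_read_compatible(&existing, &read));
}
```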

// == read_nullable || !existing_nullable
read_nullable || !existing_nullable
}
fn is_struct_read_compatible(existing: &StructType, read_type: &StructType) -> bool {
Collaborator Author (@OussamaSaoudi-db):

I was considering using a DeltaResult instead of a bool so we can return better errors about how a schema differs. Thoughts?

Collaborator (reviewer):

I think that makes sense. Something similar to ValidateColumnMappings in #543, which returns an Err with the offending column name path?
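
A minimal sketch of the error-returning shape under discussion; the helper name, error text, and path threading are assumptions, not code from this PR:

```rust
use crate::{DeltaResult, Error};

// Sketch: the recursion would push each field name onto `path` before
// descending and pop it after, so the first mismatch reports the full
// dotted column path instead of collapsing everything into a bare `false`.
fn incompatible(path: &[String], detail: &str) -> DeltaResult<()> {
    Err(Error::generic(format!(
        "read schema incompatible at column '{}': {}",
        path.join("."),
        detail
    )))
}
```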

Some(existing_field) => {
let name_equal = existing_field.name() == read_field.name();

let nullability_equal =
Collaborator (reviewer):

Suggested change:
- let nullability_equal =
+ let nullability_compatible =

Comment on lines +53 to +54
is_datatype_read_compatible(a.element_type(), b.element_type())
&& is_nullability_compatible(a.contains_null(), b.contains_null())
Collaborator (reviewer):

tiny nit: consider swapping these lines to structurally match the code below?
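
Applied, the swap would read (same predicates, nullability first):

```rust
is_nullability_compatible(a.contains_null(), b.contains_null())
    && is_datatype_read_compatible(a.element_type(), b.element_type())
```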


use crate::schema::{DataType, Schema, StructField, StructType};

fn is_nullability_compatible(existing_nullable: bool, read_nullable: bool) -> bool {
// The case to avoid is when the read_schema is non-nullable and the existing one is nullable.
Collaborator (reviewer):

"avoid" as in "it's illegal to attempt reading a nullable underlying as non-nullable"? (maybe just say that?)

Collaborator (reviewer):

Also, this method takes two args of the same type, but it is not commutative. Subtly error-prone, and I don't know the best way to make it safe? The arg names are a good start, but rust doesn't allow named args at call sites. And the name of the function does not give any indication of the correct arg order.

Is it worth using a struct just to force named args? Seems clunky. Or maybe we can choose an asymmetric function name of some kind, that indicates which arg comes first?

(whatever solution we choose, we should probably apply it to the is_struct_read_compatible as well)

Collaborator (reviewer):

Another possibility: Add these as methods on StructType itself? Then callers would be encouraged to do things like:

table_schema.can_read_as(read_schema)

... but I don't know a good way to do that for the nullability compat check since it's a plain boolean and doesn't always apply to a struct field (can also be array element or map value).

We could define a helper trait for struct/map/array, but that just pushes the problem to the trait impl (and there is only one call site for each type right now).
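
A sketch of the method shape, with `can_read_as` taken from the comment above; the impl placement and visibility are assumptions:

```rust
use crate::schema::StructType;

impl StructType {
    /// Can data written under `self` be read under `read_schema`?
    /// Making `self` the receiver fixes the argument order that a free
    /// two-argument function leaves ambiguous at call sites.
    pub(crate) fn can_read_as(&self, read_schema: &StructType) -> bool {
        is_struct_read_compatible(self, read_schema)
    }
}
```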

Comment on lines +19 to +32
let existing_names: HashSet<String> = existing
.fields()
.map(|field| field.name().clone())
.collect();
let read_names: HashSet<String> = read_type
.fields()
.map(|field| field.name().clone())
.collect();
if !existing_names.is_subset(&read_names) {
return false;
}
read_type
.fields()
.all(|read_field| match existing_fields.get(read_field.name()) {
Collaborator (reviewer):

It seems like we should only need to materialize a hash set for one side (build), and just stream the other side's fields past it (probe)?

Also: kernel's StructType::fields member is already an IndexMap so you should have O(1) name lookups without building any additional hash sets.
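
A sketch of the streamed version, assuming `StructType::field` gives the O(1) by-name lookup backed by the `IndexMap`; `is_field_read_compatible` stands in for the per-field logic shown earlier in the thread:

```rust
use crate::schema::StructType;

fn is_struct_read_compatible(existing: &StructType, read_type: &StructType) -> bool {
    // Pass one: every existing field must still be present in the read
    // schema (no dropped columns) -- no extra HashSet needed.
    existing
        .fields()
        .all(|field| read_type.field(field.name()).is_some())
        // Pass two: every read field must be compatible with its existing
        // counterpart, or be a nullable addition.
        && read_type.fields().all(|read_field| {
            match existing.field(read_field.name()) {
                // Hypothetical helper covering name/nullability/type checks.
                Some(existing_field) => is_field_read_compatible(existing_field, read_field),
                None => read_field.is_nullable(),
            }
        })
}
```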

Successfully merging this pull request may close these issues: Allow CDF scans with schema evolution (#523).