feat: adopt kernel schema types #2495
Conversation
ArrowDataType::Decimal128(p, s) => {
    Ok(DataType::Primitive(PrimitiveType::Decimal(*p, *s)))
}
ArrowDataType::Decimal256(p, s) => DataType::decimal(*p, *s).map_err(|_| {
This Arrow type is missing in kernel; I think that should be upstreamed there as well.
Hmm, given that precision/scale are limited to 38, does that make sense? I'm not entirely sure anymore, but I think the 256-bit type only makes sense for larger precision/scale values.
It's mostly for convenience. Users might have source data using decimal256 while still keeping precision/scale below 38.
@@ -1458,7 +1458,7 @@ def test_invalid_decimals(tmp_path: pathlib.Path, engine):
     with pytest.raises(
         SchemaMismatchError,
-        match=re.escape("Invalid data type for Delta Lake: decimal(39,1)"),
+        match=re.escape("Invalid data type for Delta Lake: Decimal256(39, 1)"),
this change is a bit more significant than meets the eye: we are no longer trying to convert 256-bit decimals to a compliant type in Delta. The main reason being, for precisions that fit into 128-bit decimals, the user should really be using those; if the larger type is required, we cannot store it in the table anyhow.
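To make the discussed behavior concrete, here is a minimal sketch (not the PR's actual code; the names `DeltaDecimal` and `decimal256_to_delta` are illustrative): a 256-bit decimal whose precision fits Delta's 38-digit limit maps to an ordinary Delta decimal, while anything larger is rejected with an error in the style the updated test expects.

```rust
// Sketch only: illustrative names, assuming Delta's 38-digit precision cap.
const MAX_DELTA_PRECISION: u8 = 38;

#[derive(Debug, PartialEq)]
struct DeltaDecimal {
    precision: u8,
    scale: u8,
}

/// Accept a 256-bit decimal only when it fits Delta's 38-digit limit;
/// otherwise reject it, since the value could not be stored in the table.
fn decimal256_to_delta(precision: u8, scale: u8) -> Result<DeltaDecimal, String> {
    if precision > MAX_DELTA_PRECISION {
        return Err(format!(
            "Invalid data type for Delta Lake: Decimal256({precision}, {scale})"
        ));
    }
    Ok(DeltaDecimal { precision, scale })
}

fn main() {
    // precision 30 fits into a 128-bit decimal, so it is accepted
    assert!(decimal256_to_delta(30, 2).is_ok());
    // precision 39 exceeds the cap and is rejected
    assert_eq!(
        decimal256_to_delta(39, 1).unwrap_err(),
        "Invalid data type for Delta Lake: Decimal256(39, 1)"
    );
}
```

The key design point mirrored here is that the precision check, not the physical Arrow width, decides storability.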
crates/core/src/kernel/scalars.rs
Outdated
Struct(fields) => {
    let struct_fields = fields
        .iter()
        .flat_map(|f| TryFrom::try_from(f.as_ref()))
        .collect::<Vec<_>>();
    let values = arr
        .as_any()
        .downcast_ref::<StructArray>()
        .and_then(|struct_arr| {
            struct_fields
                .iter()
                .map(|f: &StructField| {
                    struct_arr
                        .column_by_name(f.name())
                        .and_then(|c| Self::from_array(c.as_ref(), index))
                })
                .collect::<Option<Vec<_>>>()
        })?;
    if struct_fields.len() != values.len() {
        return None;
    }
    Some(Self::Struct(
        StructData::try_new(struct_fields, values).ok()?,
    ))
}
@scovich - here is an example of creating a struct scalar in the wild.
@@ -315,8 +315,8 @@ impl Snapshot {
     let stats_fields = if let Some(stats_cols) = self.table_config().stats_columns() {
         stats_cols
             .iter()
-            .map(|col| match schema.field_with_name(col) {
-                Ok(field) => match field.data_type() {
+            .map(|col| match schema.field(col) {
The stats columns are defined for nested fields like parent_a.child_b. Looking at kernel code, schema.field will just look up by this name and fail. That is the reason I introduced #2519.
Is #2519 appropriate for the kernel, or should we support the nested lookup here and parse the stats col? Other uses of field_with_name should be audited to ensure they don't accept a nested name.
good question, let me investigate.
@alexwilcoxson-rel - started asking around whether it would be legal to have a nested field defined in that property.
There are a couple of questions that also come to mind, and maybe you have an opinion? Using "." as a separator is - AFAIK - just a convention (certainly a common one though). In Delta one could for instance define a column name that does contain a "." character under the columnMapping feature. In kernel we have some places where we do represent nested columns as dot-separated, but we already know that we eventually need an array representation.
One reason one might want to specify a nested field is if we cannot compute stats for some child fields. There we could also just be lenient and simply omit these (actually not sure what we do now :D). Is that your case as well, or do you have others?
btw - the fact that .fields() does not do any parsing / traversing is by design, i.e. in this case we would need to split here...
So it turns out nested fields are perfectly valid, but they do require some more parsing, since it is also legal to escape field names that contain special characters.
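The escaping mentioned above could be handled with a small path splitter. A sketch, assuming a backtick-quoting convention similar to Spark identifier quoting (whether kernel uses the same convention is an assumption here, and `split_column_path` is an illustrative name):

```rust
/// Split a stats-column path like "parent_a.child_b" into components,
/// honoring backtick-escaped names that may themselves contain dots,
/// e.g. "`a.b`.c" -> ["a.b", "c"]. Sketch only; the real escaping rules
/// should follow the Delta protocol.
fn split_column_path(path: &str) -> Vec<String> {
    let mut parts = Vec::new();
    let mut current = String::new();
    let mut in_escape = false;
    for ch in path.chars() {
        match ch {
            // a backtick toggles escaped mode and is dropped from the name
            '`' => in_escape = !in_escape,
            // an unescaped dot terminates the current component
            '.' if !in_escape => parts.push(std::mem::take(&mut current)),
            _ => current.push(ch),
        }
    }
    parts.push(current);
    parts
}

fn main() {
    assert_eq!(
        split_column_path("parent_a.child_b"),
        vec!["parent_a", "child_b"]
    );
    // the dot inside backticks is part of the name, not a separator
    assert_eq!(split_column_path("`a.b`.c"), vec!["a.b", "c"]);
}
```

Returning an array of components, rather than keeping the dotted string, matches the observation that kernel will eventually need an array representation for nested columns.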
I opened #2572 to track this.
In our case, if nested fields are omitted that's fine; we only care about the stats on the root fields for query performance.
We only included the nested fields because they are non-nullable fields of nullable parent fields, and if we didn't collect them, checkpoint parsing would break.
Also, at the time delta-rs did not respect the stats configuration, so we just used the parquet writer properties to only collect stats on what we needed.
Suggested some changes, but I would rather see this merged than wait for my nitpicks 😆
(also don't want to block up more testing before Data and AI Summit)
delta_kernel = { version = "0.1" }
# delta_kernel = { path = "../delta-kernel-rs/kernel" }
Suggested change:
-delta_kernel = { version = "0.1" }
-# delta_kernel = { path = "../delta-kernel-rs/kernel" }
+delta_kernel = { version = "0.1", path = "../delta-kernel-rs/kernel" }
These can be combined for ease of development, since cargo allows multiple locations.
delta_kernel.workspace = true
Suggested change:
-delta_kernel.workspace = true
+delta-kernel = { workspace = true }
Description
First pass adopting delta_kernel in delta-rs. Depends on delta-io/delta-kernel-rs#189 being merged and released.
This PR focuses on the schema types. Adopting the action types would be a follow-up, but might have an even greater blast radius than this one.
Related Issue(s)
part of #2489
Documentation