feat: Add RecordBatchOptions::skip_schema_check option #6855
Conversation
Makes sense to me -- thanks @andygrove
I'm really unsure about this, as it will break things in unexpected ways; lots of codepaths assume the schema is correct. What is the motivation for having RecordBatches share the same schema when that schema is incorrect for some of them? Why does the schema need to be the same?
The motivation is that when reading Parquet files for one table, the physical type is not the same for all batches because sometimes a column is dictionary-encoded and sometimes it is not. DataFusion requires that each operator has a single fixed schema for all batches, so we currently have to coerce all batches into the same schema. This is a DataFusion limitation rather than an Arrow limitation, but DataFusion uses Arrow's RecordBatch. It would be nice eventually if DataFusion would just require the logical schema to be the same for all batches but allow differences in the physical type.
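As a minimal sketch (not from this PR) of the coercion described above: a column that arrives dictionary-encoded in some batches is cast back to plain Utf8 so every batch matches one fixed schema. The column name and data are made up for the example.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, DictionaryArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, Field, Int32Type, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), ArrowError> {
    // The single schema every batch is coerced to (plain Utf8, not dictionary).
    let target = Arc::new(Schema::new(vec![Field::new("name", DataType::Utf8, true)]));

    // This particular batch happens to arrive with the column dictionary-encoded.
    let dict: DictionaryArray<Int32Type> = vec!["a", "b", "a"].into_iter().collect();
    let dict_col: ArrayRef = Arc::new(dict);

    // Cast the dictionary column to the target physical type so the
    // RecordBatch schema check passes.
    let coerced = cast(&dict_col, target.field(0).data_type())?;
    let batch = RecordBatch::try_new(target.clone(), vec![coerced])?;
    assert_eq!(batch.schema(), target);
    Ok(())
}
```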
I think this is the key issue: the schema of the RecordBatch is the physical type. Arrow has no notion of a logical type, nor realistically can it, when what that looks like is so use-case specific: are Int32 and Int64 the same logical type? What about differing decimal precisions?

Ultimately, as the schema cannot vary within a single RecordBatch, the onus is on whatever is the origin of the inter-RecordBatch constraint to make a judgement on whether, and to what degree, it accepts heterogeneous inputs. The parquet writer, for example, allows for some variability in schema nullability - #4027.

This PR is effectively breaking a fairly fundamental invariant of RecordBatch to bypass checks in other components that are either necessary, because the component relies on them, or unnecessary and therefore could/should just be removed. Or to phrase it differently, I can't see what correct usage there could be of this API that isn't just working around an over-zealous constraint in some unrelated system.
I think my point would still stand with a custom RecordBatch; the constraint is on whatever it is that is consuming the RecordBatch. Having a RecordBatch lie about its contents seems odd...
Which issue does this PR close?
N/A
Rationale for this change
In Comet, I would like the freedom to specify a logical type of Utf8 in a schema and then allow some batches to use Dictionary<_, Utf8> instead. In other words, I still want to enforce the logical type of columns in a batch but do not want to enforce that they all have the same physical type.
I understand that I will need to cast the batches to all have the same physical types before passing them on to any DataFusion operators.
What changes are included in this PR?
New option in `RecordBatchOptions`.
Are there any user-facing changes?
Yes, if users are creating the `RecordBatchOptions` struct without using the builder API then they will need to specify this new option. Yup, this is a breaking change.
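As a rough illustration of how the proposed option might be used, here is a hypothetical sketch based on the PR title and description; the `skip_schema_check` field name comes from the title, and its exact shape and defaults are assumptions rather than the merged API.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, DictionaryArray};
use arrow::datatypes::{DataType, Field, Int32Type, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::{RecordBatch, RecordBatchOptions};

fn build_batch() -> Result<RecordBatch, ArrowError> {
    // The schema declares the logical type Utf8...
    let schema = Arc::new(Schema::new(vec![Field::new("name", DataType::Utf8, true)]));

    // ...but this particular batch carries the column dictionary-encoded.
    let dict: DictionaryArray<Int32Type> = vec!["a", "b", "a"].into_iter().collect();
    let col: ArrayRef = Arc::new(dict);

    // Hypothetical field proposed by this PR: opt out of the exact
    // physical-schema check for this batch.
    let options = RecordBatchOptions {
        skip_schema_check: true,
        ..Default::default()
    };
    RecordBatch::try_new_with_options(schema, vec![col], &options)
}
```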