
feat: Add RecordBatchOptions::skip_schema_check option #6855

Closed

Conversation

andygrove
Copy link
Member

Which issue does this PR close?

N/A

Rationale for this change

In Comet, I would like the freedom to specify a logical type of Utf8 in a schema and then allow some batches to use Dictionary<_, Utf8> instead. In other words, I still want to enforce the logical type of columns in a batch but do not want to enforce that they all have the same physical type.

I understand that I will need to cast the batches to all have the same physical types before passing them on to any DataFusion operators.
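For context, a minimal sketch (against the current public arrow APIs) of the mismatch described above: the schema declares Utf8, one batch arrives dictionary-encoded, and `RecordBatch::try_new` rejects it today:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, DictionaryArray, StringArray};
use arrow::datatypes::{DataType, Field, Int32Type, Schema};
use arrow::record_batch::RecordBatch;

fn main() {
    // The logical intent: a single Utf8 column named "name".
    let schema = Arc::new(Schema::new(vec![Field::new("name", DataType::Utf8, false)]));

    // A batch whose column is physically Utf8 matches the schema and is accepted.
    let plain: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "a"]));
    assert!(RecordBatch::try_new(schema.clone(), vec![plain]).is_ok());

    // A dictionary-encoded batch (Dictionary<Int32, Utf8>) carries the same
    // logical values but a different physical type, so try_new rejects it.
    let dict: DictionaryArray<Int32Type> = vec!["a", "b", "a"].into_iter().collect();
    assert!(RecordBatch::try_new(schema, vec![Arc::new(dict)]).is_err());
}
```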

What changes are included in this PR?

New option in RecordBatchOptions.

Are there any user-facing changes?

Yes, if users are creating the RecordBatchOptions struct without using the builder API, they will need to specify this new option.

Yup, this is a breaking change.
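For illustration, a rough sketch of what constructing the options directly might look like with the proposed field. `skip_schema_check` was never merged, so this is based only on this PR's title and description and is not a released arrow API:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, DictionaryArray};
use arrow::datatypes::{DataType, Field, Int32Type, Schema};
use arrow::record_batch::{RecordBatch, RecordBatchOptions};

fn main() -> Result<(), arrow::error::ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("name", DataType::Utf8, false)]));
    let dict: DictionaryArray<Int32Type> = vec!["a", "b", "a"].into_iter().collect();
    let columns: Vec<ArrayRef> = vec![Arc::new(dict)];

    // skip_schema_check is the field proposed in this PR; it does not exist in
    // released arrow, so this struct literal is hypothetical.
    let options = RecordBatchOptions {
        match_field_names: true,
        row_count: None,
        skip_schema_check: true,
    };
    let batch = RecordBatch::try_new_with_options(schema, columns, &options)?;
    println!("{} rows", batch.num_rows());
    Ok(())
}
```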

@andygrove andygrove added the api-change (Changes to the arrow API) label on Dec 7, 2024
@github-actions github-actions bot added the arrow (Changes to the arrow crate) label on Dec 7, 2024
Contributor

@alamb alamb left a comment


Makes sense to me -- thanks @andygrove

@tustvold
Contributor

tustvold commented Dec 8, 2024

I'm really unsure about this, as it will break things in unexpected ways; lots of codepaths assume the schema is correct. What is the motivation for having a RecordBatch with the same but incorrect schema? Why does the schema need to be the same?

@andygrove
Member Author

I'm really unsure about this, as it will break things in unexpected ways; lots of codepaths assume the schema is correct. What is the motivation for having a RecordBatch with the same but incorrect schema? Why does the schema need to be the same?

The motivation is that when reading Parquet files for one table, the physical type is not the same for all batches because sometimes a column is dictionary-encoded and sometimes it is not.

DataFusion requires that each operator has a single fixed schema for all batches, so we currently have to coerce all batches into the same schema. This is a DataFusion limitation rather than an Arrow limitation, but DataFusion uses Arrow's RecordBatch.

It would be nice eventually if DataFusion would just require the logical schema to be the same for all batches but allow differences in the physical type.
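For reference, the per-batch coercion described above can be sketched with arrow's `cast` kernel; the `coerce_batch` helper below is illustrative, not an existing arrow or DataFusion API:

```rust
use arrow::compute::cast;
use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Cast every column of `batch` to the physical type declared in `schema`,
/// e.g. unpacking a Dictionary<Int32, Utf8> column back to plain Utf8.
fn coerce_batch(batch: &RecordBatch, schema: SchemaRef) -> Result<RecordBatch, ArrowError> {
    let columns = batch
        .columns()
        .iter()
        .zip(schema.fields())
        .map(|(column, field)| cast(column.as_ref(), field.data_type()))
        .collect::<Result<Vec<_>, _>>()?;
    RecordBatch::try_new(schema, columns)
}
```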

@tustvold
Contributor

tustvold commented Dec 9, 2024

This is a DataFusion limitation rather than an Arrow limitation, but DataFusion uses Arrow's RecordBatch.
It would be nice eventually if DataFusion would just require the logical schema to be the same for all batches but allow differences in the physical type.

I think this is the key issue: the schema of the RecordBatch is the physical type. Arrow has no notion of a logical type, nor realistically can it when what that looks like is so use-case specific: are Int32 and Int64 the same logical type? What about differing decimal precisions?

Ultimately, as the schema cannot vary within a single RecordBatch, the onus is on whatever originates the inter-RecordBatch constraint to make a judgement on whether it accepts heterogeneous inputs, and to what degree it does so. The parquet writer, for example, allows for some variability in schema nullability (#4027).

This PR effectively breaks a fairly fundamental invariant of RecordBatch in order to bypass checks in other components that are either necessary, because the component relies on them, or unnecessary and therefore could/should just be removed. To phrase it differently, I can't see any correct usage of this API that isn't just working around an over-zealous constraint in some unrelated system.

@andygrove
Member Author

Thanks for the feedback @alamb and @tustvold. I'll close this PR and will raise an issue in DataFusion about potentially using a different RecordBatch implementation there rather than using Arrow's RecordBatch.

@andygrove andygrove closed this Dec 9, 2024
@andygrove andygrove deleted the record-batch-skip-schema-check branch December 9, 2024 17:56
@tustvold
Contributor

tustvold commented Dec 9, 2024

different RecordBatch implementation there rather than using Arrow's RecordBatch

I think my point would still stand with a custom RecordBatch: the constraint is on whatever is consuming the RecordBatch. Having a RecordBatch lie about its contents seems odd...
