
feat: Add RecordBatchOptions::skip_schema_check option #6855

Closed

Conversation

andygrove
Copy link
Member

Which issue does this PR close?

N/A

Rationale for this change

In Comet, I would like the freedom to specify a logical type of Utf8 in a schema and then allow some batches to use Dictionary<_, Utf8> instead. In other words, I still want to enforce the logical type of columns in a batch but do not want to enforce that they all have the same physical type.

I understand that I will need to cast the batches to all have the same physical types before passing them on to any DataFusion operators.
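For context, a minimal sketch (against the current public arrow APIs) of the mismatch described above: the schema declares Utf8, one batch arrives dictionary-encoded, and `RecordBatch::try_new` rejects it today:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, DictionaryArray, StringArray};
use arrow::datatypes::{DataType, Field, Int32Type, Schema};
use arrow::record_batch::RecordBatch;

fn main() {
    // The logical intent: a single Utf8 column named "name".
    let schema = Arc::new(Schema::new(vec![Field::new("name", DataType::Utf8, false)]));

    // A batch whose column is physically Utf8 matches the schema and is accepted.
    let plain: ArrayRef = Arc::new(StringArray::from(vec!["a", "b", "a"]));
    assert!(RecordBatch::try_new(schema.clone(), vec![plain]).is_ok());

    // A dictionary-encoded batch (Dictionary<Int32, Utf8>) carries the same
    // logical values but a different physical type, so try_new rejects it.
    let dict: DictionaryArray<Int32Type> = vec!["a", "b", "a"].into_iter().collect();
    assert!(RecordBatch::try_new(schema, vec![Arc::new(dict)]).is_err());
}
```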

What changes are included in this PR?

New option in RecordBatchOptions.

Are there any user-facing changes?

Yes, if users are creating the RecordBatchOptions struct without using the builder API, they will need to specify this new option.

Yup, this is a breaking change.
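For illustration, a rough sketch of what constructing the options directly might look like with the proposed field. `skip_schema_check` was never merged, so this is based only on this PR's title and description and is not a released arrow API:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, DictionaryArray};
use arrow::datatypes::{DataType, Field, Int32Type, Schema};
use arrow::record_batch::{RecordBatch, RecordBatchOptions};

fn main() -> Result<(), arrow::error::ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("name", DataType::Utf8, false)]));
    let dict: DictionaryArray<Int32Type> = vec!["a", "b", "a"].into_iter().collect();
    let columns: Vec<ArrayRef> = vec![Arc::new(dict)];

    // skip_schema_check is the field proposed in this PR; it does not exist in
    // released arrow, so this struct literal is hypothetical.
    let options = RecordBatchOptions {
        match_field_names: true,
        row_count: None,
        skip_schema_check: true,
    };
    let batch = RecordBatch::try_new_with_options(schema, columns, &options)?;
    println!("{} rows", batch.num_rows());
    Ok(())
}
```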

@andygrove andygrove added the api-change (Changes to the arrow API) label on Dec 7, 2024
@github-actions github-actions bot added the arrow (Changes to the arrow crate) label on Dec 7, 2024
Contributor

@alamb alamb left a comment


Makes sense to me -- thanks @andygrove

@tustvold
Contributor

tustvold commented Dec 8, 2024

I'm really unsure about this, as it will break things in unexpected ways; lots of codepaths assume the schema is correct. What is the motivation for having a RecordBatch with the same but incorrect schema? Why does the schema need to be the same?

@andygrove
Member Author

I'm really unsure about this, as it will break things in unexpected ways; lots of codepaths assume the schema is correct. What is the motivation for having a RecordBatch with the same but incorrect schema? Why does the schema need to be the same?

The motivation is that when reading Parquet files for one table, the physical type is not the same for all batches because sometimes a column is dictionary-encoded and sometimes it is not.

DataFusion requires that each operator has a single fixed schema for all batches, so we currently have to coerce all batches into the same schema. This is a DataFusion limitation rather than an Arrow limitation, but DataFusion uses Arrow's RecordBatch.

It would be nice eventually if DataFusion would just require the logical schema to be the same for all batches but allow differences in the physical type.
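For reference, the per-batch coercion described above can be sketched with arrow's `cast` kernel; the `coerce_batch` helper below is illustrative, not an existing arrow or DataFusion API:

```rust
use arrow::compute::cast;
use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Cast every column of `batch` to the physical type declared in `schema`,
/// e.g. unpacking a Dictionary<Int32, Utf8> column back to plain Utf8.
fn coerce_batch(batch: &RecordBatch, schema: SchemaRef) -> Result<RecordBatch, ArrowError> {
    let columns = batch
        .columns()
        .iter()
        .zip(schema.fields())
        .map(|(column, field)| cast(column.as_ref(), field.data_type()))
        .collect::<Result<Vec<_>, _>>()?;
    RecordBatch::try_new(schema, columns)
}
```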

@tustvold
Contributor

tustvold commented Dec 9, 2024

This is a DataFusion limitation rather than an Arrow limitation, but DataFusion uses Arrow's RecordBatch.
It would be nice eventually if DataFusion would just require the logical schema to be the same for all batches but allow differences in the physical type.

I think this is the key issue: the schema of the RecordBatch is the physical type. Arrow has no notion of a logical type, nor realistically can it when what that looks like is so use-case specific: are Int32 and Int64 the same logical type? What about differing decimal precisions?

Ultimately, as the schema cannot vary within a single RecordBatch, the onus is on whatever originates the inter-RecordBatch constraint to make a judgement on whether it accepts heterogeneous inputs, and to what degree it does so. The parquet writer, for example, allows for some variability in schema nullability (#4027).

This PR effectively breaks a fairly fundamental invariant of RecordBatch in order to bypass checks in other components that are either necessary, because the component relies on them, or unnecessary and therefore could/should just be removed. To phrase it differently, I can't see any correct usage of this API that isn't just working around an over-zealous constraint in some unrelated system.

@andygrove
Member Author

Thanks for the feedback @alamb and @tustvold. I'll close this PR and will raise an issue in DataFusion about potentially using a different RecordBatch implementation there rather than using Arrow's RecordBatch.

@andygrove andygrove closed this Dec 9, 2024
@andygrove andygrove deleted the record-batch-skip-schema-check branch December 9, 2024 17:56
@tustvold
Contributor

tustvold commented Dec 9, 2024

different RecordBatch implementation there rather than using Arrow's RecordBatch

I think my point would still stand with a custom RecordBatch: the constraint is on whatever is consuming the RecordBatch. Having a RecordBatch lie about its contents seems odd...
