-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-45028: [C++][Compute] Allow cast to reorder struct fields #45246
base: main
Are you sure you want to change the base?
GH-45028: [C++][Compute] Allow cast to reorder struct fields #45246
Conversation
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
or
See also: |
|
7a3a569
to
c9a10fd
Compare
31a3e21
to
9b948ac
Compare
I think I'm happy with my implementation so I'm going to mark this ready for review. I understand that the concept may be slightly controversial though so I'm happy to discuss if anyone disagrees. |
Rationale for this change
When reading a parquet dataset where the physical schema has inconsistent column order for top level columns Arrow can still read the table. However it cannot handle similar inconsistency in the order of struct fields and raises errors like
This issue is quite closely related to #44555
What changes are included in this PR?
Change the implementation of
CastStruct::Exec
to be primarily based on the column names rather than the column order. Each input field can still only be used once and if there are many input fields with the same name they will be used in the order of the input fields.Alternatives I considered:
Implement this behaviour in the same place as the equivalent logic for top level columns at
arrow/cpp/src/arrow/compute/expression.cc
Line 669 in 6252e9c
I decided against this because I want this behaviour to work recursively e.g. if there are nested structs or structs inside arrays of maps, etc.
Have a config option to switch between field name and field order based matching. This would make things more explicit but there would be 2 code paths to maintain instead of one.
IMO the logic I've implemented where each input can only be used once and column order is maintained for duplicate names achieves what I want without breaking any usecases that rely on column order and without too much complexity. So I decided a config option was not necessary.
Are these changes tested?
Yes. A few new assertions were added but mostly it was a case of adjusting the expected behaviour on existing tests.
Are there any user-facing changes?
Yes. Casts that require changing the struct field order will now succeed without error.