Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: use physical name for column name lookup in partitions #1836

Merged
merged 37 commits into from
Nov 24, 2023

Conversation

aersam
Copy link
Contributor

@aersam aersam commented Nov 10, 2023

Description

get_actions wrongly assumes that partition_columns from schema and partitionValues from log must be the same. This is not true since partition_columns are logical column names while partitionValues are physical column names.

Tests pending

Related Issue(s)

Documentation

https://github.com/delta-io/delta/blob/master/PROTOCOL.md#writer-requirements-for-column-mapping
"Track partition values and column level statistics with the physical name of the column in the transaction log."

@github-actions github-actions bot added binding/rust Issues for the Rust crate crate/core labels Nov 10, 2023
Copy link

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@aersam aersam changed the title Fix https://github.com/delta-io/delta-rs/issues/1835 fix: use physical name for column name lookup in partitions Nov 10, 2023
@aersam
Copy link
Contributor Author

aersam commented Nov 10, 2023

seems to be compiling in CI/CD.

Where would you propose to add a test for this?

Would also have to check if the issue exists with statistics as well

@aersam aersam marked this pull request as ready for review November 10, 2023 14:40
Copy link
Collaborator

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For testing, I think the tests belong here:

crates/deltalake-core/src/table/state_arrow.rs Outdated Show resolved Hide resolved
crates/deltalake-core/src/kernel/schema.rs Outdated Show resolved Hide resolved
@aersam
Copy link
Contributor Author

aersam commented Nov 13, 2023

For testing, I think the tests belong here:

Ok, I'll add some tests. How am I supposed to execute those locally? Always get strange compile errors locally. I'm using Windows

@roeap
Copy link
Collaborator

roeap commented Nov 13, 2023

Always get strange compile errors locally. I'm using Windows

There is currently a build issue on main which may be related. Could you try with the datafusion flag enabled?

@aersam
Copy link
Contributor Author

aersam commented Nov 13, 2023

Always get strange compile errors locally. I'm using Windows

There is currently a build issue on main which may be related. Could you try with the datafusion flag enabled?

Does not help, however in WSL it works just with cargo test. Tests seem to be passing now, so if you want to review ;)

@aersam
Copy link
Contributor Author

aersam commented Nov 13, 2023

I don't think the build error is related to my changes? At least I would not understand it 🙂

Copy link
Collaborator

@roeap roeap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aersam for getting us started on column mapping!

Left a comment around error handling, but overall looking good.

One thing I noticed is that there is a lot going on in the example data log. Would it maybe be an option to set up the column mapping on creating the table, which makes it a bit easier to reason about the relevant commit.

crates/deltalake-core/src/kernel/schema.rs Outdated Show resolved Hide resolved
crates/deltalake-core/src/table/state_arrow.rs Outdated Show resolved Hide resolved
@aersam
Copy link
Contributor Author

aersam commented Nov 14, 2023

Thanks @aersam for getting us started on column mapping!

Left a comment around error handling, but overall looking good.

One thing I noticed is that there is a lot going on in the example data log. Would it maybe be an option to set up the column mapping on creating the table, which makes it a bit easier to reason about the relevant commit.

I now simplified the test file, it's a lot less files now.
So it's ready for the next review :)

@aersam aersam requested a review from roeap November 14, 2023 07:36
@aersam
Copy link
Contributor Author

aersam commented Nov 14, 2023

Always get strange compile errors locally. I'm using Windows

There is currently a build issue on main which may be related. Could you try with the datafusion flag enabled?

One comment here: the error disappeared on Windows when doing cargo update

Copy link
Collaborator

@roeap roeap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the log clean-up - much easier to read now 👍.

Clarified a bit on why to remove the panic and handle via a Result.

crates/deltalake-core/src/kernel/schema.rs Outdated Show resolved Hide resolved
@aersam
Copy link
Contributor Author

aersam commented Nov 15, 2023

Thanks for the log clean-up - much easier to read now 👍.

Clarified a bit on why to remove the panic and handle via a Result.

In addition to a lot more error handling, I also added a check for delta.columnMapping.mode which matches the spec. If column mapping mode=none, we should ignore physical name entirely

@aersam aersam requested a review from roeap November 15, 2023 09:28
Copy link
Collaborator

@roeap roeap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking really good!

One more ask is to handle the column mapping analogous to the rest of the table configuration.

crates/deltalake-core/src/table/mod.rs Outdated Show resolved Hide resolved
@aersam aersam requested a review from roeap November 21, 2023 09:01
Copy link
Collaborator

@roeap roeap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sticking with it and your patience @aersam!

@roeap roeap merged commit ba043d6 into delta-io:main Nov 24, 2023
21 checks passed
@aersam
Copy link
Contributor Author

aersam commented Nov 24, 2023

Thanks for sticking with it and your patience @aersam!

Thanks for your guidance and for merging it!

ion-elgreco pushed a commit to ion-elgreco/delta-rs that referenced this pull request Nov 25, 2023
…#1836)

# Description
get_actions wrongly assumes that partition_columns from schema and
partitionValues from log must be the same. This is not true since
partition_columns are logical column names while partitionValues are
physical column names.

Tests pending

# Related Issue(s)

- closes delta-io#1835

# Documentation

https://github.com/delta-io/delta/blob/master/PROTOCOL.md#writer-requirements-for-column-mapping
"Track partition values and column level statistics with the physical
name of the column in the transaction log."

---------

Co-authored-by: Will Jones <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate crate/core
Projects
None yet
Development

Successfully merging this pull request may close these issues.

get_add_actions fails with deltalake 0.13
3 participants