fix: Reading a table with positional deletes should fail #826
Conversation
Force-pushed from a5570ea to fe2c359 (compare)
Force-pushed from fe2c359 to 87bb216 (compare)
spark = (
    SparkSession
    .builder
    .config("spark.sql.shuffle.partitions", "1")
This is important, otherwise we get many small Parquet files with a single row. When a positional delete hits a Parquet file with only one row, the file gets dropped entirely instead of producing a merge-on-read delete file.
nit: add this comment alongside the code itself
Also curious why it's needed here; we use the same logic in pyiceberg: https://github.com/apache/iceberg-python/blob/1278e8880c4767287dc69208ced20bd444c37228/dev/provision.py#L25
I recall that I ran into the same thing with PyIceberg, but not sure how I fixed it there. Maybe good to add it there as well. Having a lot of partitions doesn't make any sense anyway, and will slow down the tests.
yep! 2 birds 1 stone :)
+1 for adding this comment alongside the code.
I forgot to git add the comment 🤦 It is there now.
Force-pushed from 87e111d to bac9253 (compare)
Force-pushed from bac9253 to 1ee086c (compare)
Great catch! Added a few nit comments
Thanks @Fokko for this fix! Just left some nits, otherwise LGTM!
Thank you @Fokko for getting this fixed, and also thank you @kevinjqliu and @liurenjie1024 for the review.
* A table with positional deletes should fail
* Add possible fix
* Comment and refactor
Here we fail: iceberg-rust/crates/iceberg/src/scan.rs, lines 449 to 454 (at 74a85e7).
But then the manifest is already filtered: iceberg-rust/crates/iceberg/src/scan.rs, line 625 (at 74a85e7).
So that check is unreachable, and it allows invalid data to be returned. Instead, we should raise an error.
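To make the intent concrete, here is a rough sketch of the idea. This is not the actual scan.rs code: the types below are simplified stand-ins for the crate's manifest-entry structures, and the function name and error message are made up. The point is only that a manifest entry carrying a delete file should produce an error during scan planning, rather than being filtered out before any check can run.

// Sketch only: simplified stand-ins for the crate's real manifest types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataContentType {
    Data,
    PositionDeletes,
    EqualityDeletes,
}

struct ManifestEntry {
    content: DataContentType,
    file_path: String,
}

// Reject manifest entries that carry delete files: scanning past them would
// silently return rows that should have been deleted.
fn check_delete_support(entry: &ManifestEntry) -> Result<(), String> {
    match entry.content {
        DataContentType::Data => Ok(()),
        DataContentType::PositionDeletes | DataContentType::EqualityDeletes => Err(format!(
            "delete file {} is not supported yet, refusing to return incorrect results",
            entry.file_path
        )),
    }
}

fn main() {
    let entry = ManifestEntry {
        content: DataContentType::PositionDeletes,
        file_path: "s3://warehouse/db/tbl/data/delete-0001.parquet".to_string(),
    };
    // With the check in the planning path (before entries are filtered away),
    // the scan fails loudly instead of dropping the delete file and returning stale rows.
    assert!(check_delete_support(&entry).is_err());
}

The real change lives in the scan planning code in crates/iceberg/src/scan.rs; the sketch just shows the shape of the decision being made there.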
As always, don't hold back on my rust code :)