fix: Reading a table with positional deletes should fail #826

Fokko · 2024-12-19T17:16:11Z

Here we fail:

Lines 449 to 454 in 74a85e7

    
    if manifest_entry_context.manifest_entry.content_type() != DataContentType::Data {
        return Err(Error::new(
            ErrorKind::FeatureUnsupported,
            "Only Data files currently supported",
        ));
    }

But then the manifest is already filtered:

iceberg-rust/crates/iceberg/src/scan.rs

Line 625 in 74a85e7

.filter(|manifest_file| manifest_file.content == ManifestContentType::Data);

So the check is unreachable: delete manifests are dropped before it can run, which lets a scan of a table with positional deletes return invalid data. Instead, we should raise an error.
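To make the problem concrete, here is a minimal, self-contained sketch. It is not the actual iceberg-rust code: the `ManifestFile` struct and the `plan_silently` and `plan_strict` functions are hypothetical stand-ins, and the strict variant is only one possible shape of a fix.

```rust
// Hypothetical stand-ins for the real iceberg-rust types; only the shape matters.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ManifestContentType {
    Data,
    Deletes,
}

#[derive(Debug, Clone)]
struct ManifestFile {
    path: String,
    content: ManifestContentType,
}

/// Simplified version of the behaviour described above: delete manifests are
/// filtered out up front, so a later "only data files" check can never fire
/// and the deletes are silently ignored.
fn plan_silently(manifests: &[ManifestFile]) -> Vec<&ManifestFile> {
    manifests
        .iter()
        .filter(|m| m.content == ManifestContentType::Data)
        .collect()
}

/// One possible alternative (an assumption, not the merged code): fail the
/// scan as soon as any non-data manifest is present.
fn plan_strict(manifests: &[ManifestFile]) -> Result<Vec<&ManifestFile>, String> {
    if let Some(m) = manifests
        .iter()
        .find(|m| m.content != ManifestContentType::Data)
    {
        return Err(format!(
            "Only Data files currently supported, found {:?} manifest: {}",
            m.content, m.path
        ));
    }
    Ok(manifests.iter().collect())
}
```

Given one data manifest and one delete manifest, `plan_silently` returns just the data manifest without any warning, while `plan_strict` returns an `Err`, which is the behaviour this PR argues for.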

As always, don't hold back on my rust code :)

Fokko · 2024-12-19T18:10:25Z

crates/integration_tests/testdata/spark/provision.py

+spark = (
+    SparkSession
+        .builder
+        .config("spark.sql.shuffle.partitions", "1")


This is important, otherwise we get many small Parquet files with a single row each. When a positional delete hits a Parquet file that holds only one row, the file gets dropped entirely instead of getting a merge-on-read delete file.
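A toy model of that reasoning, as a sketch only: it does not reproduce how Spark or Iceberg actually write files. The `DataFile` and `DeleteOutcome` types and the round-robin distribution in `write_files` are assumptions made for illustration (Spark partitions by hash, not round-robin).

```rust
/// Toy data file: just the row ids it holds.
#[derive(Debug, Clone, PartialEq)]
struct DataFile {
    rows: Vec<u32>,
}

/// What deleting a single row does in this toy model.
#[derive(Debug, PartialEq)]
enum DeleteOutcome {
    /// The file held only that row, so the whole file is dropped.
    FileDropped,
    /// The file keeps its rows; a positional delete (file index, row position) is recorded.
    PositionalDelete { file: usize, pos: usize },
    /// No file contains the row.
    NotFound,
}

/// Spread `rows` round-robin over `partitions` files and skip empty files.
fn write_files(rows: &[u32], partitions: usize) -> Vec<DataFile> {
    assert!(partitions > 0, "need at least one partition");
    let mut files = vec![DataFile { rows: Vec::new() }; partitions];
    for (i, row) in rows.iter().enumerate() {
        files[i % partitions].rows.push(*row);
    }
    files.into_iter().filter(|f| !f.rows.is_empty()).collect()
}

/// Delete one row and report whether a positional delete would be produced.
fn delete_row(files: &[DataFile], row: u32) -> DeleteOutcome {
    for (file, f) in files.iter().enumerate() {
        if let Some(pos) = f.rows.iter().position(|r| *r == row) {
            return if f.rows.len() == 1 {
                DeleteOutcome::FileDropped
            } else {
                DeleteOutcome::PositionalDelete { file, pos }
            };
        }
    }
    DeleteOutcome::NotFound
}
```

With three rows and 200 partitions (Spark's default for `spark.sql.shuffle.partitions`), every file holds one row and `delete_row` yields `FileDropped`, so the test table never contains a positional delete. With one partition the same delete yields `PositionalDelete`, which is what the integration test needs.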

nit: add this comment alongside the code itself

Also curious why it's needed here; we use the same logic in PyIceberg: https://github.com/apache/iceberg-python/blob/1278e8880c4767287dc69208ced20bd444c37228/dev/provision.py#L25

I recall that I ran into the same thing with PyIceberg, but not sure how I fixed it there. Maybe good to add it there as well. Having a lot of partitions doesn't make any sense anyway, and will slow down the tests.

yep! 2 birds 1 stone :)

apache/iceberg-python#1417 (comment)

+1 for adding this comment along code.

I forgot to git add the comment 🤦 It is there now

kevinjqliu

Great catch! Added a few nit comments

kevinjqliu · 2024-12-19T19:34:29Z

crates/integration_tests/testdata/spark/provision.py

+spark = (
+    SparkSession
+        .builder
+        .config("spark.sql.shuffle.partitions", "1")


nit: add this comment alongside the code itself

Also curious why it's needed here; we use the same logic in PyIceberg: https://github.com/apache/iceberg-python/blob/1278e8880c4767287dc69208ced20bd444c37228/dev/provision.py#L25

crates/iceberg/src/scan.rs

liurenjie1024

Thanks @Fokko for this fix! Just left some nits, others LGTM!

crates/iceberg/src/scan.rs

liurenjie1024 · 2024-12-20T02:36:17Z

crates/integration_tests/testdata/spark/provision.py

+spark = (
+    SparkSession
+        .builder
+        .config("spark.sql.shuffle.partitions", "1")


+1 for adding this comment along code.

Xuanwo

Thank you @Fokko for getting this fixed, and thank you @kevinjqliu and @liurenjie1024 for the review.

* A table with positional deletes shoulds fail
* Add possible fix
* Comment and refactor

Fokko force-pushed the fd-positional-deletes branch from a5570ea to fe2c359 Compare December 19, 2024 17:18

A table with positional deletes shoulds fail

87bb216

Fokko force-pushed the fd-positional-deletes branch from fe2c359 to 87bb216 Compare December 19, 2024 17:36

Fokko commented Dec 19, 2024

Fokko force-pushed the fd-positional-deletes branch 6 times, most recently from 87e111d to bac9253 Compare December 19, 2024 18:40

Add possible fix

1ee086c

Fokko force-pushed the fd-positional-deletes branch from bac9253 to 1ee086c Compare December 19, 2024 18:43

kevinjqliu previously approved these changes Dec 19, 2024

liurenjie1024 previously approved these changes Dec 20, 2024

Comment and refactor

fee5abf

Fokko dismissed stale reviews from liurenjie1024 and kevinjqliu via fee5abf December 20, 2024 07:10

Xuanwo approved these changes Dec 20, 2024

Xuanwo changed the title ~~Reading a table with positional deletes should fail~~ fix: Reading a table with positional deletes should fail Dec 20, 2024

Xuanwo marked this pull request as ready for review December 20, 2024 07:21

Xuanwo merged commit 0777fa7 into apache:main Dec 20, 2024
16 checks passed

Xuanwo mentioned this pull request Dec 20, 2024

Tracking issues of iceberg rust v0.4.0 Release #739

Closed

15 tasks

Fokko deleted the fd-positional-deletes branch December 20, 2024 07:48

Fokko added this to the 0.4.0 Release milestone Dec 20, 2024

sungwy pushed a commit that referenced this pull request Dec 20, 2024

fix: Reading a table with positional deletes should fail (#826)

908ad09

* A table with positional deletes shoulds fail
* Add possible fix
* Comment and refactor
