Preserve DV-specific fields for deletion vectors #14351
Conversation
Thanks for pointing me to this PR, @nastra! As I see there is going to be a … Anyway, @adawrapub I think something went sideways with the formatting of the older Spark tests, as there are a lot of lines only changing indentation, and the build fails because of that.
I'd like to echo an observation I wrote on the other PR: the runtime of this test suite already seems long, and introducing test dimensions could make it worse. I'd be careful here and only cover different format versions where it matters and actually adds to the test coverage. Another thing: @adawrapub I saw that some of the tests aren't run with the new params (like testTableWithManyPartitionStatisticFile). If I'm not mistaken, these are the ones that create the test table not via the @BeforeEach functionality but by calling …
Thank you for the feedback, @nastra and @gaborkaszab. I have updated the PR based on your comments.
Code context:

    return TestHelpers.V2_AND_ABOVE;
  }

  @Parameter private int formatVersion = 2;

Suggested change:

-  @Parameter private int formatVersion = 2;
+  @Parameter private int formatVersion;
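For readers following along, here is a minimal sketch of the class-level format-version parameterization this suggestion is part of. It assumes Iceberg's ParameterizedTestExtension and the TestHelpers.V2_AND_ABOVE list shown above; the class and test method names are illustrative, not the PR's actual code:

import java.util.List;
import org.apache.iceberg.Parameter;
import org.apache.iceberg.ParameterizedTestExtension;
import org.apache.iceberg.Parameters;
import org.apache.iceberg.TestHelpers;
import org.junit.jupiter.api.TestTemplate;
import org.junit.jupiter.api.extension.ExtendWith;

// Sketch only: run every test method once per supported format version (v2 and above).
@ExtendWith(ParameterizedTestExtension.class)
public class FormatVersionParameterizationSketch {

  @Parameters(name = "formatVersion = {0}")
  protected static List<Integer> parameters() {
    // assumed to be the list of supported format versions >= 2
    return TestHelpers.V2_AND_ABOVE;
  }

  // no default value: the extension injects each format version in turn
  @Parameter private int formatVersion;

  @TestTemplate
  public void testSomethingPathRelated() {
    // create the table with "format-version" = formatVersion, then exercise the rewrite action
  }
}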
@adawrapub there are still a bunch of tests that have …
I took another look and I'm still hesitant about the testing part. My proposal is to identify the subset of tests where the format version is relevant now, practically the ones having delete files, and add a test param for them, but not for the entire suite. Unless I miss something :)
@gaborkaszab It takes me 1 min 17 secs to run … This is because calls to …

@adawrapub also, just preserving the DVs won't make the test suite pass automatically, because there's another issue that needs to be fixed so that rewriting table paths fully works with v3, which I mentioned in #14226 (comment).

The scope of #13671 is to fix DVs with rewriting table paths, and it might seem ok to only parameterize the test methods that write DVs, but it's possible that we're missing other subtle bugs around rewriting (since we're always doing this with v2 only), hence I'm advocating to parameterize the format version at the class level to make sure we're not missing anything. However, I understand the concern around long-running test suites with a large test matrix, so we still need to keep this in mind and maybe selectively skip tests for certain format versions. @adawrapub can you please apply the following diff? It should make all tests run with v2 and v3 and give you only a handful of failing tests that don't work with DVs.
Thanks for the explanation, @nastra! Sure, let's add the dimension at the suite level! Another takeaway is that I have to check why these run that slow on my MBP :)
@nastra Can you please review again? I've updated based on the feedback.
@adawrapub I'm a little confused about why you don't have failing tests with v3. Did you apply the diff I mentioned fully?
Hello @nastra, thank you for your response and patch. I did apply all your changes, which caused some of the test cases to fail. My understanding is that format v3+ uses the Puffin format for position deletes, and the positionDeletesReader method doesn't support the Puffin format. Hence, in test cases like testPositionDeleteWithRow, testPositionDeletes, and testDeleteFrom, I added assumeThat(formatVersion) to not run for v3 and v4. Let me know if there is a better way to fix this. I did update FileHelpers.writePosDeleteFile and FileHelpers.writeDeleteFile to take the format version.
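As an aside, a minimal sketch of the skip pattern described above, assuming AssertJ assumptions are used at the top of the affected test methods; the exact guard in the PR may differ:

import static org.assertj.core.api.Assumptions.assumeThat;

// Sketch only: skip the test for format v3+, where position deletes are written
// as Puffin deletion vectors that this test's reader cannot read yet.
assumeThat(formatVersion).isLessThan(3);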
Those remaining failing tests are the indicator of the underlying issue that needs to be fixed as part of this PR. We don't want to skip those tests for v3+; we need to make rewriting table paths work for v3+, which is the scope of #13671 and which I also mentioned in #14226 (comment). You need to apply a similar approach to what has been done in #11657 in order to allow reading v3 deletes and thus make those tests pass.
@nastra As I understand it, DV files (Puffin format) store the referenced data file path in two separate locations:
- Manifest metadata: the DeleteFile.referencedDataFile() field
- Puffin blob metadata: the "referenced-data-file" property inside the blob

The original implementation only updated the manifest metadata; the Puffin blob metadata still contained the old path, causing the DV reader to fail when applying deletes at the new location. I have implemented a two-pronged update strategy. Please let me know if there is a better way.

1. Manifest metadata update
- Added a ContentFileUtil.replaceReferencedDataFile() utility method
- Created a RewriteTablePathUtil.newPositionDeleteEntry() helper method
- Updates the referencedDataFile field in manifest entries

2. Puffin content update
- Implemented a RewriteTablePathUtil.rewriteDVFile() method
- Reads Puffin files, updates blob metadata properties, writes new files
- Preserves the bitmap data while updating path references
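To make the manifest-side part (1) concrete, here is a minimal sketch of the prefix swap on DeleteFile.referencedDataFile(); the class, method name, and signature are illustrative, and the PR keeps this logic inside RewriteTablePathUtil:

import org.apache.iceberg.DeleteFile;

// Sketch only: rewrite a DV's referenced data file path when it falls under the
// source prefix. Names are illustrative; the actual PR code may differ.
class ReferencedDataFileRewriteSketch {

  static String replaceReferencedDataFile(
      DeleteFile deleteFile, String sourcePrefix, String targetPrefix) {
    String referenced = deleteFile.referencedDataFile();
    if (referenced == null || !referenced.startsWith(sourcePrefix)) {
      // not a DV, or the referenced data file is already outside the source prefix
      return referenced;
    }
    // swap the leading source prefix for the target prefix, keeping the rest of the path
    return targetPrefix + referenced.substring(sourcePrefix.length());
  }
}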
Code context:

   * @param targetPrefix target prefix that will replace it
   * @return updated referenced data file path, or null if not applicable
   */
  public static String replaceReferencedDataFile(

probably best to keep this internally in RewriteTablePathUtil

@nastra Updated. Please let me know if it looks good to you.
@adawrapub can you resolve conflicts please so that CI runs?
Code context:

   * @param targetPrefix target prefix that will replace it
   * @return updated referenced data file path, or null if not applicable
   */
  private static String replaceReferencedDataFile(

Suggested change:

-  private static String replaceReferencedDataFile(
+  private static String rewriteReferencedDataFilePathForDV(
Code context:

      throws IOException {
    InputFile sourceFile = io.newInputFile(deleteFile.location());

    try (org.apache.iceberg.puffin.PuffinReader reader =

Suggested change:

-    try (org.apache.iceberg.puffin.PuffinReader reader =
+    try (PuffinReader reader =
Code context:

    List<org.apache.iceberg.puffin.BlobMetadata> blobs = reader.fileMetadata().blobs();

    try (org.apache.iceberg.puffin.PuffinWriter writer =

Suggested change:

-    try (org.apache.iceberg.puffin.PuffinWriter writer =
+    try (PuffinWriter writer =
Code context:

    List<org.apache.iceberg.puffin.BlobMetadata> blobs = reader.fileMetadata().blobs();

    try (org.apache.iceberg.puffin.PuffinWriter writer =
        org.apache.iceberg.puffin.Puffin.write(outputFile)

Suggested change:

-        org.apache.iceberg.puffin.Puffin.write(outputFile)
+        Puffin.write(outputFile)
Code context:

      throws IOException {
    InputFile sourceFile = io.newInputFile(deleteFile.location());

    try (org.apache.iceberg.puffin.PuffinReader reader =

I think it would be better to not mix the reader and writer here. First read the blobs, then rewrite them and store them in a list, and only then init the writer to write the new blobs out.

can you please apply the below diff?
private static void rewriteDVFile(
    DeleteFile deleteFile,
    OutputFile outputFile,
    FileIO io,
    String sourcePrefix,
    String targetPrefix)
    throws IOException {
  List<Blob> rewrittenBlobs = Lists.newArrayList();

  try (PuffinReader reader = Puffin.read(io.newInputFile(deleteFile.location())).build()) {
    // Read all blobs and rewrite them with updated referenced data file paths
    for (Pair<org.apache.iceberg.puffin.BlobMetadata, ByteBuffer> blobPair :
        reader.readAll(reader.fileMetadata().blobs())) {
      org.apache.iceberg.puffin.BlobMetadata blobMetadata = blobPair.first();
      ByteBuffer blobData = blobPair.second();

      // Get the original properties and update the referenced data file path
      Map<String, String> properties = Maps.newHashMap(blobMetadata.properties());
      String referencedDataFile = properties.get("referenced-data-file");
      if (referencedDataFile != null && referencedDataFile.startsWith(sourcePrefix)) {
        String newReferencedDataFile = newPath(referencedDataFile, sourcePrefix, targetPrefix);
        properties.put("referenced-data-file", newReferencedDataFile);
      }

      // Create a new blob with updated properties
      rewrittenBlobs.add(
          new Blob(
              blobMetadata.type(),
              blobMetadata.inputFields(),
              blobMetadata.snapshotId(),
              blobMetadata.sequenceNumber(),
              blobData,
              PuffinCompressionCodec.forName(blobMetadata.compressionCodec()),
              properties));
    }
  }

  try (PuffinWriter writer =
      Puffin.write(outputFile).createdBy(IcebergBuild.fullVersion()).build()) {
    rewrittenBlobs.forEach(writer::write);
  }
}
Code context:

              blobMetadata.snapshotId(),
              blobMetadata.sequenceNumber(),
              blobData,
              null, // compression codec (keep uncompressed)

Suggested change:

-              null, // compression codec (keep uncompressed)
+              PuffinCompressionCodec.forName(blobMetadata.compressionCodec()),
This fixes the problem reported in #13671 by preserving DV-specific fields. All PR comments have been addressed.

The original PR got closed because of issues with the branch. Please let me know if there is a better way to handle this.

cc @anuragmantri @nastra and @amogh-jahagirdar