
Conversation

@adawrapub

This fixes the problem reported in #13671 by preserving DV-specific fields. It addresses all previous PR comments.

The original PR got closed because of issues with the branch. Please let me know if there is a better way to handle this.
cc @anuragmantri, @nastra, and @amogh-jahagirdar

@gaborkaszab
Collaborator

gaborkaszab commented Oct 22, 2025

Thanks for pointing me to this PR, @nastra!

As I see it, there is going to be a format-version test dimension for the entire test suite. I'm going to introduce another dimension for the file formats in a similar way, so I'm not sure how much interference there will be between this PR and #14387. Whichever is merged first, the other will have to resolve git conflicts, and there is also some overlapping work, which isn't ideal.

Anyway, @adawrapub I think something went sideways with the formatting of the older Spark tests: there are a lot of lines with indentation changes, and the build fails because of that.

@gaborkaszab
Collaborator

I'd like to echo an observation I wrote on the other PR: the runtime of this test suite is already long, and introducing test dimensions could make it worse. I'd be careful here and only cover different format versions where it matters and actually adds to the test coverage.

Another thing: @adawrapub I saw that some of the tests aren't run with the new params (like testTableWithManyPartitionStatisticFile). If I'm not mistaken, these are the ones that create the test table not via the @BeforeEach setup but by calling createMetastoreTable.

  • Just noting that these tests won't pick up the format-version param.
  • Another observation: when I ran them parameterized, I found that they aren't idempotent. The run for the first param succeeded, but the subsequent runs failed. Probably the cleanup after the first run also removed something needed by the later runs, because when I changed the table name to be unique between runs (e.g. by adding the test param value to the name), all the runs succeeded (see the sketch below).
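
For illustration, a minimal sketch of that uniqueness fix, assuming the suite's existing createMetastoreTable helper and the formatVersion parameter field; the names are illustrative rather than this PR's final code:

  @TestTemplate
  public void testTableWithManyPartitionStatisticFile() throws IOException {
    // Unique per parameter value ("v2tblwithPartStats", "v3tblwithPartStats", ...) so that
    // cleanup from one parameterized run cannot remove the table needed by the next run.
    String tableName = String.format("v%dtblwithPartStats", formatVersion);
    Map<String, String> properties =
        ImmutableMap.of(TableProperties.FORMAT_VERSION, String.valueOf(formatVersion));
    Table sourceTable =
        createMetastoreTable(newTableLocation(), properties, "default", tableName, 0, "c1");
    // ... rest of the test body unchanged ...
  }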

@adawrapub
Author

Thank you for the feedback @nastra and @gaborkaszab. I have updated the PR based on your comments.
Regarding the runtime of the test suite: TestRewriteTablePathsAction currently takes around 2 min to complete on my machine. If that still seems too high, we could split this suite into multiple classes, one class per format version (rough sketch below). Please let me know what you think.
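
For reference, a rough sketch of what that split might look like, assuming a shared abstract base class; this is only an illustration of the alternative, not code from this PR:

  // Hypothetical layout: one concrete test class per format version, sharing the test
  // methods through an abstract base instead of a @Parameters dimension.
  abstract class TestRewriteTablePathsActionBase extends TestBase {
    protected abstract int formatVersion();
    // shared test methods call formatVersion() when creating tables
  }

  class TestRewriteTablePathsActionV2 extends TestRewriteTablePathsActionBase {
    @Override
    protected int formatVersion() {
      return 2;
    }
  }

  class TestRewriteTablePathsActionV3 extends TestRewriteTablePathsActionBase {
    @Override
    protected int formatVersion() {
      return 3;
    }
  }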

return TestHelpers.V2_AND_ABOVE;
}

@Parameter private int formatVersion = 2;
Contributor


Suggested change
- @Parameter private int formatVersion = 2;
+ @Parameter private int formatVersion;

@nastra
Contributor

nastra commented Oct 23, 2025

@adawrapub there are still a bunch of tests that use @Test instead of @TestTemplate. Also, those tests hardcode the format-version, so please update them to use:

-    Map<String, String> properties = Maps.newHashMap();
-    properties.put("format-version", "2");
+    Map<String, String> properties =
+        ImmutableMap.of(TableProperties.FORMAT_VERSION, String.valueOf(formatVersion));

@gaborkaszab
Collaborator

I took another look and I'm still hesitant about the testing part:

  1. I did an experiment where I removed the changes from FileMetadata, and apparently the tests still succeed. In other words, there is no test exercising the new code changes, even with the new format-version test dimension added.
  2. This PR is about a fix for path rewriting for DVs, and introducing a new suite-level dimension seems like overkill to me. I may be missing something, but I think format-version currently only matters for V2 vs V3 positional deletes in terms of path rewriting. However, only a small fraction of the tests actually have delete files (not to mention that these are the ones that currently don't run with the new test dimension, so they only test V2), so for most of the tests this new dimension doesn't add any extra value.
    Not sure what machines people use for development, but for me this suite originally runs in 4min55sec, and with the new dimension it grows to 12min17sec, hence I'd be careful about which dimensions we add.
    If we say this new parameter is relevant for V4 with the new metadata structure, I get that, but that's out of scope for this PR in my opinion.

My proposal is to identify the subset of tests where the format version is relevant right now, practically the ones that have delete files, and add the test param for them, but not for the entire suite. Unless I'm missing something :)

@nastra
Contributor

nastra commented Oct 23, 2025

@gaborkaszab It takes me 1 min 17 secs to run TestRewriteTablePathsAction on an MBP, so 4 min 55 secs seems super long to me for one dimension of testing.

In other words there is no test exercising the new code changes even with adding the new test dimension for format-version

This is because calls to FileHelpers.writePosDeleteFile / FileHelpers.writeDeleteFile have not been adjusted to pass the formatVersion in. @adawrapub can you please fix that?

@adawrapub also, just preserving the DVs won't make the test suite pass automatically, because there's another issue that needs to be fixed so that rewriting table paths fully works with v3, which I mentioned in #14226 (comment).

My proposal is to identify the subset of tests where the format version is relevant now, practically the ones having delete files, and add test param for them, but not for the entire suite. Unless I miss something

The scope of #13671 is to fix DVs with rewriting table paths, and it might seem ok to only parameterize the test methods that write DVs, but it's possible that we're missing other subtle bugs around rewriting (since we're always doing this with v2 only). Hence I'm advocating for parameterizing the format version at the class level, to make sure we're not missing anything. However, I understand the concern around long-running test suites with a large test matrix, so we still need to keep this in mind and maybe selectively skip tests for certain format versions.

@adawrapub can you please apply the following diff?

diff --git a/spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteTablePathsAction.java b/spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteTablePathsAction.java
index 8399ae5d52..be3e8b569e 100644
--- a/spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteTablePathsAction.java
+++ b/spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteTablePathsAction.java
@@ -22,6 +22,7 @@ import static org.apache.iceberg.spark.actions.RewriteTablePathSparkAction.NOT_A
 import static org.apache.iceberg.types.Types.NestedField.optional;
 import static org.assertj.core.api.Assertions.assertThat;
 import static org.assertj.core.api.Assertions.assertThatThrownBy;
+import static org.assertj.core.api.Assumptions.assumeThat;
 
 import java.io.File;
 import java.io.IOException;
@@ -39,6 +40,9 @@ import org.apache.iceberg.BaseTable;
 import org.apache.iceberg.DataFile;
 import org.apache.iceberg.DeleteFile;
 import org.apache.iceberg.HasTableOperations;
+import org.apache.iceberg.Parameter;
+import org.apache.iceberg.ParameterizedTestExtension;
+import org.apache.iceberg.Parameters;
 import org.apache.iceberg.PartitionSpec;
 import org.apache.iceberg.Schema;
 import org.apache.iceberg.Snapshot;
@@ -59,6 +63,7 @@ import org.apache.iceberg.deletes.PositionDelete;
 import org.apache.iceberg.hadoop.HadoopTables;
 import org.apache.iceberg.io.FileIO;
 import org.apache.iceberg.io.OutputFile;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
 import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
 import org.apache.iceberg.relocated.com.google.common.collect.Lists;
 import org.apache.iceberg.relocated.com.google.common.collect.Maps;
@@ -80,10 +85,12 @@ import org.apache.spark.storage.BlockManager;
 import org.apache.spark.storage.BroadcastBlockId;
 import org.junit.jupiter.api.AfterEach;
 import org.junit.jupiter.api.BeforeEach;
-import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.TestTemplate;
+import org.junit.jupiter.api.extension.ExtendWith;
 import org.junit.jupiter.api.io.TempDir;
 import scala.Tuple2;
 
+@ExtendWith(ParameterizedTestExtension.class)
 public class TestRewriteTablePathsAction extends TestBase {
 
   @TempDir private Path staging;
@@ -91,6 +98,13 @@ public class TestRewriteTablePathsAction extends TestBase {
   @TempDir private Path newTableDir;
   @TempDir private Path targetTableDir;
 
+  @Parameters(name = "formatVersion = {0}")
+  protected static List<Integer> formatVersions() {
+    return TestHelpers.V2_AND_ABOVE;
+  }
+
+  @Parameter private int formatVersion;
+
   protected ActionsProvider actions() {
     return SparkActions.get();
   }
@@ -135,7 +149,15 @@ public class TestRewriteTablePathsAction extends TestBase {
 
   private Table createTableWithSnapshots(
       String location, int snapshotNumber, Map<String, String> properties, String mode) {
-    Table newTable = TABLES.create(SCHEMA, PartitionSpec.unpartitioned(), properties, location);
+    Table newTable =
+        TABLES.create(
+            SCHEMA,
+            PartitionSpec.unpartitioned(),
+            ImmutableMap.<String, String>builder()
+                .put(TableProperties.FORMAT_VERSION, String.valueOf(formatVersion))
+                .putAll(properties)
+                .build(),
+            location);
 
     List<ThreeColumnRecord> records =
         Lists.newArrayList(new ThreeColumnRecord(1, "AAAAAAAAAA", "AAAA"));
@@ -159,7 +181,7 @@ public class TestRewriteTablePathsAction extends TestBase {
     sql("DROP DATABASE IF EXISTS %s CASCADE", backupNs);
   }
 
-  @Test
+  @TestTemplate
   public void testRewritePath() throws Exception {
     String targetTableLocation = targetTableLocation();
 
@@ -207,7 +229,7 @@ public class TestRewriteTablePathsAction extends TestBase {
     assertEquals("Rows should match after copy", expected, actual);
   }
 
-  @Test
+  @TestTemplate
   public void testSameLocations() {
     assertThatThrownBy(
             () ->
@@ -220,7 +242,7 @@ public class TestRewriteTablePathsAction extends TestBase {
         .hasMessageContaining("Source prefix cannot be the same as target prefix");
   }
 
-  @Test
+  @TestTemplate
   public void testStartVersion() throws Exception {
     RewriteTablePath.Result result =
         actions()
@@ -244,7 +266,7 @@ public class TestRewriteTablePathsAction extends TestBase {
         .isEmpty();
   }
 
-  @Test
+  @TestTemplate
   public void testIncrementalRewrite() throws Exception {
     String location = newTableLocation();
     Table sourceTable =
@@ -291,7 +313,7 @@ public class TestRewriteTablePathsAction extends TestBase {
     assertEquals("Rows should match after copy", expected, actual);
   }
 
-  @Test
+  @TestTemplate
   public void testTableWith3Snapshots(@TempDir Path location1, @TempDir Path location2)
       throws Exception {
     String location = newTableLocation();
@@ -316,7 +338,7 @@ public class TestRewriteTablePathsAction extends TestBase {
     checkFileNum(3, 3, 3, 12, result1);
   }
 
-  @Test
+  @TestTemplate
   public void testFullTableRewritePath() throws Exception {
     RewriteTablePath.Result result =
         actions()
@@ -327,7 +349,7 @@ public class TestRewriteTablePathsAction extends TestBase {
     checkFileNum(3, 2, 2, 9, result);
   }
 
-  @Test
+  @TestTemplate
   public void testManifestRewriteAndIncrementalCopy() throws Exception {
     RewriteTablePath.Result initialResult =
         actions()
@@ -354,7 +376,7 @@ public class TestRewriteTablePathsAction extends TestBase {
     checkFileNum(1, 1, addedManifest, 3, postReweiteResult);
   }
 
-  @Test
+  @TestTemplate
   public void testDeleteDataFile() throws Exception {
     List<String> validDataFiles =
         spark
@@ -386,7 +408,7 @@ public class TestRewriteTablePathsAction extends TestBase {
         .hasSize(1);
   }
 
-  @Test
+  @TestTemplate
   public void testPositionDeletes() throws Exception {
     List<Pair<CharSequence, Long>> deletes =
         Lists.newArrayList(
@@ -397,7 +419,7 @@ public class TestRewriteTablePathsAction extends TestBase {
     File file = new File(removePrefix(table.location() + "/data/deeply/nested/deletes.parquet"));
     DeleteFile positionDeletes =
         FileHelpers.writeDeleteFile(
-                table, table.io().newOutputFile(file.toURI().toString()), deletes)
+                table, table.io().newOutputFile(file.toURI().toString()), deletes, formatVersion)
             .first();
 
     table.newRowDelta().addDeletes(positionDeletes).commit();
@@ -423,7 +445,7 @@ public class TestRewriteTablePathsAction extends TestBase {
         .hasSize(1);
   }
 
-  @Test
+  @TestTemplate
   public void testPositionDeleteWithRow() throws Exception {
     String dataFileLocation =
         table.currentSnapshot().addedDataFiles(table.io()).iterator().next().location();
@@ -436,7 +458,8 @@ public class TestRewriteTablePathsAction extends TestBase {
                     .toURI()
                     .toString());
     deletes.add(positionDelete(SCHEMA, dataFileLocation, 0L, 1, "AAAAAAAAAA", "AAAA"));
-    DeleteFile positionDeletes = FileHelpers.writePosDeleteFile(table, deleteFile, null, deletes);
+    DeleteFile positionDeletes =
+        FileHelpers.writePosDeleteFile(table, deleteFile, null, deletes, formatVersion);
     table.newRowDelta().addDeletes(positionDeletes).commit();
 
     assertThat(spark.read().format("iceberg").load(table.location()).collectAsList()).hasSize(1);
@@ -465,8 +488,11 @@ public class TestRewriteTablePathsAction extends TestBase {
         .hasSize(1);
   }
 
-  @Test
+  @TestTemplate
   public void testPositionDeletesAcrossFiles() throws Exception {
+    assumeThat(formatVersion)
+        .as("Can't write multiple deletes into a single v3 delete file")
+        .isEqualTo(2);
     Stream<DataFile> allFiles =
         StreamSupport.stream(table.snapshots().spliterator(), false)
             .flatMap(s -> StreamSupport.stream(s.addedDataFiles(table.io()).spliterator(), false));
@@ -479,7 +505,7 @@ public class TestRewriteTablePathsAction extends TestBase {
     File file = new File(removePrefix(table.location() + "/data/deeply/nested/file.parquet"));
     DeleteFile positionDeletes =
         FileHelpers.writeDeleteFile(
-                table, table.io().newOutputFile(file.toURI().toString()), deletes)
+                table, table.io().newOutputFile(file.toURI().toString()), deletes, formatVersion)
             .first();
 
     table.newRowDelta().addDeletes(positionDeletes).commit();
@@ -504,7 +530,7 @@ public class TestRewriteTablePathsAction extends TestBase {
         .isEmpty();
   }
 
-  @Test
+  @TestTemplate
   public void testEqualityDeletes() throws Exception {
     Table sourceTable = createTableWithSnapshots(newTableLocation(), 1);
 
@@ -559,7 +585,7 @@ public class TestRewriteTablePathsAction extends TestBase {
         .hasSize(2);
   }
 
-  @Test
+  @TestTemplate
   public void testFullTableRewritePathWithDeletedVersionFiles() throws Exception {
     String location = newTableLocation();
     Table sourceTable = createTableWithSnapshots(location, 2);
@@ -605,7 +631,7 @@ public class TestRewriteTablePathsAction extends TestBase {
         result);
   }
 
-  @Test
+  @TestTemplate
   public void testRewritePathWithoutSnapshot() throws Exception {
     RewriteTablePath.Result result =
         actions()
@@ -618,7 +644,7 @@ public class TestRewriteTablePathsAction extends TestBase {
     checkFileNum(1, 0, 0, 1, result);
   }
 
-  @Test
+  @TestTemplate
   public void testExpireSnapshotBeforeRewrite() throws Exception {
     // expire one snapshot
     actions().expireSnapshots(table).expireSnapshotId(table.currentSnapshot().parentId()).execute();
@@ -633,7 +659,7 @@ public class TestRewriteTablePathsAction extends TestBase {
     checkFileNum(4, 1, 2, 9, result);
   }
 
-  @Test
+  @TestTemplate
   public void testRewritePathWithNonLiveEntry() throws Exception {
     String location = newTableLocation();
     // first overwrite generate 1 manifest and 1 data file
@@ -698,7 +724,7 @@ public class TestRewriteTablePathsAction extends TestBase {
     assertThat(copiedEntries).contains(deletedDataFilePathInTargetLocation);
   }
 
-  @Test
+  @TestTemplate
   public void testStartSnapshotWithoutValidSnapshot() throws Exception {
     // expire one snapshot
     actions().expireSnapshots(table).expireSnapshotId(table.currentSnapshot().parentId()).execute();
@@ -717,7 +743,7 @@ public class TestRewriteTablePathsAction extends TestBase {
     checkFileNum(2, 1, 1, 5, result);
   }
 
-  @Test
+  @TestTemplate
   public void testMoveTheVersionExpireSnapshot() throws Exception {
     // expire one snapshot
     actions().expireSnapshots(table).expireSnapshotId(table.currentSnapshot().parentId()).execute();
@@ -735,7 +761,7 @@ public class TestRewriteTablePathsAction extends TestBase {
     checkFileNum(1, 0, 0, 1, result);
   }
 
-  @Test
+  @TestTemplate
   public void testMoveVersionWithInvalidSnapshots() {
     // expire one snapshot
     actions().expireSnapshots(table).expireSnapshotId(table.currentSnapshot().parentId()).execute();
@@ -754,7 +780,7 @@ public class TestRewriteTablePathsAction extends TestBase {
                 + "Please choose an earlier version without invalid snapshots.");
   }
 
-  @Test
+  @TestTemplate
   public void testRollBack() throws Exception {
     long secondSnapshotId = table.currentSnapshot().snapshotId();
 
@@ -781,7 +807,7 @@ public class TestRewriteTablePathsAction extends TestBase {
     checkFileNum(6, 3, 3, 15, result);
   }
 
-  @Test
+  @TestTemplate
   public void testWriteAuditPublish() throws Exception {
     // enable WAP
     table.updateProperties().set(TableProperties.WRITE_AUDIT_PUBLISH_ENABLED, "true").commit();
@@ -806,7 +832,7 @@ public class TestRewriteTablePathsAction extends TestBase {
     checkFileNum(5, 3, 3, 14, result);
   }
 
-  @Test
+  @TestTemplate
   public void testSchemaChange() throws Exception {
     // change the schema
     table.updateSchema().addColumn("c4", Types.StringType.get()).commit();
@@ -823,7 +849,7 @@ public class TestRewriteTablePathsAction extends TestBase {
     checkFileNum(4, 2, 2, 10, result);
   }
 
-  @Test
+  @TestTemplate
   public void testSnapshotIdInheritanceEnabled() throws Exception {
     String sourceTableLocation = newTableLocation();
     Map<String, String> properties = Maps.newHashMap();
@@ -841,7 +867,7 @@ public class TestRewriteTablePathsAction extends TestBase {
     checkFileNum(3, 2, 2, 9, result);
   }
 
-  @Test
+  @TestTemplate
   public void testMetadataCompression() throws Exception {
     String sourceTableLocation = newTableLocation();
     Map<String, String> properties = Maps.newHashMap();
@@ -867,7 +893,7 @@ public class TestRewriteTablePathsAction extends TestBase {
     checkFileNum(2, 2, 2, 8, result);
   }
 
-  @Test
+  @TestTemplate
   public void testInvalidArgs() {
     RewriteTablePath actions = actions().rewriteTablePath(table);
 
@@ -900,12 +926,12 @@ public class TestRewriteTablePathsAction extends TestBase {
         .hasMessageContaining("End version('null') cannot be empty");
   }
 
-  @Test
+  @TestTemplate
   public void testTableWithManyPartitionStatisticFile() throws IOException {
     String sourceTableLocation = newTableLocation();
-    Map<String, String> properties = Maps.newHashMap();
-    properties.put("format-version", "2");
-    String tableName = "v2tblwithPartStats";
+    Map<String, String> properties =
+        ImmutableMap.of(TableProperties.FORMAT_VERSION, String.valueOf(formatVersion));
+    String tableName = String.format("v%stblwithPartStats", formatVersion);
     Table sourceTable =
         createMetastoreTable(sourceTableLocation, properties, "default", tableName, 0, "c1");
 
@@ -932,12 +958,12 @@ public class TestRewriteTablePathsAction extends TestBase {
         result, "partition-stats", sourceTableLocation, targetTableLocation);
   }
 
-  @Test
+  @TestTemplate
   public void testTableWithManyStatisticFiles() throws IOException {
     String sourceTableLocation = newTableLocation();
-    Map<String, String> properties = Maps.newHashMap();
-    properties.put("format-version", "2");
-    String tableName = "v2tblwithmanystats";
+    Map<String, String> properties =
+        ImmutableMap.of(TableProperties.FORMAT_VERSION, String.valueOf(formatVersion));
+    String tableName = String.format("v%stblwithmanystats", formatVersion);
     Table sourceTable =
         createMetastoreTable(sourceTableLocation, properties, "default", tableName, 0);
 
@@ -961,12 +987,12 @@ public class TestRewriteTablePathsAction extends TestBase {
         iterations * 2 + 1, iterations, iterations, iterations, iterations * 6 + 1, result);
   }
 
-  @Test
+  @TestTemplate
   public void testStatisticsFileSourcePath() throws IOException {
     String sourceTableLocation = newTableLocation();
-    Map<String, String> properties = Maps.newHashMap();
-    properties.put("format-version", "2");
-    String tableName = "v2tblwithstats";
+    Map<String, String> properties =
+        ImmutableMap.of(TableProperties.FORMAT_VERSION, String.valueOf(formatVersion));
+    String tableName = String.format("v%stblwithstats", formatVersion);
     Table sourceTable =
         createMetastoreTable(sourceTableLocation, properties, "default", tableName, 1);
 
@@ -989,13 +1015,17 @@ public class TestRewriteTablePathsAction extends TestBase {
     findAndAssertFileInFileList(result, ".stats", sourceTableLocation, targetTableLocation);
   }
 
-  @Test
+  @TestTemplate
   public void testMetadataCompressionWithMetastoreTable() throws Exception {
     Map<String, String> properties = Maps.newHashMap();
     properties.put(TableProperties.METADATA_COMPRESSION, "gzip");
     Table sourceTable =
         createMetastoreTable(
-            newTableLocation(), properties, "default", "testMetadataCompression", 2);
+            newTableLocation(),
+            properties,
+            "default",
+            String.format("v%sMetadataCompression", formatVersion),
+            2);
 
     TableMetadata currentMetadata = currentMetadata(sourceTable);
 
@@ -1023,10 +1053,11 @@ public class TestRewriteTablePathsAction extends TestBase {
   }
 
   // Metastore table tests
-  @Test
+  @TestTemplate
   public void testMetadataLocationChange() throws Exception {
+    String tableName = String.format("v%stblWithLocation", formatVersion);
     Table sourceTable =
-        createMetastoreTable(newTableLocation(), Maps.newHashMap(), "default", "tbl", 1);
+        createMetastoreTable(newTableLocation(), Maps.newHashMap(), "default", tableName, 1);
     String metadataFilePath = currentMetadata(sourceTable).metadataFileLocation();
 
     String newMetadataDir = "new-metadata-dir";
@@ -1035,7 +1066,7 @@ public class TestRewriteTablePathsAction extends TestBase {
         .set(TableProperties.WRITE_METADATA_LOCATION, newTableLocation() + newMetadataDir)
         .commit();
 
-    spark.sql("insert into hive.default.tbl values (1, 'AAAAAAAAAA', 'AAAA')");
+    sql("insert into hive.default.%s values (1, 'AAAAAAAAAA', 'AAAA')", tableName);
     sourceTable.refresh();
 
     // copy table
@@ -1068,12 +1099,15 @@ public class TestRewriteTablePathsAction extends TestBase {
     checkFileNum(2, 1, 1, 5, result2);
   }
 
-  @Test
+  @TestTemplate
   public void testDeleteFrom() throws Exception {
-    Map<String, String> properties = Maps.newHashMap();
-    properties.put("format-version", "2");
-    properties.put("write.delete.mode", "merge-on-read");
-    String tableName = "v2tbl";
+    Map<String, String> properties =
+        ImmutableMap.of(
+            TableProperties.FORMAT_VERSION,
+            String.valueOf(formatVersion),
+            "write.delete.mode",
+            "merge-on-read");
+    String tableName = String.format("v%stbl", formatVersion);
     Table sourceTable =
         createMetastoreTable(newTableLocation(), properties, "default", tableName, 0);
     // ingest data
@@ -1137,7 +1171,7 @@ public class TestRewriteTablePathsAction extends TestBase {
     assertEquals("Rows must match", originalData, copiedData);
   }
 
-  @Test
+  @TestTemplate
   public void testKryoDeserializeBroadcastValues() {
     sparkContext.getConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
     RewriteTablePathSparkAction action =
@@ -1148,8 +1182,11 @@ public class TestRewriteTablePathsAction extends TestBase {
     assertThat(tableBroadcast.getValue().uuid()).isEqualTo(table.uuid());
   }
 
-  @Test
+  @TestTemplate
   public void testNestedDirectoryStructurePreservation() throws Exception {
+    assumeThat(formatVersion)
+        .as("Can't add multiple DVs for the same data file in v3")
+        .isEqualTo(2);
     String sourceTableLocation = newTableLocation();
     Table sourceTable = createTableWithSnapshots(sourceTableLocation, 1);
 
@@ -1186,12 +1223,18 @@ public class TestRewriteTablePathsAction extends TestBase {
 
     DeleteFile positionDeletes1 =
         FileHelpers.writeDeleteFile(
-                sourceTable, sourceTable.io().newOutputFile(file1.toURI().toString()), deletes1)
+                sourceTable,
+                sourceTable.io().newOutputFile(file1.toURI().toString()),
+                deletes1,
+                formatVersion)
             .first();
 
     DeleteFile positionDeletes2 =
         FileHelpers.writeDeleteFile(
-                sourceTable, sourceTable.io().newOutputFile(file2.toURI().toString()), deletes2)
+                sourceTable,
+                sourceTable.io().newOutputFile(file2.toURI().toString()),
+                deletes2,
+                formatVersion)
             .first();
 
     sourceTable.newRowDelta().addDeletes(positionDeletes1).commit();
@@ -1235,7 +1278,7 @@ public class TestRewriteTablePathsAction extends TestBase {
     assertThat(targetPath2).startsWith(targetTableLocation());
   }
 
-  @Test
+  @TestTemplate
   public void testRewritePathWithoutCreateFileList() throws Exception {
     String targetTableLocation = targetTableLocation();
 

This should make all tests run with v2 and v3 and leave you with only a handful of failing tests that don't work with DVs.

@gaborkaszab
Collaborator

Thanks for the explanation, @nastra! Sure, let's add the dimension at the suite level! Another takeaway is that I have to check why these run so slowly on my MBP :)

@adawrapub
Author

@nastra Can you please review again? I've updated the PR based on the feedback.

@nastra
Contributor

nastra commented Oct 28, 2025

@adawrapub I'm a little confused about why you don't have failing tests with v3. Did you apply the diff I mentioned fully?

@adawrapub
Author

@adawrapub I'm a little confused about why you don't have failing tests with v3. Did you apply the diff I mentioned fully?

Hello @nastra, thank you for your response and the patch. I did apply all your changes, which caused some of the test cases to fail. My understanding is that format v3+ uses the Puffin format for position deletes, and the positionDeletesReader method doesn't support the Puffin format. Hence, in test cases like testPositionDeleteWithRow, testPositionDeletes, and testDeleteFrom, I added assumeThat(formatVersion) so they don't run for v3 and v4. Let me know if there is a better way to fix this. I did update FileHelpers.writePosDeleteFile and FileHelpers.writeDeleteFile to take the format version.

@nastra
Contributor

nastra commented Oct 29, 2025

Hello @nastra, thank you for your response and the patch. I did apply all your changes, which caused some of the test cases to fail. My understanding is that format v3+ uses the Puffin format for position deletes, and the positionDeletesReader method doesn't support the Puffin format. Hence, in test cases like testPositionDeleteWithRow, testPositionDeletes, and testDeleteFrom, I added assumeThat(formatVersion) so they don't run for v3 and v4. Let me know if there is a better way to fix this. I did update FileHelpers.writePosDeleteFile and FileHelpers.writeDeleteFile to take the format version.

Those remaining failing tests are the indicator of the underlying issue that needs to be fixed as part of this PR. We don't want to skip those tests for v3+; we need to make rewriting table paths work for v3+, which is the scope of #13671 and which I also mentioned in #14226 (comment). You need to apply a similar approach to what was done in #11657 in order to allow reading v3 deletes and thus make those tests pass.

@adawrapub
Author

adawrapub commented Nov 4, 2025

Hello @nastra, thank you for your response and the patch. I did apply all your changes, which caused some of the test cases to fail. My understanding is that format v3+ uses the Puffin format for position deletes, and the positionDeletesReader method doesn't support the Puffin format. Hence, in test cases like testPositionDeleteWithRow, testPositionDeletes, and testDeleteFrom, I added assumeThat(formatVersion) so they don't run for v3 and v4. Let me know if there is a better way to fix this. I did update FileHelpers.writePosDeleteFile and FileHelpers.writeDeleteFile to take the format version.

Those remaining failing tests are the indicator of the underlying issue that needs to be fixed as part of this PR. We don't want to skip those tests for v3+; we need to make rewriting table paths work for v3+, which is the scope of #13671 and which I also mentioned in #14226 (comment). You need to apply a similar approach to what was done in #11657 in order to allow reading v3 deletes and thus make those tests pass.

@nastra As I understand it, DV files (Puffin format) store the referenced data file path in two separate locations:
  • Manifest metadata: the DeleteFile.referencedDataFile() field
  • Puffin blob metadata: the "referenced-data-file" property inside the blob
The original implementation only updated the manifest metadata. The Puffin blob metadata still contained the old path, causing the DV reader to fail when applying deletes at the new location.

I have implemented a two-pronged update strategy. Please let me know if there is a better way.

  1. Manifest metadata update
     • Added a ContentFileUtil.replaceReferencedDataFile() utility method (rough sketch below)
     • Created a RewriteTablePathUtil.newPositionDeleteEntry() helper method
     • Updates the referencedDataFile field in manifest entries
  2. Puffin content update
     • Implemented a RewriteTablePathUtil.rewriteDVFile() method
     • Reads Puffin files, updates blob metadata properties, and writes new files
     • Preserves the bitmap data while updating path references
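
For illustration, a minimal sketch of the manifest-side path rewrite from point 1; the method name mirrors replaceReferencedDataFile(), but the exact signature and body here are assumptions, not necessarily the code in the PR:

  // Sketch: rewrite the referenced data file path of a DV so the rewritten manifest entry
  // points at the target location. Returns null when the delete file is not a DV
  // (referencedDataFile() is null) or its path does not live under the source prefix.
  private static String replaceReferencedDataFile(
      DeleteFile deleteFile, String sourcePrefix, String targetPrefix) {
    String referencedDataFile = deleteFile.referencedDataFile();
    if (referencedDataFile == null || !referencedDataFile.startsWith(sourcePrefix)) {
      return null;
    }

    return targetPrefix + referencedDataFile.substring(sourcePrefix.length());
  }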

* @param targetPrefix target prefix that will replace it
* @return updated referenced data file path, or null if not applicable
*/
public static String replaceReferencedDataFile(
Contributor


Probably best to keep this internal to RewriteTablePathUtil.

Author


@nastra Updated. Please let me know if it looks good to you.

@nastra
Contributor

nastra commented Nov 7, 2025

@adawrapub can you resolve conflicts please so that CI runs?

* @param targetPrefix target prefix that will replace it
* @return updated referenced data file path, or null if not applicable
*/
private static String replaceReferencedDataFile(
Contributor


Suggested change
- private static String replaceReferencedDataFile(
+ private static String rewriteReferencedDataFilePathForDV(

throws IOException {
InputFile sourceFile = io.newInputFile(deleteFile.location());

try (org.apache.iceberg.puffin.PuffinReader reader =
Contributor


Suggested change
- try (org.apache.iceberg.puffin.PuffinReader reader =
+ try (PuffinReader reader =


List<org.apache.iceberg.puffin.BlobMetadata> blobs = reader.fileMetadata().blobs();

try (org.apache.iceberg.puffin.PuffinWriter writer =
Contributor


Suggested change
- try (org.apache.iceberg.puffin.PuffinWriter writer =
+ try (PuffinWriter writer =

List<org.apache.iceberg.puffin.BlobMetadata> blobs = reader.fileMetadata().blobs();

try (org.apache.iceberg.puffin.PuffinWriter writer =
org.apache.iceberg.puffin.Puffin.write(outputFile)
Contributor


Suggested change
- org.apache.iceberg.puffin.Puffin.write(outputFile)
+ Puffin.write(outputFile)

throws IOException {
InputFile sourceFile = io.newInputFile(deleteFile.location());

try (org.apache.iceberg.puffin.PuffinReader reader =
Contributor


I think it would be better not to mix the reader and writer here. First read the blobs, then rewrite them and store them in a list, and only then initialize the writer to write the new blobs out.

Contributor


Can you please apply the diff below?

private static void rewriteDVFile(
      DeleteFile deleteFile,
      OutputFile outputFile,
      FileIO io,
      String sourcePrefix,
      String targetPrefix)
      throws IOException {
    List<Blob> rewrittenBlobs = Lists.newArrayList();
    try (PuffinReader reader = Puffin.read(io.newInputFile(deleteFile.location())).build()) {
      // Read all blobs and rewrite them with updated referenced data file paths
      for (Pair<org.apache.iceberg.puffin.BlobMetadata, ByteBuffer> blobPair :
          reader.readAll(reader.fileMetadata().blobs())) {
        org.apache.iceberg.puffin.BlobMetadata blobMetadata = blobPair.first();
        ByteBuffer blobData = blobPair.second();

        // Get the original properties and update the referenced data file path
        Map<String, String> properties = Maps.newHashMap(blobMetadata.properties());
        String referencedDataFile = properties.get("referenced-data-file");
        if (referencedDataFile != null && referencedDataFile.startsWith(sourcePrefix)) {
          String newReferencedDataFile = newPath(referencedDataFile, sourcePrefix, targetPrefix);
          properties.put("referenced-data-file", newReferencedDataFile);
        }

        // Create a new blob with updated properties
        rewrittenBlobs.add(
            new Blob(
                blobMetadata.type(),
                blobMetadata.inputFields(),
                blobMetadata.snapshotId(),
                blobMetadata.sequenceNumber(),
                blobData,
                PuffinCompressionCodec.forName(blobMetadata.compressionCodec()),
                properties));
      }
    }

    try (PuffinWriter writer =
        Puffin.write(outputFile).createdBy(IcebergBuild.fullVersion()).build()) {
      rewrittenBlobs.forEach(writer::write);
    }
  }

blobMetadata.snapshotId(),
blobMetadata.sequenceNumber(),
blobData,
null, // compression codec (keep uncompressed)
Contributor


Suggested change
- null, // compression codec (keep uncompressed)
+ PuffinCompressionCodec.forName(blobMetadata.compressionCodec()),
