Preserve DV-specific fields for deletion vectors #14351
Conversation
Thanks for pointing me to this PR, @nastra! As I see there is going to be a … Anyway, @adawrapub I think something went sideways with the formatting of the older Spark tests, as there are a lot of lines only changing indentation, and the build fails because of that.
I'd like to echo an observation I wrote on the other PR: the runtime of this test suite already seems long, and introducing test dimensions could make it worse. I'd be careful here and only cover different format versions where it matters and actually adds to the test coverage. Another thing: @adawrapub I saw that some of the tests aren't run with the new params (like testTableWithManyPartitionStatisticFile). If I'm not mistaken, these are the ones that create the test table not via the @BeforeEach functionality but by calling …
Thank you for the feedback, @nastra and @gaborkaszab. I have updated the PR based on your comments.
Code context:

    return TestHelpers.V2_AND_ABOVE;
  }

  @Parameter private int formatVersion = 2;

Suggested change:

-  @Parameter private int formatVersion = 2;
+  @Parameter private int formatVersion;
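For readers following along, here is a minimal sketch of the class-level format-version parameterization this suggestion is part of. It assumes Iceberg's ParameterizedTestExtension and the TestHelpers.V2_AND_ABOVE list shown above; the class and test method names are illustrative, not the PR's actual code:

import java.util.List;
import org.apache.iceberg.Parameter;
import org.apache.iceberg.ParameterizedTestExtension;
import org.apache.iceberg.Parameters;
import org.apache.iceberg.TestHelpers;
import org.junit.jupiter.api.TestTemplate;
import org.junit.jupiter.api.extension.ExtendWith;

// Sketch only: run every test method once per supported format version (v2 and above).
@ExtendWith(ParameterizedTestExtension.class)
public class FormatVersionParameterizationSketch {

  @Parameters(name = "formatVersion = {0}")
  protected static List<Integer> parameters() {
    // assumed to be the list of supported format versions >= 2
    return TestHelpers.V2_AND_ABOVE;
  }

  // no default value: the extension injects each format version in turn
  @Parameter private int formatVersion;

  @TestTemplate
  public void testSomethingPathRelated() {
    // create the table with "format-version" = formatVersion, then exercise the rewrite action
  }
}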
@adawrapub there are still a bunch of tests that have …
I took another look and I'm still hesitant about the testing part. My proposal is to identify the subset of tests where the format version is relevant now, practically the ones having delete files, and add a test param for them, but not for the entire suite. Unless I miss something :)
@gaborkaszab It takes me 1 min 17 secs to run … This is because calls to …

@adawrapub also, just preserving the DVs won't make the test suite pass automatically, because there's another issue that needs to be fixed so that rewriting table paths fully works with v3, which I mentioned in #14226 (comment).

The scope of #13671 is to fix DVs with rewriting table paths, and it might seem ok to only parameterize the test methods that write DVs, but it's possible that we're missing other subtle bugs around rewriting (since we're always doing this with v2 only), hence I'm advocating to parameterize the format version at the class level to make sure we're not missing anything. However, I understand the concern around long-running test suites with a large test matrix, so we still need to keep this in mind and maybe selectively skip tests for certain format versions. @adawrapub can you please apply the following diff? It should make all tests run with v2 and v3 and give you only a handful of failing tests that don't work with DVs.
Thanks for the explanation, @nastra! Sure, let's add the dimension at the suite level! Another takeaway is that I have to check why these run that slow on my MBP :)
@nastra Can you please review again? I've updated based on the feedback.
@adawrapub I'm a little confused about why you don't have failing tests with v3. Did you apply the diff I mentioned fully?
Hello @nastra, thank you for your response and patch. I did apply all your changes, which caused some of the test cases to fail. My understanding is that format v3+ uses the Puffin format for position deletes, and the positionDeletesReader method doesn't support the Puffin format. Hence, in test cases like testPositionDeleteWithRow, testPositionDeletes, and testDeleteFrom, I added assumeThat(formatVersion) to not run for v3 and v4. Let me know if there is a better way to fix this. I did update FileHelpers.writePosDeleteFile and FileHelpers.writeDeleteFile to take the format version.
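As an aside, a minimal sketch of the skip pattern described above, assuming AssertJ assumptions are used at the top of the affected test methods; the exact guard in the PR may differ:

import static org.assertj.core.api.Assumptions.assumeThat;

// Sketch only: skip the test for format v3+, where position deletes are written
// as Puffin deletion vectors that this test's reader cannot read yet.
assumeThat(formatVersion).isLessThan(3);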
Those remaining failing tests are the indicator of the underlying issue that needs to be fixed as part of this PR. We don't want to skip those tests for v3+; we need to make rewriting table paths work for v3+, which is the scope of #13671 and which I also mentioned in #14226 (comment). You need to apply a similar approach to what has been done in #11657 in order to allow reading v3 deletes and thus make those tests pass.
@nastra As I understand it, DV files (Puffin format) store the referenced data file path in two separate locations:
- Manifest metadata: the DeleteFile.referencedDataFile() field
- Puffin blob metadata: the "referenced-data-file" property inside the blob

The original implementation only updated the manifest metadata; the Puffin blob metadata still contained the old path, causing the DV reader to fail when applying deletes at the new location. I have implemented a two-pronged update strategy. Please let me know if there is a better way.

1. Manifest metadata update
- Added a ContentFileUtil.replaceReferencedDataFile() utility method
- Created a RewriteTablePathUtil.newPositionDeleteEntry() helper method
- Updates the referencedDataFile field in manifest entries

2. Puffin content update
- Implemented a RewriteTablePathUtil.rewriteDVFile() method
- Reads Puffin files, updates blob metadata properties, writes new files
- Preserves the bitmap data while updating path references
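To make the manifest-side part (1) concrete, here is a minimal sketch of the prefix swap on DeleteFile.referencedDataFile(); the class, method name, and signature are illustrative, and the PR keeps this logic inside RewriteTablePathUtil:

import org.apache.iceberg.DeleteFile;

// Sketch only: rewrite a DV's referenced data file path when it falls under the
// source prefix. Names are illustrative; the actual PR code may differ.
class ReferencedDataFileRewriteSketch {

  static String replaceReferencedDataFile(
      DeleteFile deleteFile, String sourcePrefix, String targetPrefix) {
    String referenced = deleteFile.referencedDataFile();
    if (referenced == null || !referenced.startsWith(sourcePrefix)) {
      // not a DV, or the referenced data file is already outside the source prefix
      return referenced;
    }
    // swap the leading source prefix for the target prefix, keeping the rest of the path
    return targetPrefix + referenced.substring(sourcePrefix.length());
  }
}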
Code context:

   * @param targetPrefix target prefix that will replace it
   * @return updated referenced data file path, or null if not applicable
   */
  public static String replaceReferencedDataFile(

probably best to keep this internally in RewriteTablePathUtil

@nastra Updated. Please let me know if it looks good to you.
@adawrapub can you resolve conflicts please so that CI runs?
Code context:

   * @param targetPrefix target prefix that will replace it
   * @return updated referenced data file path, or null if not applicable
   */
  private static String replaceReferencedDataFile(

Suggested change:

-  private static String replaceReferencedDataFile(
+  private static String rewriteReferencedDataFilePathForDV(
Code context:

      throws IOException {
    InputFile sourceFile = io.newInputFile(deleteFile.location());

    try (org.apache.iceberg.puffin.PuffinReader reader =

Suggested change:

-    try (org.apache.iceberg.puffin.PuffinReader reader =
+    try (PuffinReader reader =
Code context:

    List<org.apache.iceberg.puffin.BlobMetadata> blobs = reader.fileMetadata().blobs();

    try (org.apache.iceberg.puffin.PuffinWriter writer =

Suggested change:

-    try (org.apache.iceberg.puffin.PuffinWriter writer =
+    try (PuffinWriter writer =
Code context:

    List<org.apache.iceberg.puffin.BlobMetadata> blobs = reader.fileMetadata().blobs();

    try (org.apache.iceberg.puffin.PuffinWriter writer =
        org.apache.iceberg.puffin.Puffin.write(outputFile)

Suggested change:

-        org.apache.iceberg.puffin.Puffin.write(outputFile)
+        Puffin.write(outputFile)
Code context:

      throws IOException {
    InputFile sourceFile = io.newInputFile(deleteFile.location());

    try (org.apache.iceberg.puffin.PuffinReader reader =

I think it would be better to not mix the reader and writer here. First read the blobs, then rewrite them and store them in a list, and only then init the writer to write the new blobs out.

can you please apply the below diff?
private static void rewriteDVFile(
    DeleteFile deleteFile,
    OutputFile outputFile,
    FileIO io,
    String sourcePrefix,
    String targetPrefix)
    throws IOException {
  List<Blob> rewrittenBlobs = Lists.newArrayList();

  try (PuffinReader reader = Puffin.read(io.newInputFile(deleteFile.location())).build()) {
    // Read all blobs and rewrite them with updated referenced data file paths
    for (Pair<org.apache.iceberg.puffin.BlobMetadata, ByteBuffer> blobPair :
        reader.readAll(reader.fileMetadata().blobs())) {
      org.apache.iceberg.puffin.BlobMetadata blobMetadata = blobPair.first();
      ByteBuffer blobData = blobPair.second();

      // Get the original properties and update the referenced data file path
      Map<String, String> properties = Maps.newHashMap(blobMetadata.properties());
      String referencedDataFile = properties.get("referenced-data-file");
      if (referencedDataFile != null && referencedDataFile.startsWith(sourcePrefix)) {
        String newReferencedDataFile = newPath(referencedDataFile, sourcePrefix, targetPrefix);
        properties.put("referenced-data-file", newReferencedDataFile);
      }

      // Create a new blob with updated properties
      rewrittenBlobs.add(
          new Blob(
              blobMetadata.type(),
              blobMetadata.inputFields(),
              blobMetadata.snapshotId(),
              blobMetadata.sequenceNumber(),
              blobData,
              PuffinCompressionCodec.forName(blobMetadata.compressionCodec()),
              properties));
    }
  }

  try (PuffinWriter writer =
      Puffin.write(outputFile).createdBy(IcebergBuild.fullVersion()).build()) {
    rewrittenBlobs.forEach(writer::write);
  }
}
Code context:

              blobMetadata.snapshotId(),
              blobMetadata.sequenceNumber(),
              blobData,
              null, // compression codec (keep uncompressed)

Suggested change:

-              null, // compression codec (keep uncompressed)
+              PuffinCompressionCodec.forName(blobMetadata.compressionCodec()),
This fixes the problem reported in #13671 by preserving DV-specific fields. All PR comments have been addressed.

The original PR got closed because of issues with the branch. Please let me know if there is a better way to handle this.

cc @anuragmantri @nastra and @amogh-jahagirdar