Use SupportsPrefixOperations for Remove OrphanFile Procedure #11906
base: main
0846141 to 6267e48
...k/v3.5/spark/src/test/java/org/apache/iceberg/spark/actions/TestRemoveOrphanFilesAction.java
.../v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/DeleteOrphanFilesSparkAction.java
6267e48 to a191684
cc @flyrain @RussellSpitzer @rahil-c It's ready for review, and a test has been added. I would also appreciate any suggestions on the failing test.
The test here says it's failing because you are deleting
@@ -854,12 +867,14 @@ public void testCompareToFileList() throws IOException {
    .as("Invalid file should be present")
    .isTrue();

DeleteOrphanFiles.Result result3 =
DeleteOrphanFilesSparkAction action3 =
While we are modifying things here, please rename all name# variables to something relevant to the test. The names should reflect what we are checking.
Renamed to more descriptive names.
… 3.5, improve naming
Co-authored-by: Rahil Chertara
bdd982c to 2212f2a
public boolean hasHiddenPttParentFolder(Path path) {
  return Stream.iterate(path, Path::getParent)
      .takeWhile(Objects::nonNull)
      .anyMatch(parentPath -> !doAccept(parentPath));
}
It now checks the parent folders of each file to ensure that none of them is a hidden partition folder. This might be less performant for large listings, if performance is a concern.
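To make the per-file parent check concrete, here is a minimal, self-contained sketch of the same `Stream.iterate` / `takeWhile` walk. It is not the PR's code: it uses `java.nio.file.Path` instead of Hadoop's `Path`, and the `isVisible` rule (names starting with `_` or `.` are hidden) is an assumption standing in for `doAccept`, whose partition-aware logic is not shown in this thread.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Objects;
import java.util.stream.Stream;

final class HiddenParentCheck {

  // Assumption for this sketch: a path component is hidden if its name
  // starts with '_' or '.'. This stands in for the PR's doAccept(...).
  static boolean isVisible(Path path) {
    Path name = path.getFileName();
    if (name == null) {
      return true; // a filesystem root has no file name
    }
    String text = name.toString();
    return !(text.startsWith("_") || text.startsWith("."));
  }

  // Walks from the path itself up through every ancestor and reports
  // whether any component on the way is hidden.
  static boolean hasHiddenParent(Path path) {
    return Stream.iterate(path, Path::getParent)
        .takeWhile(Objects::nonNull)
        .anyMatch(candidate -> !isVisible(candidate));
  }

  public static void main(String[] args) {
    System.out.println(hasHiddenParent(Paths.get("data/_tmp/file.parquet")));
    System.out.println(hasHiddenParent(Paths.get("data/c1=1/file.parquet")));
  }
}
```

The cost is one walk per file, proportional to the path depth, which is the overhead the comment above refers to. Caching results per parent directory would be one way to reduce it, though that is not something the PR is stated to do.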
// NOTE: check the path relative to table location. To avoid checking unnecessary root
// folders
Path relativeFilePath = new Path(fileInfo.location().replace(location, ""));
This creates a relative path to avoid checking the parent folders of the table location. However, `replace(location, "")` might not be the best solution. Open to any ideas.
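One possible alternative, sketched below under stated assumptions, is to strip the table location only when it is a true prefix of the file location. `String.replace` removes every occurrence of the location string, including ones in the middle of the path, so a prefix check is stricter. The class and method names here are hypothetical, and the sketch assumes both locations use the same scheme and authority spelling (it does not normalize, for example, `s3://` against `s3a://`).

```java
final class RelativeLocation {

  // Returns fileLocation relative to tableLocation, or throws if the file
  // does not live under the table location.
  static String relativize(String tableLocation, String fileLocation) {
    // Ensure the base ends with a separator so that "/tbl" does not match "/tbl2/...".
    String base = tableLocation.endsWith("/") ? tableLocation : tableLocation + "/";
    if (!fileLocation.startsWith(base)) {
      throw new IllegalArgumentException(
          "File " + fileLocation + " is not under table location " + tableLocation);
    }
    return fileLocation.substring(base.length());
  }

  public static void main(String[] args) {
    String location = "/tbl";
    String file = "/tbl/data/tbl/a.parquet";
    // replace() also removes the second "/tbl" in the middle of the path.
    System.out.println(file.replace(location, ""));
    System.out.println(relativize(location, file));
  }
}
```

With the inputs in `main`, `replace` yields `/data/a.parquet`, while the prefix-based version yields `data/tbl/a.parquet`.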
@ismailsimsek my issue with this PR is the same as with the previous PR: this isn't a scalable solution. The file system approach was able to parallelize the work through directory traversal, but this does not. I think we need a way to break up the prefixes appropriately so that we can distribute the listing.
}

@VisibleForTesting
Dataset<String> listWithPrefix() {
We need a way to break up the key space, possibly by taking hints from what `LocationProvider` is configured for the table. A single listing is not scalable.
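As a rough illustration of the idea of breaking up the key space, the sketch below fans a listing out over several disjoint sub-prefixes and merges the results. Everything in it is an assumption for illustration: `PrefixFanOut`, `listAll`, and `inMemoryLister` are hypothetical names, the sub-prefixes are supplied by the caller (how to derive them, for example from `LocationProvider`, is exactly the open question in this thread), and a parallel stream stands in for distributing the work across Spark tasks.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

final class PrefixFanOut {

  // Lists every sub-prefix independently and merges the results.
  // The prefixes are assumed to be disjoint; overlapping prefixes
  // would produce duplicate entries.
  static List<String> listAll(List<String> prefixes, Function<String, List<String>> lister) {
    return prefixes.parallelStream()
        .flatMap(prefix -> lister.apply(prefix).stream())
        .sorted()
        .collect(Collectors.toList());
  }

  // In-memory stand-in for a storage listing call, for demonstration only.
  static Function<String, List<String>> inMemoryLister(List<String> keys) {
    return prefix ->
        keys.stream().filter(key -> key.startsWith(prefix)).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> keys =
        List.of("tbl/data/a", "tbl/data/b", "tbl/metadata/m1", "other/x");
    List<String> prefixes = List.of("tbl/data/", "tbl/metadata/");
    System.out.println(listAll(prefixes, inMemoryLister(keys)));
  }
}
```

In a Spark action, the analogous shape would be to turn the sub-prefixes into a distributed collection and run the listing inside the tasks, so that no single listing call has to enumerate the whole table location.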
Continuing #7914
listWithPrefix
),PartitionAwareHiddenPathFilter
testHiddenPathsStartingWithPartitionNamesAreIgnored
With this change, current executions now use `HadoopFileIO`, which implements `DelegateFileIO` and `SupportsPrefixOperations`. This results in calls to the new `listWithPrefix` method.