Use SupportsPrefixOperations for Remove OrphanFile Procedure #11906

Open
wants to merge 3 commits into
base: main
from

Conversation

ismailsimsek
Contributor

@ismailsimsek ismailsimsek commented Jan 4, 2025

Continuing #7914

  • Added a pathFilter, PartitionAwareHiddenPathFilter, to the new method (listWithPrefix)
  • Added tests verifying that the new method and the previous one return the same values
  • Fix failing test testHiddenPathsStartingWithPartitionNamesAreIgnored

With this change, current executions now use HadoopFileIO, which implements DelegateFileIO and SupportsPrefixOperations. As a result, they call the new listWithPrefix method.

@github-actions github-actions bot added the spark label Jan 4, 2025
@ismailsimsek ismailsimsek force-pushed the fix-remove-orphan-file-action branch from 0846141 to 6267e48 Compare January 4, 2025 19:16
@ismailsimsek ismailsimsek force-pushed the fix-remove-orphan-file-action branch from 6267e48 to a191684 Compare January 5, 2025 11:59
@ismailsimsek
Contributor Author

cc @flyrain @RussellSpitzer @rahil-c It's ready for review, and a test has been added. I would also appreciate any suggestions on the failing test.

@RussellSpitzer
Member

RussellSpitzer commented Jan 7, 2025

The test here says it's failing because you are deleting

```
but the following elements were unexpected:
  ["file:/tmp/junit-14563533605645158466/data/_c2_trunc/subfolder/file.txt",
    "file:/tmp/junit-14563533605645158466/data/_c2_trunc/file.txt"]
```

when those are not files that are traditionally removed by remove orphan files (which only removes files with certain prefixes). This is a behavior change, and I don't think it's actually beneficial, so is there any way to fix this?

```
TestRemoveOrphanFilesAction3 > testHiddenPathsStartingWithPartitionNamesAreIgnored() > formatVersion = 3 FAILED
    java.lang.AssertionError: [same as] 
    Expecting actual:
      ["file:/tmp/junit-14563533605645158466/metadata/v1.metadata.json",
        "file:/tmp/junit-14563533605645158466/metadata/v2.metadata.json",
        "file:/tmp/junit-14563533605645158466/metadata/2607cb5e-4b09-49d3-8887-52a5341aaaa9-m0.avro",
        "file:/tmp/junit-14563533605645158466/metadata/version-hint.text",
        "file:/tmp/junit-14563533605645158466/metadata/snap-1334501503484114515-1-2607cb5e-4b09-49d3-8887-52a5341aaaa9.avro",
        "file:/tmp/junit-14563533605645158466/data/_c2_trunc/subfolder/file.txt",
        "file:/tmp/junit-14563533605645158466/data/_c2_trunc/file.txt",
        "file:/tmp/junit-14563533605645158466/data/_c2_trunc=AA/c3=AAAA/00000-864-80ba5c39-c3da-4be3-9783-5d8666c89ccc-0-00001.parquet"]
    to contain exactly in any order:
      ["file:/tmp/junit-14563533605645158466/metadata/v1.metadata.json",
        "file:/tmp/junit-14563533605645158466/metadata/v2.metadata.json",
        "file:/tmp/junit-14563533605645158466/metadata/2607cb5e-4b09-49d3-8887-52a5341aaaa9-m0.avro",
        "file:/tmp/junit-14563533605645158466/metadata/version-hint.text",
        "file:/tmp/junit-14563533605645158466/metadata/snap-1334501503484114515-1-2607cb5e-4b09-49d3-8887-52a5341aaaa9.avro",
        "file:/tmp/junit-14563533605645158466/data/_c2_trunc=AA/c3=AAAA/00000-864-80ba5c39-c3da-4be3-9783-5d8666c89ccc-0-00001.parquet"]
    but the following elements were unexpected:
      ["file:/tmp/junit-14563533605645158466/data/_c2_trunc/subfolder/file.txt",
        "file:/tmp/junit-14563533605645158466/data/_c2_trunc/file.txt"]
```

@@ -854,12 +867,14 @@ public void testCompareToFileList() throws IOException {
.as("Invalid file should be present")
.isTrue();

DeleteOrphanFiles.Result result3 =
DeleteOrphanFilesSparkAction action3 =
Member

While we are modifying things here, please rename all name# variables to something relevant to the test. The names should be relevant to what we are checking

Contributor Author

Renamed the variables to more descriptive names.

… 3.5, improve naming

Co-authored-by: Rahil Chertara <[email protected]>
@ismailsimsek ismailsimsek force-pushed the fix-remove-orphan-file-action branch from bdd982c to 2212f2a Compare January 8, 2025 12:34
Comment on lines +648 to +652
public boolean hasHiddenPttParentFolder(Path path) {
  return Stream.iterate(path, Path::getParent)
      .takeWhile(Objects::nonNull)
      .anyMatch(parentPath -> !doAccept(parentPath));
}
Contributor Author

@ismailsimsek ismailsimsek Jan 8, 2025

It now checks the parent folders of each file to ensure that none of them is a hidden partition folder. This may be less performant for large listings, if performance is a concern.
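
For readers following along, here is a minimal, self-contained sketch of the per-file ancestor walk described above. It is not the PR's code: it uses `java.nio.file.Path` instead of Hadoop's `Path`, and `accept` is a hypothetical stand-in for `doAccept` that assumes a component is hidden when it starts with `_` or `.` unless it starts with a known partition prefix such as `_c2_trunc=`.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Objects;
import java.util.Set;
import java.util.stream.Stream;

final class HiddenParentCheck {
  // Hypothetical stand-in for doAccept(): rejects names starting with "_" or ".",
  // unless the name starts with a known hidden-partition prefix (e.g. "_c2_trunc=").
  static boolean accept(Path path, Set<String> partitionPrefixes) {
    Path fileName = path.getFileName();
    if (fileName == null) {
      return true; // root or empty path: nothing to reject
    }
    String name = fileName.toString();
    boolean partitionDir = partitionPrefixes.stream().anyMatch(name::startsWith);
    return partitionDir || !(name.startsWith("_") || name.startsWith("."));
  }

  // Walks from the path itself up through every ancestor and reports
  // whether any component is rejected by accept().
  static boolean hasHiddenAncestor(Path relativePath, Set<String> partitionPrefixes) {
    return Stream.iterate(relativePath, Path::getParent)
        .takeWhile(Objects::nonNull)
        .anyMatch(p -> !accept(p, partitionPrefixes));
  }

  public static void main(String[] args) {
    Set<String> prefixes = Set.of("_c2_trunc=");
    // Hidden folder that only looks like a partition name: should be skipped.
    System.out.println(
        hasHiddenAncestor(Paths.get("data/_c2_trunc/subfolder/file.txt"), prefixes));
    // Real partition folder whose name starts with an underscore: should be kept.
    System.out.println(
        hasHiddenAncestor(Paths.get("data/_c2_trunc=AA/c3=AAAA/file.parquet"), prefixes));
  }
}
```

The cost is one walk per file, proportional to the path depth, which is the overhead the comment above refers to.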

Comment on lines +321 to +323
// NOTE: check the path relative to the table location, to avoid checking unnecessary root
// folders
Path relativeFilePath = new Path(fileInfo.location().replace(location, ""));
Contributor Author

This creates a relative path to avoid checking the parent folders of the table. However, replace(location, "") might not be the best solution. I'm open to any ideas.
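
One possible alternative, sketched here under assumptions rather than proposed as the fix: `String.replace` removes every occurrence of the table location, wherever it appears, so stripping only a true leading prefix is safer. The class and method names below are hypothetical, and unlike the snippet above the result has no leading slash.

```java
final class RelativeLocation {
  // Returns the part of fileLocation below tableLocation.
  // Fails fast if the file is not actually under the table location,
  // instead of silently returning a mangled string.
  static String relativeTo(String tableLocation, String fileLocation) {
    String prefix = tableLocation.endsWith("/") ? tableLocation : tableLocation + "/";
    if (!fileLocation.startsWith(prefix)) {
      throw new IllegalArgumentException(
          fileLocation + " is not under table location " + tableLocation);
    }
    return fileLocation.substring(prefix.length());
  }

  public static void main(String[] args) {
    String table = "file:/tmp/tbl";
    System.out.println(relativeTo(table, "file:/tmp/tbl/data/_c2_trunc/file.txt"));
    // A sibling location such as "file:/tmp/tbl2/data/file.txt" shares the string
    // prefix "file:/tmp/tbl" but is rejected, because the prefix must end at "/".
  }
}
```

This sketch assumes both locations use the same scheme and normalization (for example, `file:/tmp/...` rather than `file:///tmp/...`); mixed forms would need to be normalized first.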

@danielcweeks
Contributor

@ismailsimsek my issue with this PR is the same as with the previous PR. This isn't a scalable solution. The file system approach was able to parallelize the work through directory traversal, but this does not.

I think we need a way to break up the prefixes appropriately so that we can distribute the listing.

@danielcweeks danielcweeks self-requested a review January 15, 2025 19:13
}

@VisibleForTesting
Dataset<String> listWithPrefix() {
Contributor

We need a way to break up the key space, possibly by taking hints from what LocationProvider is configured for the table. A single listing is not scalable.
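
To make the idea of breaking up the key space concrete, here is a rough sketch under several assumptions. The sub-prefix list is supplied by the caller (how to derive it from the table's LocationProvider is left open), `lister` is a hypothetical function standing in for a prefix listing call, and a Java parallel stream stands in for distributing the work across a cluster. None of these names come from the PR.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

final class SplitPrefixListing {
  // Lists each sub-prefix independently so the listings can run concurrently.
  // The caller must make sure the sub-prefixes together cover the whole key
  // space under tableLocation; anything outside them is not listed.
  static List<String> listAll(
      String tableLocation,
      List<String> subPrefixes,
      Function<String, List<String>> lister) {
    return subPrefixes.parallelStream()
        .map(sub -> tableLocation + "/" + sub)
        .flatMap(prefix -> lister.apply(prefix).stream())
        .sorted()
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // In-memory stand-in for an object store listing.
    List<String> store =
        List.of("s3://bucket/tbl/data/a.parquet", "s3://bucket/tbl/metadata/v1.metadata.json");
    Function<String, List<String>> lister =
        prefix ->
            store.stream()
                .filter(f -> f.startsWith(prefix + "/"))
                .collect(Collectors.toList());
    System.out.println(listAll("s3://bucket/tbl", List.of("data", "metadata"), lister));
  }
}
```

The open design question raised in the review remains how to choose sub-prefixes that split the listing evenly.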

3 participants