
Redundant scandir calls when listing source plugin #18369

Closed
bwalkowi opened this issue Jun 11, 2024 · 2 comments

Comments

bwalkowi (Contributor) commented Jun 11, 2024

Describe the bug
While testing the Onedata file source plugin, I encountered the following issue: directory listing triggers scandir SIX TIMES (sic!) and results in ambiguous errors (e.g., inability to list a directory due to lack of permissions in another directory). The likely culprit is the recently added _get_total_matches_count method in the PyFilesystem2FilesSource class (PR: #18059):

def _get_total_matches_count(self, fs: FS, path: str, filter: Optional[List[str]] = None) -> int:
    # For some reason, using "*" as glob does not return all files and directories, only files.
    # So we need to count files and directories "*/" separately.
    # Also, some filesystems do not properly support directories count (like Google Cloud Storage),
    # so we need to catch TypeError exceptions and fallback to 0.
    files_glob_pattern = f"{path}/{filter[0] if filter else '*'}"
    try:
        files_count = fs.glob(files_glob_pattern).count().files
    except TypeError:
        files_count = 0

    directory_glob_pattern = f"{files_glob_pattern}/"
    try:
        directories_count = fs.glob(directory_glob_pattern).count().directories
    except TypeError:
        directories_count = 0
    return files_count + directories_count

This method is called during each directory listing:

def _list(...) -> Tuple[List[AnyRemoteEntry], int]:
    ...
    page = self._to_page(limit, offset)
    filter = self._query_to_filter(query)
    count = self._get_total_matches_count(h, path, filter)
    result = h.filterdir(path, namespaces=["details"], page=page, files=filter, dirs=filter)
    to_dict = functools.partial(self._resource_info_to_dict, path)
    return list(map(to_dict, result)), count

Besides listing the requested range of files, two additional glob calls are made to count the total number of entries in the directory. These glob operations trigger scandir not only on the listed directory but also traverse the file tree from the root down to one level deeper than the listed directory.
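The traversal cost can be illustrated with a small stand-in for a PyFilesystem2 backend (a hypothetical `FakeFS`, not Galaxy or PyFilesystem2 code): a deliberately naive glob that expands every wildcard level starting at the root, mirroring the behavior described above, must scandir every directory on the way down, while a scoped listing touches only the target.

```python
# Hypothetical stand-in for a PyFilesystem2 backend; counts scandir calls.
class FakeFS:
    def __init__(self, tree):
        self.tree = tree          # dict: directory path -> list of entry names
        self.scandir_calls = 0

    def scandir(self, path):
        self.scandir_calls += 1
        return self.tree.get(path, [])


def glob_like(fs, parts, base="root"):
    """Naive glob: expands one '*' level at a time, starting from the root."""
    matches = [base]
    for _ in parts:
        next_matches = []
        for d in matches:
            for name in fs.scandir(d):
                next_matches.append(f"{d}/{name}")
        matches = next_matches
    return matches


fs = FakeFS({
    "root": ["dir1", "dir2"],
    "root/dir1": [],
    "root/dir2": ["dir21", "file"],
    "root/dir2/dir21": [],
})

# Counting "root/*/*" via glob scans root, dir1 AND dir2 ...
glob_like(fs, ["*", "*"])
print(fs.scandir_calls)   # 3: root, root/dir1, root/dir2

# ... while a direct listing of dir2 needs a single scandir.
fs.scandir_calls = 0
fs.scandir("root/dir2")
print(fs.scandir_calls)   # 1
```

The gap grows with nesting depth and sibling count, which is why the glob-based counting is so much more expensive than the listing itself.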

Given the following directory structure:

root
    dir1
    dir2
        dir21
        file

Attempting to list dir2 without permissions for dir1 or dir21 causes glob to return an error, failing the entire listing process. This can be partially mitigated by providing the path parameter when calling glob (by default the scan starts from the filesystem root):

files_count = fs.glob(files_glob_pattern, path=path).count().files

However, this does not solve the second problem.

Listing plus glob results in three scandir calls (since glob starts from the root, the number of calls varies with the nesting level). Multiply this by two, because the /api/remote_files endpoint is called twice for the first directory listing:

  • First call, probably to get the total file count in the directory (returns total_matches header) - most parameters are undefined (limit == None, offset == None).
  • Second call to fetch entries for the first window (paging) - offset == 0 and limit == 25.

Switching to the next window lists only the appropriate chunk (e.g., offset == 50 and limit == 25). Nevertheless, glob is still executed.
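One low-effort mitigation for the repeated count on every paging window would be to memoize the expensive total per (path, filter) key. This is an illustration only, not what Galaxy does; `expensive_count` is a hypothetical stand-in for the glob-based counting, and a real plugin would also need the connection/user in the cache key plus an invalidation policy:

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def cached_total_matches(path: str, filter_key: tuple) -> int:
    # Hypothetical memoization of the expensive glob-based count.
    return expensive_count(path, filter_key)


calls = []  # records how often the expensive count actually runs


def expensive_count(path, filter_key):
    calls.append(path)
    return 42  # placeholder total


print(cached_total_matches("/dir2", ("*",)))  # 42, computed once
print(cached_total_matches("/dir2", ("*",)))  # 42, served from cache
print(len(calls))                             # 1
```

With such a cache, switching windows would reuse the count from the first page instead of re-running glob, at the cost of possibly stale totals while the directory changes.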

Galaxy Version and/or server at which you observed the bug
Galaxy Version: 24.2.dev0
Commit: ed117ef

To Reproduce
Steps to reproduce the behavior:

  1. List any source plugin in Web UI and log PyFilesystem2FilesSource._list calls

Expected behavior

  1. Reconsider the idea of counting files. For vast directories (which do occur in scientific datasets) this operation may be very costly, yet it is performed only to learn the element count. Of course, the pagination would have to be reworked, as it is not possible to know the number of pages beforehand. Some ideas:
    • Count only up to 1000 elements. For smaller (<=1000) directories, it can work as currently. Treat bigger (>1000) directories as possibly "infinite". Most directories have fewer than 1000 elements, and for bigger ones, page-based navigation may not be viable anyway.
    • For "infinite" directories, allow jumping to an arbitrary page - if it exceeds the actual range, display the last page.
    • Implement alternative navigation such as "jump to" where the user specifies a lexicographic prefix.
  2. Separate Endpoint for glob: If glob is needed only for the initial listing to determine the total file count in the directory and is not used afterward, consider implementing a separate endpoint called once before the actual listing.
  3. Use scandir or filterdir Directly: Instead of glob, which iterates scandir from the root, consider using scandir or filterdir directly:

sum(1 for _ in h.filterdir(path, namespaces=["details"], files=filter, dirs=filter))

Unfortunately, this also lists all files in the directory. It is unclear whether counting files without listing them is feasible, especially in the generic PyFilesystem2FilesSource.
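The "count only up to N" idea from point 1 can be sketched with itertools.islice over whatever iterator the backend provides (here a plain generator standing in for the result of filterdir), so that a vast directory costs at most N+1 consumed entries rather than a full enumeration:

```python
from itertools import islice


def bounded_count(entries, limit=1000):
    """Count entries up to `limit`; return (count, is_truncated).

    `entries` is any iterator (e.g. the result of fs.filterdir);
    at most limit + 1 items are consumed, so a huge directory is
    never fully enumerated just to produce a total.
    """
    consumed = sum(1 for _ in islice(entries, limit + 1))
    if consumed > limit:
        return limit, True    # "infinite" directory: could display as "1000+"
    return consumed, False


# Usage with generators standing in for fs.filterdir(...):
print(bounded_count(iter(range(5))))      # (5, False)
print(bounded_count(iter(range(5000))))   # (1000, True)
```

The truncation flag is what the UI would need to switch from exact page numbers to "infinite" navigation.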

davelopez (Contributor) commented Jun 13, 2024

The additional unpaginated request was not there to retrieve the count; it was a bug in the UI when setting up the items provider and should be fixed by #18387. This was probably the most impactful issue.

Thanks again for the very detailed report, it helped a lot.

davelopez (Contributor) commented

Hopefully, with #18372 and #18387 this is in better shape now.
We could think of some other ways to improve performance in the dev branch but for a fix to 24.1 this seems reasonable so I will close it. Please feel free to reopen if it is still a concern.
