
Redundant scandir calls when listing source plugin #18369

Closed
bwalkowi opened this issue Jun 11, 2024 · 2 comments

Comments

bwalkowi (Contributor) commented Jun 11, 2024

Describe the bug
While testing the Onedata file source plugin, I encountered the following issue: directory listing triggers scandir SIX TIMES (sic!) and results in ambiguous errors (e.g., inability to list a directory due to lack of permissions in another directory). The likely culprit is the recently added _get_total_matches_count method in the PyFilesystem2FilesSource class (PR: #18059):

def _get_total_matches_count(self, fs: FS, path: str, filter: Optional[List[str]] = None) -> int:
    # For some reason, using "*" as glob does not return all files and directories, only files.
    # So we need to count files and directories "*/" separately.
    # Also, some filesystems do not properly support directories count (like Google Cloud Storage),
    # so we need to catch TypeError exceptions and fallback to 0.
    files_glob_pattern = f"{path}/{filter[0] if filter else '*'}"
    try:
        files_count = fs.glob(files_glob_pattern).count().files
    except TypeError:
        files_count = 0

    directory_glob_pattern = f"{files_glob_pattern}/"
    try:
        directories_count = fs.glob(directory_glob_pattern).count().directories
    except TypeError:
        directories_count = 0
    return files_count + directories_count

This method is called during each directory listing:

def _list(...) -> Tuple[List[AnyRemoteEntry], int]:
    ...
    page = self._to_page(limit, offset)
    filter = self._query_to_filter(query)
    count = self._get_total_matches_count(h, path, filter)
    result = h.filterdir(path, namespaces=["details"], page=page, files=filter, dirs=filter)
    to_dict = functools.partial(self._resource_info_to_dict, path)
    return list(map(to_dict, result)), count

Besides listing the requested range of files, two additional glob calls are made to count the total number of entries in the directory. These glob operations trigger scandir not only on the listed directory but also traverse the file tree from the root down to one level deeper than the listed directory.
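The traversal cost can be illustrated with a small stand-in for a PyFilesystem2 backend (a hypothetical `FakeFS`, not Galaxy or PyFilesystem2 code): a deliberately naive glob that expands every wildcard level starting at the root, mirroring the behavior described above, must scandir every directory on the way down, while a scoped listing touches only the target.

```python
# Hypothetical stand-in for a PyFilesystem2 backend; counts scandir calls.
class FakeFS:
    def __init__(self, tree):
        self.tree = tree          # dict: directory path -> list of entry names
        self.scandir_calls = 0

    def scandir(self, path):
        self.scandir_calls += 1
        return self.tree.get(path, [])


def glob_like(fs, parts, base="root"):
    """Naive glob: expands one '*' level at a time, starting from the root."""
    matches = [base]
    for _ in parts:
        next_matches = []
        for d in matches:
            for name in fs.scandir(d):
                next_matches.append(f"{d}/{name}")
        matches = next_matches
    return matches


fs = FakeFS({
    "root": ["dir1", "dir2"],
    "root/dir1": [],
    "root/dir2": ["dir21", "file"],
    "root/dir2/dir21": [],
})

# Counting "root/*/*" via glob scans root, dir1 AND dir2 ...
glob_like(fs, ["*", "*"])
print(fs.scandir_calls)   # 3: root, root/dir1, root/dir2

# ... while a direct listing of dir2 needs a single scandir.
fs.scandir_calls = 0
fs.scandir("root/dir2")
print(fs.scandir_calls)   # 1
```

The gap grows with nesting depth and sibling count, which is why the glob-based counting is so much more expensive than the listing itself.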

Given the following directory structure:

root
    dir1
    dir2
        dir21
        file

Attempting to list dir2 without permissions for dir1 or dir21 causes glob to return an error, failing the entire listing process. This can be partially mitigated by providing the path parameter when calling glob (by default the scan starts from the filesystem root):

files_count = fs.glob(files_glob_pattern, path=path).count().files

However, this does not solve the second problem.

Listing plus glob results in three scandir calls (since glob starts from the root, the number of calls varies with the nesting level). Multiply this by two, because the /api/remote_files endpoint is called twice for the first directory listing:

  • First call, probably to get the total file count in the directory (returns total_matches header) - most parameters are undefined (limit == None, offset == None).
  • Second call to fetch entries for the first window (paging) - offset == 0 and limit == 25.

Switching to the next window lists only the appropriate chunk (e.g., offset == 50 and limit == 25). Nevertheless, glob is still executed.
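One low-effort mitigation for the repeated count on every paging window would be to memoize the expensive total per (path, filter) key. This is an illustration only, not what Galaxy does; `expensive_count` is a hypothetical stand-in for the glob-based counting, and a real plugin would also need the connection/user in the cache key plus an invalidation policy:

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def cached_total_matches(path: str, filter_key: tuple) -> int:
    # Hypothetical memoization of the expensive glob-based count.
    return expensive_count(path, filter_key)


calls = []  # records how often the expensive count actually runs


def expensive_count(path, filter_key):
    calls.append(path)
    return 42  # placeholder total


print(cached_total_matches("/dir2", ("*",)))  # 42, computed once
print(cached_total_matches("/dir2", ("*",)))  # 42, served from cache
print(len(calls))                             # 1
```

With such a cache, switching windows would reuse the count from the first page instead of re-running glob, at the cost of possibly stale totals while the directory changes.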

Galaxy Version and/or server at which you observed the bug
Galaxy Version: 24.2.dev0
Commit: ed117ef

To Reproduce
Steps to reproduce the behavior:

  1. List any source plugin in Web UI and log PyFilesystem2FilesSource._list calls

Expected behavior

  1. Reconsider the idea of counting files. For vast directories (which do occur in scientific datasets) this operation may be very costly, yet it is performed only to learn the element count. Of course, the pagination would have to be reworked, as it is not possible to know the number of pages beforehand. Some ideas:
    • Count only up to 1000 elements. For smaller (<=1000) directories, it can work as currently. Treat bigger (>1000) directories as possibly "infinite". Most directories have fewer than 1000 elements, and for bigger ones, page-based navigation may not be viable anyway.
    • For "infinite" directories, allow jumping to an arbitrary page - if it exceeds the actual range, display the last page.
    • Implement alternative navigation such as "jump to" where the user specifies a lexicographic prefix.
  2. Separate Endpoint for glob: If glob is needed only for the initial listing to determine the total file count in the directory and is not used afterward, consider implementing a separate endpoint called once before the actual listing.
  3. Use scandir or filterdir Directly: Instead of glob, which iterates scandir from the root, consider using scandir or filterdir directly:

sum(1 for _ in h.filterdir(path, namespaces=["details"], files=filter, dirs=filter))

Unfortunately, this also lists all files in the directory. It is unclear whether counting files without listing them is feasible, especially in the generic PyFilesystem2FilesSource.
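The "count only up to N" idea from point 1 can be sketched with itertools.islice over whatever iterator the backend provides (here a plain generator standing in for the result of filterdir), so that a vast directory costs at most N+1 consumed entries rather than a full enumeration:

```python
from itertools import islice


def bounded_count(entries, limit=1000):
    """Count entries up to `limit`; return (count, is_truncated).

    `entries` is any iterator (e.g. the result of fs.filterdir);
    at most limit + 1 items are consumed, so a huge directory is
    never fully enumerated just to produce a total.
    """
    consumed = sum(1 for _ in islice(entries, limit + 1))
    if consumed > limit:
        return limit, True    # "infinite" directory: could display as "1000+"
    return consumed, False


# Usage with generators standing in for fs.filterdir(...):
print(bounded_count(iter(range(5))))      # (5, False)
print(bounded_count(iter(range(5000))))   # (1000, True)
```

The truncation flag is what the UI would need to switch from exact page numbers to "infinite" navigation.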

davelopez (Contributor) commented Jun 13, 2024

The additional unpaginated request was not there to retrieve the count; it was a bug in the UI when setting up the items provider and should be fixed by #18387. This was probably the most impactful issue.

Thanks again for the very detailed report, it helped a lot.

davelopez (Contributor) commented

Hopefully, with #18372 and #18387 this is in better shape now.
We could think of some other ways to improve performance in the dev branch but for a fix to 24.1 this seems reasonable so I will close it. Please feel free to reopen if it is still a concern.
