Custom Walker for key-value-store filesystems #45

Open · birnbaum opened this issue Sep 26, 2018 · 3 comments

@birnbaum
Has any work been done toward a custom Walker for key-value-store filesystems? Walking with the standard Walker is extremely slow on my gcsfs implementation because of the sheer number of requests, and I can imagine it's the same on s3fs.

Walking on buckets could be implemented fairly efficiently because it comes down to something like bucket.list(); one would just need to format the walk output correctly. This way we would need far fewer S3/GCS calls. Am I missing something here, or is this correct?
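For illustration, a single batched listing with the modern google-cloud-storage client might look like this (bucket name is hypothetical):

```python
from google.cloud import storage

# One listing operation (the client pages through results transparently)
# returns every key in the bucket, instead of one request per "directory".
client = storage.Client()
keys = [blob.name for blob in client.list_blobs("my-bucket")]
```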

Are there currently any custom Walker implementations? And where would such a custom Walker live? In the main pyfilesystem2 repo?

Thanks! :)

@willmcgugan
Member

Not that I know of, but it would be a good idea.

There is a walker_class class attribute on the base FS. In theory, if you supply a custom walker class with the same interface, everything should just work.
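A minimal sketch of that wiring, assuming the walker_class hook works as described (class names are hypothetical, and the abstract FS methods are omitted):

```python
from fs.base import FS
from fs.walk import Walker

class BucketWalker(Walker):
    """Walker that would fetch all keys in one listing call instead of
    recursing directory by directory (implementation left open here)."""

class GCSFS(FS):
    # (getinfo, listdir, openbin, etc. omitted for brevity)
    # fs.walk on this filesystem would now be backed by BucketWalker.
    walker_class = BucketWalker
```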

Let me know if you need any help with that. I would almost certainly want to borrow the implementation for S3FS.

BTW, if you are copying files, the slow walking is somewhat ameliorated by the multi-threaded copying, since the walking can be done in the background.

@birnbaum
Author

Cool, thanks for the tips, I'll give it a try!

The walking is slow in my use case because I am walking over deeply "nested" keys. For every level, a separate request is sent to GCS, which is, of course, a lot slower than retrieving the keys in large batches and "faking" the (path, dirs, files) tuple under the hood.
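As a sketch of that "faking" step, here is one way to group a flat key listing into per-directory dirs/files with only the standard library (assumes plain object keys with "/" separators, no directory-placeholder objects):

```python
from collections import defaultdict
import posixpath

def group_keys(keys):
    """Group flat object keys into per-directory dir and file names,
    synthesizing the intermediate "directories" that only exist implicitly."""
    dirs = defaultdict(set)
    files = defaultdict(list)
    for key in keys:
        path = "/" + key
        files[posixpath.dirname(path)].append(posixpath.basename(path))
        # Register every ancestor directory implied by the key.
        parent = posixpath.dirname(path)
        while parent != "/":
            dirs[posixpath.dirname(parent)].add(posixpath.basename(parent))
            parent = posixpath.dirname(parent)
    return dirs, files

# One hypothetical bucket listing, fetched in a single batch:
keys = ["a.txt", "docs/b.txt", "docs/deep/c.txt"]
dirs, files = group_keys(keys)
# dirs  == {"/": {"docs"}, "/docs": {"deep"}}
# files == {"/": ["a.txt"], "/docs": ["b.txt"], "/docs/deep": ["c.txt"]}
```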

@birnbaum
Author

Unfortunately, it's a little more complicated than I thought. For example:

The first element returned by walk is supposed to contain all dirs and files on the root level. Now you have two options:

  1. Either you follow the standard walk algorithm and only list up to the first delimiter. This means you get the first result very quickly, but the walk becomes extremely slow if there are a lot of folders further down.
  2. You do it the way I proposed and send as few requests as possible by retrieving the keys in large batches. The problem here: you have to list the entire bucket before you can return the first element, as you cannot compute the dirs and files fields of the tuple beforehand.

Unfortunately, there is no real way to be smart here: one cannot anticipate how many files or folders are in a bucket, or which algorithm will be faster or make more sense. In general, if you know that you will need to walk the entire fs anyway, option 2 will be a lot faster (which is my use case); see the sketch below. I don't think it should be the default walker, though.
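A rough sketch of option 2 as a Walker subclass, reusing the group_keys helper from above and assuming a hypothetical list_all_keys() method on the filesystem (real code would also honor the path and namespaces arguments plus the walker's filter options):

```python
from fs.info import Info
from fs.walk import Step, Walker

class BatchWalker(Walker):
    """Option 2: one listing up front, then synthesize Step tuples in memory."""

    def walk(self, fs, path="/", namespaces=None):
        # Single batched listing; the whole bucket must be listed before the
        # first Step can be yielded, which is exactly the tradeoff above.
        # (path and namespaces are ignored in this sketch.)
        dirs, files = group_keys(fs.list_all_keys())  # hypothetical method
        # Sorting puts "/" first and parents before children.
        for dir_path in sorted(set(dirs) | set(files)):
            yield Step(
                dir_path,
                [Info({"basic": {"name": name, "is_dir": True}})
                 for name in sorted(dirs.get(dir_path, ()))],
                [Info({"basic": {"name": name, "is_dir": False}})
                 for name in files.get(dir_path, [])],
            )
```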
