Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fsspec as alternate file protocol #946

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

chrisroat
Copy link
Contributor

@chrisroat chrisroat commented Aug 22, 2021

In principle, fsspec can completely subsume the current protocols (file and s3) and can provide many more. However, since it is a new integration and getting credentialling correct can be difficult, I implemented it as an alternate protocol which people can migrate to.

Some s3fs and gcsfs dependencies (installed via fsspec[s3,gcs]) do not have wheels for alpine, so they need to be built during startup. To correctly run the tests, the following extra install of dev packages is needed:

apk add musl-dev

Addresses #911

In principle, fsspec can completely subsume the current protocols (file and s3) and can provide many more.  However, since it is a new integration and getting credentialling correct can be difficult, I implemented it as an alternate protocol which people can migrate to.

Neither s3fs and gcsfs (installed via fsspec[s3,gcs]) have wheels for alpine, so they need to be built during startup.   To correctly run the tests, the following extra install of dev packages is needed:

apk add musl-dev
@chrisroat
Copy link
Contributor Author

chrisroat commented Aug 22, 2021

The CI failures are due to problems building wheels for dependencies of s3fs and gcsfs. I believe this will require updates to the datajoint/pydev image to include additional dev packages.

@chrisroat
Copy link
Contributor Author

As per my comments above, I believe these failures need apk add musl-dev in datajoint/pydev in order to succeed. If you'd like to move forward with this, we can work on making that update and resolving conflicts.

@guzman-raphael
Copy link
Collaborator

@chrisroat Thanks for the PR and good stuff here man! 🚀

We are currently playing a bit of some catch-up on other regression issues and new team members (they've recently patched 2 issues you opened!) but we haven't forgotten about this PR. We may be a bit slow in our response/review while adjusting to this new cycle so we appreciate your patience with this.

Regarding the issues of the docker images here, we are also currently in the process of upgrading it to a new image stack so we will likely not continue using datajoint/pydev in favor of datajoint/djtest. A PR for that update will soon be coming addressing that along with fixing the broken datajoint/datajoint image. Regarding the external storage support for fsspec, is this purely a Python dependency or are other system dependencies involved?

@chrisroat
Copy link
Contributor Author

Yes, fsspec is python only (AFAIK). It actually doesn't seem to have any dependencies, as it is really just a specification.

There can be extra packages, depending on which underlying filesystem is being used (i.e. gcsfs for GCS, s3fs for S3) -- and those could have additional dependencies. Datajoint could include some of those packages/dependencies, or leave it up to the user to install what they need.

I'd personally lean toward having the user install what they need, but the fact that datajoint currently brings in S3 dependencies may have already set a different precedent.

@chrisroat
Copy link
Contributor Author

Is there still interest in this?

@horsto
Copy link
Contributor

horsto commented Dec 27, 2022

I want to use my dropbox business account as external drive location to my datajoint schema. Since dropbox is not S3, this option is currently not available - but, as I understand, this PR + some additions (https://github.com/fsspec/dropboxdrivefs), would enable it? I'd be grateful for any tips.

@dimitri-yatsenko
Copy link
Member

Ok, let's test and merge this and test with Dropbox. Thank you for finding this @horsto and thank you for your patience @chrisroat.

@guzman-raphael
Copy link
Collaborator

@chrisroat DataJoint core team has opened some bandwidth and we are finally coming back to review the outstanding PRs that are in the queue. Thinking back on your contribution, I really like the idea of conforming to a single interface to simplify how we add support for new external storage options.

On that note, why did you choose fsspec as opposed to FUSE-based approaches? Is there an additional benefit to using the specification as you proposed?

Some additional context, we've now had much more experience using s3fs-fuse and it is demonstrating amazing performance! We've been mounting S3 BLOBs (>150GB) and accessing the files as if they were local from Python with read/write speeds consistent with online speedtests (900 Mbps in our HQ).

FUSE projects have been around for quite some time and the above in particular has: caching features, strong community acceptance, compatibility for STS, and more. Perhaps the common interface is simply treating objects like a normal file and letting FUSE do the rest? 🤷

Curious on your thoughts.

@guzman-raphael
Copy link
Collaborator

rclone seems like a very promising one as well.

@chrisroat
Copy link
Contributor Author

You can use fsspec with fuse-mounted filesystems, so I don't think they are competitors in that way. See the docs:
https://filesystem-spec.readthedocs.io/en/latest/features.html#mount-anything-with-fuse

I think fsspec is heavily used in the pydata ecosystem (i.e. pandas, dask), so it is a natural to support it. You'll benefit from improvements and interoperability.

Regarding rclone, I note that someone even tried to make an fsspec implementation for it (it doesn't look supported, though):
https://pypi.org/project/fsspec-rclone/

@dimitri-yatsenko
Copy link
Member

ok, let's integrate it before we add new file management

@dimitri-yatsenko
Copy link
Member

Hi @chrisroat, Thank you for your patience. We are working on merging this. We just need to redo the tests in pytest. If you are unavailable for this, we will re-open a new PR where we will fix this.

@chrisroat
Copy link
Contributor Author

chrisroat commented Sep 13, 2024 via email

@ethho
Copy link
Contributor

ethho commented Sep 13, 2024

@dimitri-yatsenko Okay, I'll take care of migration to pytest.

@dimitri-yatsenko
Copy link
Member

@chrisroat Thank you for your contribution. We will close your PR and @ethho will reopen a new one with your changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants