-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add fsspec as alternate file protocol #946
base: master
Are you sure you want to change the base?
Conversation
In principle, fsspec can completely subsume the current protocols (file and s3) and can provide many more. However, since it is a new integration and getting credentialling correct can be difficult, I implemented it as an alternate protocol which people can migrate to. Neither s3fs and gcsfs (installed via fsspec[s3,gcs]) have wheels for alpine, so they need to be built during startup. To correctly run the tests, the following extra install of dev packages is needed: apk add musl-dev
The CI failures are due to problems building wheels for dependencies of s3fs and gcsfs. I believe this will require updates to the datajoint/pydev image to include additional dev packages. |
As per my comments above, I believe these failures need |
@chrisroat Thanks for the PR and good stuff here man! 🚀 We are currently playing a bit of some catch-up on other regression issues and new team members (they've recently patched 2 issues you opened!) but we haven't forgotten about this PR. We may be a bit slow in our response/review while adjusting to this new cycle so we appreciate your patience with this. Regarding the issues of the docker images here, we are also currently in the process of upgrading it to a new image stack so we will likely not continue using |
Yes, fsspec is python only (AFAIK). It actually doesn't seem to have any dependencies, as it is really just a specification. There can be extra packages, depending on which underlying filesystem is being used (i.e. gcsfs for GCS, s3fs for S3) -- and those could have additional dependencies. Datajoint could include some of those packages/dependencies, or leave it up to the user to install what they need. I'd personally lean toward having the user install what they need, but the fact that datajoint currently brings in S3 dependencies may have already set a different precedent. |
Is there still interest in this? |
I want to use my dropbox business account as external drive location to my datajoint schema. Since dropbox is not S3, this option is currently not available - but, as I understand, this PR + some additions (https://github.com/fsspec/dropboxdrivefs), would enable it? I'd be grateful for any tips. |
Ok, let's test and merge this and test with Dropbox. Thank you for finding this @horsto and thank you for your patience @chrisroat. |
@chrisroat DataJoint core team has opened some bandwidth and we are finally coming back to review the outstanding PRs that are in the queue. Thinking back on your contribution, I really like the idea of conforming to a single interface to simplify how we add support for new external storage options. On that note, why did you choose Some additional context, we've now had much more experience using s3fs-fuse and it is demonstrating amazing performance! We've been mounting S3 BLOBs (>150GB) and accessing the files as if they were local from Python with read/write speeds consistent with online speedtests (900 Mbps in our HQ). FUSE projects have been around for quite some time and the above in particular has: caching features, strong community acceptance, compatibility for STS, and more. Perhaps the common interface is simply treating objects like a normal file and letting FUSE do the rest? 🤷 Curious on your thoughts. |
rclone seems like a very promising one as well. |
You can use fsspec with fuse-mounted filesystems, so I don't think they are competitors in that way. See the docs: I think fsspec is heavily used in the pydata ecosystem (i.e. pandas, dask), so it is a natural to support it. You'll benefit from improvements and interoperability. Regarding rclone, I note that someone even tried to make an fsspec implementation for it (it doesn't look supported, though): |
ok, let's integrate it before we add new file management |
Hi @chrisroat, Thank you for your patience. We are working on merging this. We just need to redo the tests in |
Sorry. I haven’t been working in this space for some time.
…On Fri, Sep 13, 2024 at 09:12 Dimitri Yatsenko ***@***.***> wrote:
Hi @chrisroat <https://github.com/chrisroat>, Thank you for your
patience. We are working on merging this. We just need to redo the tests in
pytest. If you are unavailable for this, we will re-open a new PR where
we will fix this.
—
Reply to this email directly, view it on GitHub
<#946 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAIBDYMCQWJV4CHJ5UBHWTLZWMFIDAVCNFSM6AAAAABOEGYGIKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNBZGMYDSMJYGM>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
@dimitri-yatsenko Okay, I'll take care of migration to pytest. |
@chrisroat Thank you for your contribution. We will close your PR and @ethho will reopen a new one with your changes. |
In principle, fsspec can completely subsume the current protocols (file and s3) and can provide many more. However, since it is a new integration and getting credentialling correct can be difficult, I implemented it as an alternate protocol which people can migrate to.
Some s3fs and gcsfs dependencies (installed via fsspec[s3,gcs]) do not have wheels for alpine, so they need to be built during startup. To correctly run the tests, the following extra install of dev packages is needed:
apk add musl-dev
Addresses #911