POC: filepath GCS support #837
Conversation
Fix #750

Use the gcsfs package, based on fsspec. It provides a common interface across filesystems, so one could unify the code across local/GCS/S3 and many others. Unlike the s3 library, the bucket is not a separate entity: the location is `gs://bucket/path`. Pathlib does not like double slashes, so I put in a hack to concatenate strings.

TESTED: This would need a local GCS mock. For now, I used a test bucket with personal credentials.

If this seems useful, I'd need some guidance on whether you're happy with the gcsfs dep, and on how to mock GCS and handle paths properly.
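For illustration (not part of the PR diff), a minimal sketch of the pathlib behavior the hack works around: pathlib normalizes repeated slashes, which mangles the `gs://` scheme prefix, so plain string concatenation is used instead.

```python
# Assumed example, not from the PR: pathlib collapses the double slash
# in the 'gs://' prefix, breaking the URL scheme.
from pathlib import PurePosixPath

print(PurePosixPath('gs://bucket/path'))  # gs:/bucket/path -- scheme broken
print('/'.join(['gs://bucket', 'path']))  # gs://bucket/path -- concat preserves it
```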
```diff
@@ -73,6 +73,12 @@ def s3(self):
             self._s3 = s3.Folder(**self.spec)
         return self._s3
 
+    @property
+    def gcs(self):
+        # TODO: Add gcsfs dependency?
```
Yes, adding a gcsfs dependency would be consistent with how S3 is managed.
OK. Add dependency.
@chrisroat Thanks for the PR! This is a bold effort and looks promising! I've shared feedback to bring the implementation of the interface closer in line with how the `s3` interface was done, along with some instructions on how we can include your tests.

Also, there has since been a move from TravisCI to GitHub Actions for our tests. Would you re-pull from `datajoint:master` to properly sync up the tests?
```python
    def gcs(self):
        # TODO: Add gcsfs dependency?
        import gcsfs
        return gcsfs.GCSFileSystem(token=self.spec['token'])
```
Rather than inserting the interface directly to `gcsfs` here, I would recommend creating a `gcsfs` module within `datajoint` aligned with the common interface structure we have in `datajoint.s3.Folder`. That way it is consistent to, say, make operations when working directly with an `ExternalTable` instance, e.g. `schema.external[store].gcs.exists(..)`, `schema.external[store].gcs.get_size(..)`, `schema.external[store].gcs.remove_object(..)`, etc.
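For concreteness, a minimal sketch of what such a module could look like, assuming the method names from the examples above (the real `datajoint.s3.Folder` constructor and signatures may differ):

```python
# Hypothetical datajoint/gcs.py, patterned after the s3.Folder interface.
# Constructor arguments and method names are assumptions based on the
# review examples, not the actual s3.Folder signatures.
import gcsfs


class Folder:
    def __init__(self, location, token=None, **_):
        # location is 'gs://bucket/path'; gcsfs addresses 'bucket/path'
        self.root = location[len('gs://'):].rstrip('/')
        self.fs = gcsfs.GCSFileSystem(token=token)

    def fullpath(self, name):
        return '/'.join((self.root, str(name).lstrip('/')))

    def exists(self, name):
        return self.fs.exists(self.fullpath(name))

    def get_size(self, name):
        return self.fs.info(self.fullpath(name))['size']

    def remove_object(self, name):
        self.fs.rm(self.fullpath(name))
```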
I think what you are asking is to have a module with a class that effectively renames the methods to provide the same interface as the `s3` module?

Do you think there would be a desire to instead move the other way, and bring the `s3` work toward the common fsspec framework using s3fs, which is used in projects like pandas? You'd get many other filesystems for free as well (e.g. Azure).
https://pandas.pydata.org/docs/whatsnew/v1.1.0.html#fsspec-now-used-for-filesystem-handling
pandas-dev/pandas#33452
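As a sketch of what that unification buys (the paths below are hypothetical), the same fsspec call works across backends; only the protocol prefix changes:

```python
# fsspec dispatches on the URL scheme: 'file://' is built in, while
# 's3://' and 'gs://' load s3fs and gcsfs respectively (if installed).
import fsspec

for url in ('file:///tmp/data.bin', 's3://bucket/data.bin', 'gs://bucket/data.bin'):
    with fsspec.open(url, 'rb') as f:  # identical API regardless of backend
        payload = f.read()
```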
Good point! Thanks for raising this, @chrisroat. I am not familiar with this, but let me look into it. At a cursory glance, it seems to satisfy what we are looking for; I just need to confirm any performance impact (if any at all, though `pandas` adoption is a good sign).
```diff
@@ -45,6 +45,8 @@
     Path(__file__).resolve().parent,
     'external-legacy-data', 's3').iterdir()][0]
 
+GCS_CONN_INFO = dict(token=environ.get('GOOGLE_APPLICATION_CREDENTIALS'))
```
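A hedged sketch of how `GCS_CONN_INFO` might feed a test store spec, patterned after the style of the existing S3 store configs (the `protocol`/`location` field names and the bucket are assumptions, not the suite's actual keys):

```python
# Hypothetical GCS store spec for the test suite; key names mirror the
# style of the S3 store configs and are assumptions.
from os import environ

GCS_CONN_INFO = dict(token=environ.get('GOOGLE_APPLICATION_CREDENTIALS'))

GCS_STORE_SPEC = dict(
    protocol='gcs',
    location='gs://dj-test-bucket/store',  # hypothetical test bucket/prefix
    **GCS_CONN_INFO)
```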
To properly include your tests in our tests, we will need to find a self-contained way to run a mini GCS service within our `LNX-docker-compose.yml` and `local-docker-compose.yml`. That way each clone of our repo can independently verify that tests have passed for this new store. For reference, have a look at `minio` (our small, local S3 service for testing), `fakeservices.datajoint.io` (our reverse proxy to expose a central host), and the env variables defined in `app` (the service where the tests are run against `datajoint`). A quick look turns up an example such as fake-gcs-server; something of this sort would be needed.
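One possible wiring for the tests against such a fake, a sketch assuming gcsfs honors the `STORAGE_EMULATOR_HOST` convention and that fake-gcs-server is running locally on its default port (verify both against the versions in use):

```python
# Point gcsfs at a local fake-gcs-server instead of real GCS.
# STORAGE_EMULATOR_HOST support in gcsfs is an assumption to verify;
# port 4443 is fake-gcs-server's default (it may need -scheme http).
import os

import gcsfs

os.environ['STORAGE_EMULATOR_HOST'] = 'http://localhost:4443'
fs = gcsfs.GCSFileSystem(token='anon')  # no real credentials against the fake
print(fs.ls('test-bucket'))  # assumes the fake was seeded with 'test-bucket'
```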
Removing in favor of #946