
POC: filepath GCS support #837

Closed
chrisroat wants to merge 2 commits

Conversation

@chrisroat (Contributor) commented on Dec 4, 2020

Fix #750

Use the gcsfs package, based on fsspec. It provides a common interface across filesystems, so one could unify the code across local/GCS/S3 and many other backends.
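
A minimal illustration of the gcsfs interface (the bucket, blob path, and token mode are placeholders, not from this PR):

import gcsfs

# gcsfs exposes GCS through the generic fsspec filesystem interface.
fs = gcsfs.GCSFileSystem(token='google_default')  # pick up default gcloud credentials
fs.exists('my-bucket/path/to/blob')               # same verbs as other fsspec backends
with fs.open('my-bucket/path/to/blob', 'rb') as f:
    data = f.read()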

Unlike the s3 library, the bucket is not a separate entity -- the location is 'gs://bucket/path'. Pathlib does not preserve double slashes, so I put in a hack that concatenates strings.
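
A quick demonstration of the double-slash problem (illustrative only):

from pathlib import PurePosixPath

# pathlib collapses the repeated slash after the scheme:
print(PurePosixPath('gs://bucket/path'))  # -> gs:/bucket/path (slash lost)
print('gs://' + 'bucket/path')            # -> gs://bucket/path (intact)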

TESTED: Would need a local GCS mock. For now, I used a test bucket with personal credentials.

@chrisroat (Contributor, Author) commented:
If this seems useful, I'd need some guidance on whether you're happy with the gcsfs dependency, and on mocking GCS and handling paths properly.

@@ -73,6 +73,12 @@ def s3(self):
            self._s3 = s3.Folder(**self.spec)
        return self._s3

    @property
    def gcs(self):
        # TODO: Add gcsfs dependency?
@dimitri-yatsenko (Member) commented on Dec 4, 2020:
Yes, adding the gcsfs dependency would be consistent with how S3 is managed.

@chrisroat (Contributor, Author) replied:
OK. I'll add the dependency.

@guzman-raphael (Collaborator) left a comment:
@chrisroat Thanks for the PR! This is a bold effort and looks promising! I've shared feedback to bring the interface implementation closer in line with how s3 was done, plus some instructions on how we can include your tests.

Also, there has since been a move from TravisCI to GitHub Actions for our tests. Would you re-pull from datajoint:master to properly sync up the tests?

    def gcs(self):
        # TODO: Add gcsfs dependency?
        import gcsfs
        return gcsfs.GCSFileSystem(token=self.spec['token'])
@guzman-raphael (Collaborator) commented:
Rather than wiring the gcsfs interface in directly here, I would recommend creating a gcs module within datajoint aligned with the common interface structure we have in datajoint.s3.Folder. That way it stays consistent when making operations directly on an ExternalTable instance, e.g. schema.external[store].gcs.exists(..), schema.external[store].gcs.get_size(..), schema.external[store].gcs.remove_object(..), etc.
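
For illustration, a minimal sketch of what such a module could look like; the constructor signature and method bodies are assumptions, with only the method names (exists, get_size, remove_object) taken from the comment above:

# hypothetical datajoint/gcs.py -- a sketch, not the actual implementation
import gcsfs

class Folder:
    # mirror a datajoint.s3.Folder-style interface on top of gcsfs
    def __init__(self, bucket, token=None, **_):
        self.bucket = bucket
        self.fs = gcsfs.GCSFileSystem(token=token)

    def _path(self, name):
        # plain string concatenation, since pathlib would collapse 'gs://'
        return '{}/{}'.format(self.bucket, name)

    def exists(self, name):
        return self.fs.exists(self._path(name))

    def get_size(self, name):
        return self.fs.info(self._path(name))['size']

    def remove_object(self, name):
        self.fs.rm(self._path(name))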

@chrisroat (Contributor, Author) replied:
I think what you are asking for is a module with a class that effectively renames the methods to provide the same interface as the s3 module?

Do you think there would be any desire to instead move the other way and bring the s3 code toward the common fsspec framework using s3fs, which is used in projects like pandas? You'd get many other filesystems (e.g. Azure) for free as well; a sketch follows the links below.

https://pandas.pydata.org/docs/whatsnew/v1.1.0.html#fsspec-now-used-for-filesystem-handling
pandas-dev/pandas#33452
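
To illustrate the unification argument -- a minimal sketch, assuming s3fs and gcsfs are installed; the URLs are placeholders:

import fsspec

# One code path resolves any supported URL to the matching backend:
# LocalFileSystem, S3FileSystem (s3fs), or GCSFileSystem (gcsfs).
for url in ['file:///tmp/example', 's3://some-bucket/key', 'gs://some-bucket/key']:
    fs, path = fsspec.core.url_to_fs(url)
    print(type(fs).__name__, fs.exists(path))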

@guzman-raphael (Collaborator) replied:
Good point! Thanks for raising this, @chrisroat. I am not familiar with it, but let me look into it. At a cursory glance, it seems to satisfy what we are looking for; we just need to confirm any performance impact (if there is one at all -- though pandas' adoption is a good sign).

@@ -45,6 +45,8 @@
                     Path(__file__).resolve().parent,
                     'external-legacy-data', 's3').iterdir()][0]

GCS_CONN_INFO = dict(token=environ.get('GOOGLE_APPLICATION_CREDENTIALS'))
@guzman-raphael (Collaborator) commented on Jan 4, 2021:
To properly include your tests in our suite, we will need a self-contained way to run a mini GCS service within our LNX-docker-compose.yml and local-docker-compose.yml. That way each clone of our repo can independently verify that the tests pass for this new store. Have a look at minio (our small, local S3 service for testing), fakeservices.datajoint.io (our reverse proxy that exposes a central host), and the env variables defined in app (the service where the tests are run against datajoint) for reference. A quick look turns up an example such as fake-gcs-server. Something of this sort would be needed.
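
A hedged sketch of how a test could target such an emulator; the port and bucket name are placeholders, and support for the emulator-host variable should be verified against the installed gcsfs version:

import os

# Point the storage stack at a local fake-gcs-server rather than real GCS;
# fake-gcs-server documents the STORAGE_EMULATOR_HOST convention.
os.environ['STORAGE_EMULATOR_HOST'] = 'http://localhost:4443'

import gcsfs

fs = gcsfs.GCSFileSystem(token='anon')  # no real credentials for the fake service
print(fs.ls('test-bucket'))             # assumes the emulator was seeded with this bucket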

@chrisroat (Contributor, Author) commented:
Removing in favor of #946

@chrisroat closed this on Aug 23, 2021
Successfully merging this pull request may close these issues.

External storage support for Google Cloud Storage