
POC: filepath GCS support #837

Closed
chrisroat wants to merge 2 commits

Conversation

@chrisroat (Contributor) commented on Dec 4, 2020

Fix #750

Use the gcsfs package, based on fsspec. It provides a common interface across filesystems, so one could unify the code across local/GCS/S3 and many other backends.
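
A minimal illustration of the gcsfs interface (the bucket, blob path, and token mode are placeholders, not from this PR):

import gcsfs

# gcsfs exposes GCS through the generic fsspec filesystem interface.
fs = gcsfs.GCSFileSystem(token='google_default')  # pick up default gcloud credentials
fs.exists('my-bucket/path/to/blob')               # same verbs as other fsspec backends
with fs.open('my-bucket/path/to/blob', 'rb') as f:
    data = f.read()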

Unlike the s3 library, the bucket is not a separate entity -- the location is 'gs://bucket/path'. Pathlib does not preserve double slashes, so I put in a hack that concatenates strings.
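
A quick demonstration of the double-slash problem (illustrative only):

from pathlib import PurePosixPath

# pathlib collapses the repeated slash after the scheme:
print(PurePosixPath('gs://bucket/path'))  # -> gs:/bucket/path (slash lost)
print('gs://' + 'bucket/path')            # -> gs://bucket/path (intact)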

TESTED: Would need a local GCS mock. For now, I used a test bucket with personal credentials.

@chrisroat (Contributor, Author) commented:
If this seems useful, I'd need some guidance on whether you're happy with the gcsfs dependency, and on mocking GCS and handling paths properly.

@@ -73,6 +73,12 @@ def s3(self):
            self._s3 = s3.Folder(**self.spec)
        return self._s3

    @property
    def gcs(self):
        # TODO: Add gcsfs dependency?
@dimitri-yatsenko (Member) commented on Dec 4, 2020:
Yes, adding the gcsfs dependency would be consistent with how S3 is managed.

@chrisroat (Contributor, Author) replied:
OK. I'll add the dependency.

@guzman-raphael (Collaborator) left a comment:
@chrisroat Thanks for the PR! This is a bold effort and looks promising! I've shared feedback to bring the interface implementation closer in line with how s3 was done, plus some instructions on how we can include your tests.

Also, there has since been a move from TravisCI to GitHub Actions for our tests. Would you re-pull from datajoint:master to properly sync up the tests?

    def gcs(self):
        # TODO: Add gcsfs dependency?
        import gcsfs
        return gcsfs.GCSFileSystem(token=self.spec['token'])
@guzman-raphael (Collaborator) commented:
Rather than wiring the gcsfs interface in directly here, I would recommend creating a gcs module within datajoint aligned with the common interface structure we have in datajoint.s3.Folder. That way it stays consistent when making operations directly on an ExternalTable instance, e.g. schema.external[store].gcs.exists(..), schema.external[store].gcs.get_size(..), schema.external[store].gcs.remove_object(..), etc.
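
For illustration, a minimal sketch of what such a module could look like; the constructor signature and method bodies are assumptions, with only the method names (exists, get_size, remove_object) taken from the comment above:

# hypothetical datajoint/gcs.py -- a sketch, not the actual implementation
import gcsfs

class Folder:
    # mirror a datajoint.s3.Folder-style interface on top of gcsfs
    def __init__(self, bucket, token=None, **_):
        self.bucket = bucket
        self.fs = gcsfs.GCSFileSystem(token=token)

    def _path(self, name):
        # plain string concatenation, since pathlib would collapse 'gs://'
        return '{}/{}'.format(self.bucket, name)

    def exists(self, name):
        return self.fs.exists(self._path(name))

    def get_size(self, name):
        return self.fs.info(self._path(name))['size']

    def remove_object(self, name):
        self.fs.rm(self._path(name))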

@chrisroat (Contributor, Author) replied:
I think what you are asking for is a module with a class that effectively renames the methods to provide the same interface as the s3 module?

Do you think there would be any desire to instead move the other way and bring the s3 code toward the common fsspec framework using s3fs, which is used in projects like pandas? You'd get many other filesystems (e.g. Azure) for free as well; a sketch follows the links below.

https://pandas.pydata.org/docs/whatsnew/v1.1.0.html#fsspec-now-used-for-filesystem-handling
pandas-dev/pandas#33452
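
To illustrate the unification argument -- a minimal sketch, assuming s3fs and gcsfs are installed; the URLs are placeholders:

import fsspec

# One code path resolves any supported URL to the matching backend:
# LocalFileSystem, S3FileSystem (s3fs), or GCSFileSystem (gcsfs).
for url in ['file:///tmp/example', 's3://some-bucket/key', 'gs://some-bucket/key']:
    fs, path = fsspec.core.url_to_fs(url)
    print(type(fs).__name__, fs.exists(path))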

@guzman-raphael (Collaborator) replied:
Good point! Thanks for raising this, @chrisroat. I am not familiar with it, but let me look into it. At a cursory glance, it seems to satisfy what we are looking for; we just need to confirm any performance impact (if there is one at all -- though pandas' adoption is a good sign).

@@ -45,6 +45,8 @@
                     Path(__file__).resolve().parent,
                     'external-legacy-data', 's3').iterdir()][0]

GCS_CONN_INFO = dict(token=environ.get('GOOGLE_APPLICATION_CREDENTIALS'))
@guzman-raphael (Collaborator) commented on Jan 4, 2021:
To properly include your tests in our suite, we will need a self-contained way to run a mini GCS service within our LNX-docker-compose.yml and local-docker-compose.yml. That way each clone of our repo can independently verify that the tests pass for this new store. Have a look at minio (our small, local S3 service for testing), fakeservices.datajoint.io (our reverse proxy that exposes a central host), and the env variables defined in app (the service where the tests are run against datajoint) for reference. A quick look turns up an example such as fake-gcs-server. Something of this sort would be needed.
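
A hedged sketch of how a test could target such an emulator; the port and bucket name are placeholders, and support for the emulator-host variable should be verified against the installed gcsfs version:

import os

# Point the storage stack at a local fake-gcs-server rather than real GCS;
# fake-gcs-server documents the STORAGE_EMULATOR_HOST convention.
os.environ['STORAGE_EMULATOR_HOST'] = 'http://localhost:4443'

import gcsfs

fs = gcsfs.GCSFileSystem(token='anon')  # no real credentials for the fake service
print(fs.ls('test-bucket'))             # assumes the emulator was seeded with this bucket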

@chrisroat (Contributor, Author) commented:
Removing in favor of #946

@chrisroat closed this on Aug 23, 2021
Successfully merging this pull request may close these issues.

External storage support for Google Cloud Storage