File sources for gdrive, gcs, onedata, basespace #12500

Merged
merged 11 commits into from
Dec 21, 2021

nuwang
Member

@nuwang nuwang commented Sep 19, 2021

This PR partly addresses: #11784

It includes file sources for Google Drive, Google Cloud Storage, and Onedata. While these work to varying degrees, they all have issues that hamper their use.

Google Drive

Google Drive in particular requires an OAuth2 flow to obtain user credentials. Without client-side support for authorizing access, these credentials are extremely tedious for end users to obtain.

Google Cloud Storage

Google Cloud Storage also requires OAuth2 credentials, but presumably these could be configured by admins rather than end users. It should also work without credentials in Google Cloud. However, there's a library clash with CloudVE/cloudbridge#275, which will be sorted out once a new version is released.

Onedata

Onedata does not work because of: onedata/fs-onedatafs#5
Should it be resolved, the code should in principle work, but I've squashed it to one commit so we can drop it easily.

BaseSpace

See separate comment below.

Summary

Unfortunately, there are concerns about the practical usability of all of these providers. We should probably add some kind of OAuth2 authorization mechanism to smooth the flow. That would roughly entail:

  1. Registering an app with each provider, e.g. a Google app that requests full gdrive permissions. Requesting full permissions also triggers some kind of verification process from Google.
  2. Displaying a link for each file source that would take the user to the relevant OAuth2 authorization page.
  3. Adding an endpoint to Galaxy to receive the OAuth2 callback.
  4. Upon receiving the callback, saving the OAuth2 auth token and refresh token in the user's profile and, eventually, a vault.
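
As a rough illustration of steps 2 to 4, the sketch below builds the authorization link and the token-exchange request body for Google using only the standard library. The endpoint URLs, the Drive scope, and the `access_type`/`prompt` parameters are Google's published OAuth2 values; the client id, the callback URL, and the function names are hypothetical and are not part of this PR.

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlparse

# Google's published OAuth2 endpoints; other providers would use their own.
GOOGLE_AUTH_URI = "https://accounts.google.com/o/oauth2/v2/auth"
GOOGLE_TOKEN_URI = "https://oauth2.googleapis.com/token"
DRIVE_SCOPE = "https://www.googleapis.com/auth/drive"  # full Drive access


def build_authorization_url(client_id, redirect_uri, state):
    """Step 2: the link that would be shown next to the file source."""
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "response_type": "code",
        "scope": DRIVE_SCOPE,
        "access_type": "offline",  # ask Google for a refresh token as well
        "prompt": "consent",
        "state": state,  # echoed back on the callback to guard against CSRF
    }
    return f"{GOOGLE_AUTH_URI}?{urlencode(params)}"


def build_token_request(code, client_id, client_secret, redirect_uri):
    """Steps 3-4: form body the callback endpoint would POST to GOOGLE_TOKEN_URI."""
    return {
        "grant_type": "authorization_code",
        "code": code,
        "client_id": client_id,
        "client_secret": client_secret,
        "redirect_uri": redirect_uri,
    }


state = secrets.token_urlsafe(16)
url = build_authorization_url(
    "example-client-id",  # hypothetical client id
    "https://galaxy.example.org/oauth2/callback",  # hypothetical callback URL
    state,
)
```

The JSON response to the token request would carry the access token and, because of `access_type=offline`, a refresh token; where Galaxy would store them is left open here, as in the list above.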

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these contributions under Galaxy's current license.
  • I agree to allow the Galaxy committers to license these and all my past contributions to the core galaxy codebase under the MIT license. If this condition is an issue, uncheck and just let us know why with an e-mail to [email protected].

@nuwang nuwang requested a review from bgruening September 19, 2021 18:52
@nuwang nuwang changed the title [WIP] Papercut file sources for gdrive, gcs, onedata etc. [WIP] File sources for gdrive, gcs, onedata etc. Sep 19, 2021
@nuwang nuwang changed the title [WIP] File sources for gdrive, gcs, onedata etc. [WIP] File sources for gdrive, gcs, onedata, basespace Sep 20, 2021
@nuwang
Member Author

nuwang commented Sep 20, 2021

Have also added a file source for BaseSpace. However, the Python basespacesdk itself was last released in 2018, and the pyfilesystem2 plugin needed some patches for things to work:
emedgene/fs_basespace#9

In addition, the basespacesdk appears to ignore the provided credentials and expects them in '/home/.basespace/default.cfg'. That too requires a patch: basespace/basespace-python-sdk#32

Once configured though, it does work fine. One issue is that basespace returns a numeric id for the filename, and the actual filename as 'alias': https://github.com/emedgene/fs_basespace/blob/bb560444bdec3cbcbc5d66344f877bd823a4bfb6/fs_basespace/_basespacefs.py#L132

This makes it difficult to use in practice. @jmchilton Any thoughts on this?
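
To make the id/alias problem concrete, here is a minimal sketch of one possible way a listing could be labelled. It assumes each entry is a plain dict with `name` (the numeric id) and `alias` (the real filename); that shape, the sample data, and the `label_entries` helper are hypothetical and are not the actual fs_basespace or Galaxy data model.

```python
from collections import Counter


def label_entries(entries):
    """Return (label, id) pairs, preferring the alias as the visible label."""
    alias_counts = Counter(e.get("alias") or e["name"] for e in entries)
    labelled = []
    for entry in entries:
        alias = entry.get("alias") or entry["name"]
        # Aliases are not guaranteed unique, so repeated ones get the id appended.
        label = alias if alias_counts[alias] == 1 else f"{alias} ({entry['name']})"
        labelled.append((label, entry["name"]))
    return labelled


# Hypothetical listing: 'name' is the numeric id, 'alias' the real filename.
listing = [
    {"name": "1001", "alias": "sample_R1.fastq.gz"},
    {"name": "1002", "alias": "sample_R2.fastq.gz"},
    {"name": "1003", "alias": "sample_R1.fastq.gz"},
]
labels = label_entries(listing)
```

The id is kept alongside the label because, in this sketch, it is still what would be used to open the file.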

@nuwang nuwang force-pushed the papercut_file_sources branch from f7d58ce to 1e9d5ce Compare September 20, 2021 03:06
@nuwang nuwang force-pushed the papercut_file_sources branch from 1e9d5ce to 6450d41 Compare December 8, 2021 05:22
@nuwang nuwang changed the title [WIP] File sources for gdrive, gcs, onedata, basespace File sources for gdrive, gcs, onedata, basespace Dec 10, 2021
@github-actions github-actions bot added this to the 22.01 milestone Dec 10, 2021
@nuwang
Member Author

nuwang commented Dec 10, 2021

This should now be good to go. The cloudbridge update has been made and upstream libraries fixed. Onedata continues to have the installation and configuration issue, but this can in principle be done, even if the process is a bit tedious, so I'd propose we merge this anyway in the hope that those issues are resolved upstream at a later date.

Adding oauth support to streamline the process of obtaining tokens etc. would also be a future enhancement.

@nuwang nuwang force-pushed the papercut_file_sources branch from 0f48e65 to 3ee1042 Compare December 11, 2021 04:06
@nuwang nuwang requested a review from afgane December 14, 2021 04:57
@luke-c-sargent
Member

luke-c-sargent commented Dec 14, 2021

@nuwang good stuff! I tested out GCS/GDrive access, and had some success and some issues; preliminary thoughts:

Google Drive:

  • binary files work a-ok
  • I tried a Google doc download, and it gave:
    "Only files with binary content can be downloaded. Use Export with Google Docs files."

GCS:

  • created an open access bucket, able to browse it in Galaxy, but get "Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS" when trying to download any files
    • to resolve this error when testing the anvil pyfilesystem2 plugin i prepend an export of the above env var to sh run.sh, but that does not seem to work in this case

I will keep testing; please let me know if there is something I am missing as it's entirely possible i've just borked up my configuration

@nuwang
Member Author

nuwang commented Dec 15, 2021

@luke-c-sargent Thanks for testing. The Google Drive issue is more of an upstream issue, I guess: https://github.com/rkhwaja/fs.googledrivefs/blob/9568eb49a9084d4d9751eb27fb558ee77558cbfc/fs/googledrivefs/googledrivefs.py#L427
As that Stack Overflow post you linked suggests, docs need the export_media method, but I don't know whether that breaks normal files. I guess we should file a bug there?

Regarding GCS, this will work if you plug in your OIDC credentials. However, there is an anonymous access mode for public buckets. Let me see whether I can activate that, but we'll need an extra setting like anonymous: true in file_sources_conf.

@luke-c-sargent
Member

> I guess we should file a bug there?

👍 issue filed. TL;DR: they could detect mimetype and change the download mechanism for google native items easily enough; the only considerations i can see are a) what type to export stuff to (e.g. sheets as csv or excel?) and b) what to do about files >10MB that export_media() disallows.
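
The mimetype check described above could look roughly like the sketch below. Google-native items do carry MIME types starting with `application/vnd.google-apps.`, and `get_media` / `export_media` are the Drive API client methods for binary download and export; the `plan_download` helper and the particular export targets chosen here are assumptions for illustration, not what fs.googledrivefs does.

```python
GOOGLE_NATIVE_PREFIX = "application/vnd.google-apps."

# Hypothetical export choices; question (a) above (csv vs. excel etc.) is open.
EXPORT_TARGETS = {
    "application/vnd.google-apps.document": "application/pdf",
    "application/vnd.google-apps.spreadsheet": "text/csv",
    "application/vnd.google-apps.presentation": "application/pdf",
}


def plan_download(mime_type):
    """Return (method, export_mime_type) for a Drive item's MIME type."""
    if not mime_type.startswith(GOOGLE_NATIVE_PREFIX):
        # Ordinary binary content can be fetched directly.
        return ("get_media", None)
    target = EXPORT_TARGETS.get(mime_type)
    if target is None:
        # e.g. folders or native types with no export mapping in this sketch
        raise ValueError(f"no export target configured for {mime_type}")
    # Question (b) above, exports over the 10MB limit, is not handled here.
    return ("export_media", target)
```

For example, `plan_download("image/png")` selects a plain download, while a Google Sheets item is routed to an export as CSV.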

> Regarding GCS, this will work if you plug in your OIDC credentials

ahh i only used parameters from test/unit/files/gcsfs_file_sources_conf.yml, and as such did not have any tokens in the mix. works now!

one additional note - when using an old token the refresh fails with:

 Problem listing file source path FileSourcePath(file_source=<galaxy.files.sources.googledrive.GoogleDriveFilesSource object at 0x120b968e0>, path='/')
<...>
google.auth.exceptions.RefreshError: The credentials do not contain the necessary fields need to refresh the access token. You must specify refresh_token, token_uri, client_id, and client_secret.

you mentioned that this is preliminary work that requires a better OIDC solution ...is the refresh token a placeholder for future functionality or have i misconfig'd something again?

@nuwang nuwang force-pushed the papercut_file_sources branch from 3ee1042 to 4183fee Compare December 19, 2021 17:49
@nuwang
Member Author

nuwang commented Dec 19, 2021

@luke-c-sargent Thanks for filing the issue, looks like it's already been fixed since!

Have added back client_id, client_secret and token_uri, so that should now be specifiable along with the refresh token.

Also, I've added an anonymous: true option for GCS public buckets, so credentials will no longer be required.
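
As a sketch of how these options interact, the function below decides which credential mode a file source configuration implies and reports which refresh-related fields are absent. The four refresh field names come from the RefreshError message quoted earlier and `anonymous` from this thread; the `token` key, the mode names, and `credential_plan` itself are hypothetical and do not reflect Galaxy's actual implementation.

```python
# Fields google-auth needs in order to refresh an expired access token.
REFRESH_FIELDS = ("refresh_token", "token_uri", "client_id", "client_secret")


def credential_plan(conf):
    """Return (mode, missing_refresh_fields) for a file source config dict."""
    if conf.get("anonymous"):
        # Public buckets: no credentials needed at all.
        return ("anonymous", [])
    if not conf.get("token"):  # 'token' is an assumed key name
        # Fall back to whatever the environment provides.
        return ("application_default", [])
    missing = [name for name in REFRESH_FIELDS if not conf.get(name)]
    return ("oauth2", missing)
```

A configuration with only an access token would come back as `("oauth2", [...])` with all four fields listed as missing, which corresponds to the situation in which the refresh failed above.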

@nuwang nuwang force-pushed the papercut_file_sources branch 2 times, most recently from 1afb02b to 0f46edc Compare December 20, 2021 05:37
@nuwang nuwang force-pushed the papercut_file_sources branch 2 times, most recently from 760e90c to 2188b26 Compare December 20, 2021 06:06
@nuwang
Member Author

nuwang commented Dec 20, 2021

@mvdbeek Would you be able to give a quick once over on this? In particular, whether the regeneration of requirements looks ok?

pyproject.toml Outdated
@@ -116,3 +116,4 @@ testfixtures = "*"
tuspy = "*"
twill = "*"
watchdog = "*"
fs-gcsfs = "*"
Member

@mvdbeek mvdbeek Dec 20, 2021

That should be a conditional requirement and not appear in pyproject.toml

Member Author

How do I specify it as a conditional requirement only for tests?

Member

I'd maybe do it the other way round and skip the test if the dependency can't be imported.

Member Author

So are you suggesting not running the test by default? Since it's accessing a public bucket, the test can be run successfully, provided fs-gcsfs is available.

Member

Yes, ideally unit tests shouldn't contact external services. We can run release tests with the conditional dependencies installed.

Member Author

BTW, fs-gcsfs appears to have been correctly included in dev-requirements, but not pinned requirements. Isn't that ok?

Member

Though we already do that for some other tests, I guess that is just a personal preference. If you want to run this one then this is ok and we do need to add it

Member Author

@nuwang nuwang Dec 20, 2021

I've rolled back the always on gcsfs test in favour of skipping the test if the dependency is unavailable.
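
A minimal sketch of skipping a test when the optional dependency is unavailable, written with the standard library's `unittest` so it runs anywhere. The test class and its body are illustrative only and are not the test added in this PR; `fs_gcsfs` is the import name of the fs-gcsfs package.

```python
import importlib.util
import unittest

# True only when the optional fs-gcsfs package can be imported.
HAS_GCSFS = importlib.util.find_spec("fs_gcsfs") is not None


@unittest.skipUnless(HAS_GCSFS, "fs-gcsfs is not installed")
class TestGcsFileSource(unittest.TestCase):  # hypothetical test class
    def test_plugin_importable(self):
        import fs_gcsfs

        # The package exposes the GCSFS filesystem class.
        self.assertTrue(hasattr(fs_gcsfs, "GCSFS"))
```

In a pytest-based suite, `pytest.importorskip("fs_gcsfs")` achieves the same effect in one line.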

)


@skip_if_no_basespace_access_token
Member

The added unit tests all use the same test structure. Can you parameterize them?

Member Author

@nuwang nuwang Dec 20, 2021

Yes, I think they can be. It's something that's generally the case for many file source tests. Maybe a refactoring run that should be done outside of this PR?

Member

Can you at least push _configured_file_sources into the helper? That is literally the same in all those tests.

Member Author

Have refactored the tests and helpers. Can you take a look?

@nuwang nuwang force-pushed the papercut_file_sources branch from 2188b26 to 2be4b40 Compare December 20, 2021 19:18
@nuwang nuwang force-pushed the papercut_file_sources branch from 2be4b40 to 0924628 Compare December 20, 2021 19:26
Member

@mvdbeek mvdbeek left a comment

Thanks a lot, I really appreciate the test simplification

@mvdbeek mvdbeek added the highlight (Included in user-facing release notes at the top), kind/enhancement, and kind/feature labels Dec 21, 2021
@mvdbeek mvdbeek merged commit 9c92c1b into galaxyproject:dev Dec 21, 2021
@nuwang nuwang deleted the papercut_file_sources branch December 21, 2021 15:33
@astrovsky01 astrovsky01 mentioned this pull request Feb 7, 2022
40 tasks