
[catalystproject-latam, unam] Enable object storage #4214

Closed
2 tasks done
Tracked by #226
jnywong opened this issue Jun 12, 2024 · 20 comments · Fixed by 2i2c-org/docs#236
@jnywong (Member) commented Jun 12, 2024

Context

The unam community will be participating in the upcoming CAMDA competition in mid-July. It would be great to set them up for success and to capture this as a demo of how we can enable bioscience workflows for a Global South community.

Proposal

They recently had a scratch bucket set up, but they would really benefit from a LEAP-style method of data transfer, with a persistent bucket for "inputting" data into the hub.

This relates to the workflow proposed in #4213

Updates and actions

  • Add the community champion to the google group so they can add their community members
  • Reuse any relevant info from LEAP documentation to guide the community on writing to this bucket from outside the hub
@sgibson91 sgibson91 self-assigned this Jun 13, 2024
@sgibson91 (Member)

I will pick this up today

@sgibson91 (Member) commented Jun 13, 2024

@jnywong Following the instructions in the docs, I have created a Google Group that grants permission to write to a new persistent bucket from outside the hub: https://groups.google.com/u/1/a/2i2c.org/g/persistent-unam-writers

I have added you as an owner, and I ask that you also add the hub community champion to the group (as a Group Owner) so that they can add community members (as Group Members) themselves, without bottlenecking on 2i2c staff to do the work.
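
For reference, granting that group write access to a bucket comes down to a single bucket-level IAM binding. The command below is only a sketch: the bucket name is a placeholder and the exact role 2i2c assigns may differ.

# Sketch: grant the Google Group write access to the persistent bucket.
# <persistent-bucket-name> is a placeholder; the role actually used may differ.
gcloud storage buckets add-iam-policy-binding gs://<persistent-bucket-name> \
  --member="group:persistent-unam-writers@2i2c.org" \
  --role="roles/storage.objectAdmin"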

@sgibson91 (Member) commented Jun 13, 2024

There is now a persistent bucket set up for the UNAM community.

[Screenshot 2024-06-13 at 14:20:31]

I think the remaining to-dos are:

  • Add the community champion to the google group so they can add their community members
  • Reuse any relevant info from LEAP documentation to guide the community on writing to this bucket from outside the hub

@jnywong (Member Author) commented Jun 13, 2024

Fab! I'll pick up those tasks. Thank you Sarah ☺️

@sgibson91 (Member) commented Jun 13, 2024

I'm going to unassign myself and remove this from the engineering board - but feel free to pull me back in if something isn't working

@jnywong (Member Author) commented Jun 24, 2024

Issue with gcloud web app auth

Context

I am reproducing the steps in the LEAP documentation, specifically the section Uploading large original data from an HPC system (no browser access on the system available).

I have verified that the method in the preceding section, Upload medium sized original data from your local machine, works, so I can confirm that the bucket is public and that I can write to it from my local machine.

I think the issue becomes apparent if you look closely at the first command:

gcloud auth application-default login --scopes=https://www.googleapis.com/auth/devstorage.read_write,https://www.googleapis.com/auth/iam.test --no-browser

The scopes include iam.test, so I suspect there are specific IAM roles that need to be enabled for this to work.

Error message

gcloud storage ls $SCRATCH_BUCKET
ERROR: (gcloud.storage.ls) HTTPError 403: [email protected] does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist). This command is authenticated as [email protected] which is the active account specified by the [core/account] property.
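
Some diagnostics that may help narrow this down (a sketch; inspecting the bucket's IAM policy requires storage.buckets.getIamPolicy, which my account may not hold):

# Which account is gcloud using for these requests?
gcloud auth list
gcloud config get-value account

# Is that account (or a group containing it) in the bucket's IAM policy?
gcloud storage buckets get-iam-policy $SCRATCH_BUCKET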

@jnywong jnywong self-assigned this Jun 24, 2024
@consideRatio (Member)

I'm back at work tomorrow, but I think the crux lies in the assumptions.

  • JupyterHub users on their user servers have credentials that can be used to work against the bucket, and temporary credentials can be extracted from there to a local computer for use for up to an hour (a rough sketch of this is included after this list). The LEAP docs say this:

    For medium sized datasets, that can be uploaded within an hour, you can use a temporary access token generated on the JupyterHub to upload data to the cloud.

  • The "upload large files" category is probably the one where this "temporary" part becomes a problem, because the upload takes longer than an hour. This strategy relies on something out of the ordinary with regards to cloud permissions: personal Google account permissions set up against the cloud account.

    Due to this, for this procedure to work, you need to manually add permissions to your personal Google Cloud account ahead of time.
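
A rough illustration of the temporary-token route mentioned in the first bullet; this is a sketch rather than the exact commands from the LEAP docs, and the bucket name and --access-token-file usage are assumptions:

# On the JupyterHub server, where cloud credentials are already available,
# print a short-lived (roughly one hour) access token:
gcloud auth application-default print-access-token > token.txt

# Copy token.txt to the local/HPC machine and use it before it expires
# (bucket name is a placeholder):
gcloud storage cp ./large-file.nc gs://<persistent-bucket-name>/ --access-token-file=token.txt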

@consideRatio (Member)

I'm not sure if we have guidance and documentation on if, when, and how to provide users with personal cloud account access, and what's the minimal access required to work against specific buckets, if that is the goal.

@jnywong (Member Author) commented Jun 24, 2024

The "upload large files" etc category is probably one where this "temporary" part becomes a problem and the upload takes longer than an hour. This strategy relies on something out of the ordinary with regards to cloud permissions -- personal google account permissions setup against the cloud account.

Yes, I am focused on this scenario!

I'm not sure if we have guidance and documentation on if, when, and how to provide users with personal cloud account access, and what's the minimal access required to work against specific buckets, if that is the goal.

Does anyone know what was done for the LEAP hub?

@jnywong (Member Author) commented Jun 24, 2024

Hey @jbusecke ! I was wondering if you know anything about the above comment?

I'm working on generalising the wonderful LEAP documentation you have written on uploading large datasets from an HPC system, so that other 2i2c communities can use it, and I ran into this issue while reproducing your workflow.

Here is a preview of what I have written so far, with my issue arising in this section.

@jbusecke (Contributor)

Hey @jnywong, as a matter of fact one of our users also ran into this exact issue.

These docs look great btw! Once these are done I should def link them in our docs!

Unfortunately I have zero clue how this works behind the scenes 😩. I think @yuvipanda helped set some of this up originally; maybe he has better feedback.

@jnywong (Member Author) commented Jun 24, 2024

Interesting to hear that you have seen this issue elsewhere too! Thank you for your insights 🙏

@jbusecke (Contributor)

We are tracking that issue here internally
cc @suryadheeshjith

@jnywong (Member Author) commented Jun 25, 2024

Thanks for that @jbusecke !

Can you tell us whether this affects @suryadheeshjith only, or is this affecting every hub user?

@consideRatio (Member)

Can you tell us whether this affects @suryadheeshjith only, or is this affecting every hub user?

The temporary token approach works for all JupyterHub users, but the "large files" or "more than 60 minutes of access" approach only works for those with direct access to a cloud account. Currently, 2i2c engineers (as defined by a GCP group; @jnywong, I just added you there!) and Julius have such access to LEAP's GCP project.

I think this kind of access has only been provided ad hoc by 2i2c to individual power users like Julius, and we haven't come up with a way to do it sustainably for all users.

Related
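
For context, such an ad-hoc grant is essentially a single IAM binding; this is a hypothetical example (placeholder project ID, user email, and role), not the exact grant used for LEAP:

# Hypothetical per-user grant; the real role and scope used for LEAP may differ.
gcloud projects add-iam-policy-binding <leap-project-id> \
  --member="user:<power-user-email>" \
  --role="roles/storage.objectAdmin"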

@jnywong (Member Author) commented Jun 25, 2024

I suspected as much! I can confirm the workflow is working as expected. @consideRatio, thank you for adding me to the GCP group; this makes it easier for me to investigate these issues myself in future.

I will document that this is not a supported feature for everyone due to the 💰 💰 💰 involved.

@consideRatio (Member) commented Jun 25, 2024

not a supported feature for everyone due to the 💰 💰 💰 involved.

The key thing isn't money for cloud resources; it's just that we don't have a way of doing this that scales well with regards to security and maintenance burden (so it would currently cost a lot of 2i2c and community users' time to handle this).

The crux is that our "jupyterhub users" aren't associated with "cloud provider users", so we aren't able to grant direct cloud permissions to individual JupyterHub users and are forced to create individual cloud accounts when needed. In practice, from the perspective of the cloud provider, when JupyterHub users access the object storage, the access is made from the same cloud provider user/identity, and we haven't been giving out direct persistent access to that.

@jnywong (Member Author) commented Jun 25, 2024

Thanks for the explanation Erik, I will capture this insight for our Product board.

Regarding cloud permissions, do you happen to know what the Google Group @sgibson91 mentioned above is for then? Here are the relevant infrastructure docs.

@jbusecke (Contributor)

The temporary token approach works for all JupyterHub users, but the "large files" or "more than 60 minutes of access" approach only works for those with direct access to a cloud account. Currently, 2i2c engineers (as defined by a GCP group; @jnywong, I just added you there!) and Julius have such access to LEAP's GCP project.

We actually have a Google Group that I manage (and to which I added Surya). Is that method defunct?

@jnywong (Member Author) commented Jun 25, 2024

@jbusecke it doesn't seem to be working as expected. #4281 will investigate 👍
