Providing more metadata than access_url with GetAccessURLResponse? #239

Closed
briandoconnor opened this issue Mar 13, 2019 · 6 comments

Comments

@briandoconnor
Contributor

This is a followup to our conversation on 20190311 and PR #236

See #236 (comment)

GetAccessURLResponse just has access_url in the response. Do we want to expand that to include more metadata in the future so DRS implementors can have the option to pass back more info than just a URL? This would be needed if DRS implementors want to use DRS to hand back enough information to "get the bytes" for a variety of protocols beyond URLs/signed URLs.

@sarpera
Contributor

sarpera commented Mar 13, 2019

Somewhat related: adding an authorization_metadata property to the items in the access_methods array was also discussed in this issue.

So far:
GetAccessURLResponse via /objects/<object_id>/access/<access_id> is meant to return a ready-to-use byte-access URL (e.g. a signed URL). The client is assumed to have already chosen an access method from the access_methods array and passed its access_id to the path above (optionally with an auth token) to get a URL to the bytes.

Suggestion:
We can extend this to cases like Aspera, Globus, etc., where access_url must be used together with additional information such as temporary credentials. I think this still doesn't defeat the original purpose: separating object metadata from an access URL that requires additional steps (signing, providing temporary credentials, etc.) to produce.

Quoting from the discussion in PR #236:

In the previous response that came back we returned a URL along with system_metadata, user_metadata, and authorization_metadata. In the NIH Data Commons we needed the ability to provide more context about the URL to allow a user to make a request.

Some examples of how we can extend the original PR based on the use cases:

Assumptions:

  • Both /objects/<id> and /objects/<object_id>/access/<access_id> are optionally behind some auth mechanism. We can easily add BasicAuth, OAuth2, etc. in securityDefinitions so that all standard auth mechanisms are allowed in DRS and calls to DRS are not ambiguous.
  • Providing access_url may be costly, so it should not have to be provided each time object metadata is requested
  • Object metadata might be publicly accessible, but the access URL might require some auth

Use Case 1

  • DRS contains an object that is stored in a private-access storage system
  • Client doesn't have direct access to storage system; action required to get access (sign, get credentials/access token etc)
  • A URL alone is enough to provide byte-access

Step 1: Client requests the object metadata via /objects/<object_id>:

# ... 
"access_methods": [
    {
        "type": "s3",
        "access_id": "s3-1",
        "region": "us-east-1"     
    }, # ...
]

Step 2: Client requests an access url via /objects/<object_id>/access/<access_id> providing auth token in the request header:

{
    "access_url": "<byte-access-url>"
}
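A minimal sketch of the Step 2 request in Python, for illustration only. The base URL, the function name `build_access_request`, and the use of a `Bearer` Authorization header are assumptions; the thread only says an auth token is passed in the request header. The function builds the request without sending it.

```python
from typing import Optional
from urllib.parse import quote
from urllib.request import Request


def build_access_request(base_url: str, object_id: str, access_id: str,
                         token: Optional[str] = None) -> Request:
    """Build (but do not send) a GET for /objects/<object_id>/access/<access_id>."""
    # Percent-encode the IDs so they are safe to place in a URL path.
    url = "{}/objects/{}/access/{}".format(
        base_url.rstrip("/"), quote(object_id, safe=""), quote(access_id, safe="")
    )
    # Assumption: the token is sent as a Bearer token; a real server may differ.
    headers = {"Authorization": "Bearer " + token} if token else {}
    return Request(url, headers=headers, method="GET")
```

The response body is then expected to carry the access_url shown above.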

Use Case 2

  • DRS contains an object that is stored in a private-access storage system
  • Client has direct access to the storage system
  • A URL alone is enough to provide byte-access

Step 1: Client requests the object metadata via /objects/<object_id> and consumes an access_url as is:

# ... 
"access_methods": [
    {
        "type": "s3",
        "access_url": "s3://foo/bar.bam",
        "region": "us-east-1"     
    }, # ...
]
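As a rough illustration of "consumes an access_url as is": the client has to interpret the storage-native URL itself. The sketch below splits an s3:// URL into bucket and key using only the Python standard library; `split_s3_url` is a hypothetical helper name, not part of DRS.

```python
from typing import Tuple
from urllib.parse import urlsplit


def split_s3_url(access_url: str) -> Tuple[str, str]:
    """Split an s3://bucket/key URL into (bucket, key)."""
    parts = urlsplit(access_url)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError("not an s3:// URL: " + access_url)
    # netloc holds the bucket name; the path (minus its leading slash) is the key.
    return parts.netloc, parts.path.lstrip("/")
```

The client would then hand the bucket and key to whatever S3 tooling it already has credentials for.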

Use Case 3

  • DRS contains an object that is stored in a private-access storage system
  • Client doesn't have direct access to storage system; action required to get access (sign, get credentials/access token etc)
  • A URL is NOT enough to provide byte-access

Step 1: Client requests the object metadata via /objects/<object_id>:

# ... 
"access_methods": [
    {
        "type": "aspera",
        "access_id": "aspera-1"
    }, # ...
]

Step 2: Client requests access url and additional information via /objects/<object_id>/access/<access_id> providing auth token in the request header:

{
    "access_url": "<byte-access-url>",
    "access_credentials": {
        "foo": "bar"
    }
}

We can provide a structured vocabulary for the response in the schema.

Use Case 4

  • DRS contains an object that is stored in a public-access storage system
  • Client can access the object from any access_url from the metadata

Step 1: Client requests the object metadata via /objects/<object_id> and consumes an access_url as is:

# ... 
"access_methods": [
    {
        "type": "s3",
        "access_url": "s3://foo/bar.bam",
        "region": "us-east-1"     
    }, # ...
]

How does that sound? Are these use cases accurate?

@dglazer
Member

dglazer commented Mar 14, 2019

Nice breakdown @sarpera. A few thoughts:

  • I don't understand the difference between use case 2 and 4. Maybe one of them was meant to be an http: URL? Or maybe one of them implies some out-of-band auth knowledge?
  • I think there's one more use case (which may be the intent of your use case 4), where the client is expected to have out-of-band auth knowledge (e.g. "I will always use this pet service account to fetch data from this GCP bucket"). As far as the DRS protocol is concerned it looks like a public access_url; the specific DRS server documentation would include more details.
  • I think #236 (Issue #213: Access methods and access path for an object) fully addresses use cases 1 and 2/4, as long as our documentation is clear that for some types (e.g. type: s3), the access_url is a storage-provider-native URL that clients are expected to know how to handle (not an HTTP GETtable URL).
  • I think use case 3 captures @briandoconnor's specific request well, and can be addressed in a simple PR that adds auth info to the response to /objects/<object_id>/access/<access_id>. We can hash out the specific syntax there -- my first thought is to keep it very generic, and just have 0 or more optional name:value access_headers that the client should include when doing an HTTP GET of the access_url.
  • Nit: if we want to include this breakdown in future documentation or slides, I'd probably reorder it so the simpler one-step use cases (2/4) come before the more complicated two-step use cases (1 and 3).
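To make the generic access_headers idea concrete, here is a hedged Python sketch. It assumes the access response is a JSON object with an access_url and an optional access_headers map of name:value pairs; that shape is only the proposal floated above, not a settled syntax, and `request_for_bytes` is a hypothetical name.

```python
from urllib.request import Request


def request_for_bytes(access_response: dict) -> Request:
    """Build (but do not send) the HTTP GET for the bytes."""
    url = access_response["access_url"]
    # Zero or more optional headers; a missing field means "no extra headers".
    headers = dict(access_response.get("access_headers") or {})
    return Request(url, headers=headers, method="GET")
```

A client that keeps the headers opaque like this does not need to understand any particular auth scheme.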

@sarpera
Contributor

sarpera commented Mar 14, 2019

@dglazer

I don't understand the difference between use case 2 and 4. Maybe one of them was meant to be an http: URL? Or maybe one of them implies some out-of-band auth knowledge?

Use cases 2 and 4 are not different at all when it comes to using DRS to access bytes. I added scenarios for each use case to point out that #213 can cover them, and to show how we can extend it to cover use case 3.

I think there's one more use case (which may be the intent of your use case 4), where the client is expected to have out-of-band auth knowledge (e.g. "I will always use this pet service account to fetch data from this GCP bucket"). As far as the DRS protocol is concerned it looks like a public access_url; the specific DRS server documentation would include more details.

There is no auth for use case 4; the scenario is "DRS contains an object that is stored in a public-access storage system". I guess what you mentioned aligns better with use case 2, where the client has direct access to the underlying storage (out-of-band auth knowledge) and needs nothing from DRS beyond the URL to access the private/controlled-access data, e.g. direct bucket access to S3. There might be more use cases, as you implied; adding them would help us see the bigger picture.

I think #236 fully addresses use cases 1 and 2/4, as long as our documentation is clear that for some types (e.g. type: s3), the access_url is a storage-provider-native URL that clients are expected to know how to handle (not an HTTP GETtable URL).

Yeah, exactly. Hope more drivers can also confirm this against their use cases.

I think use case 3 captures @briandoconnor's specific request well, and can be addressed in a simple PR that adds auth info to the response to /objects/<object_id>/access/<access_id>. We can hash out the specific syntax there -- my first thought is to keep it very generic, and just have 0 or more optional name:value access_headers that the client should include when doing an HTTP GET of the access_url.

Yes, and I agree, though I would prefer eventually having a documented syntax, even if it means we end up with different responses depending on the access method. More input from drivers who have this use case (Aspera, Globus, etc.) will help.

Nit: if we want to include this breakdown in future documentation or slides, I'd probably reorder it so the simpler one-step use cases (2/4) come before the more complicated two-step use cases (1 and 3).

+1

@dglazer
Member

dglazer commented Mar 16, 2019

Got it -- I agree, your use case 2 is the "out-of-band auth" scenario I was thinking of. Echoing them back to make sure we're in sync:

If that's right, I like it, because the rules for building a client are straightforward:

  • use /objects/<id> to get the list of access_methods
  • pick the one you want to use (using whatever "scoring" function makes sense in your environment)
  • if it has an access_url you know how to use (either as is, or because you have special oob auth knowledge), fetch the object directly
  • if not, use /objects/<object_id>/access/<access_id> to get more details, and use the returned access_url and headers (if any) to fetch the object
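The rules above could be sketched roughly as follows in Python. This is an illustration under assumptions, not a reference client: `plan_fetch`, `can_use_directly`, and `score` are hypothetical names, and the access method dicts follow the shapes used in the use cases earlier in this thread.

```python
from typing import Callable, List, Tuple


def plan_fetch(access_methods: List[dict],
               can_use_directly: Callable[[dict], bool],
               score: Callable[[dict], float] = lambda m: 0) -> Tuple[str, str]:
    """Return ("direct", access_url) or ("resolve", access_id) for the chosen method."""
    if not access_methods:
        raise ValueError("object has no access_methods")
    # Pick the preferred method using the caller's scoring function.
    method = max(access_methods, key=score)
    url = method.get("access_url")
    if url is not None and can_use_directly(method):
        # Usable as is, or via out-of-band auth knowledge.
        return ("direct", url)
    access_id = method.get("access_id")
    if access_id is not None:
        # Needs a call to /objects/<object_id>/access/<access_id> for details.
        return ("resolve", access_id)
    raise ValueError("chosen access method has no usable access_url or access_id")
```

For example, a client that only knows how to read s3:// URLs would pass a `can_use_directly` that checks `method["type"] == "s3"`.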

@sarpera
Contributor

sarpera commented Mar 18, 2019

@dglazer yes, your summary nails it! Let's see whether #236 and this (#239) can cover the use cases for the rest of the drivers.

@rishidev rishidev removed the Due: Mar label Mar 25, 2019
dglazer added a commit that referenced this issue Apr 19, 2019
* small doc cleanup

Added a few method descriptions (and shortened method summaries).

* add AccessURL object (to allow headers)

* reconcile AccessURL with #243

* switch headers from map to array

* small documentation cleanup

* flesh out the auth section
@dglazer
Member

dglazer commented Apr 19, 2019

PR #248 is now merged.

@dglazer dglazer closed this as completed Apr 19, 2019