Providing more metadata than access_url with GetAccessURLResponse? #239

briandoconnor · 2019-03-13T01:22:34Z

This is a followup to our conversation on 20190311 and PR #236

GetAccessURLResponse just has access_url in the response. Do we want to expand that to include more metadata in the future so DRS implementors can have the option to pass back more info than just a URL? This would be needed if DRS implementors want to use DRS to hand back enough information to "get the bytes" for a variety of protocols beyond URLs/signed URLs.


sarpera · 2019-03-13T17:51:00Z

Somewhat related: adding an authorization_metadata property to the items in the access_methods array was also discussed in this issue.

So far:
GetAccessURLResponse via /objects/<object_id>/access/<access_id> is meant to return a ready-to-use byte-access URL (e.g. a signed URL). The client is assumed to have already chosen an access method from the access_methods array and passed its access_id to the above path (optionally with an auth token) to get a URL to the bytes.

Suggestion:
We can extend this for cases like Aspera, Globus, etc., where access_url needs to be used along with some additional information such as temporary credentials. I think this still doesn't defeat the original purpose: separating object metadata from an access URL that requires additional steps (signing, providing temporary credentials, etc.) to provide.

Quoting from the discussion in PR #236:

In the previous response that came back we returned a URL along with system_metadata, user_metadata, and authorization_metadata. In the NIH Data Commons we needed the ability to provide more context about the URL to allow a user to make a request.

Some examples of how we can extend the original PR based on the use cases:

Assumptions:

Both /objects/<id> and /objects/<object_id>/access/<access_id> are optionally behind some auth mechanism. We can easily add BasicAuth, OAuth2, etc. in securityDefinitions so that all standard auth mechanisms are allowed in DRS and calls to DRS are not ambiguous.
Providing an access_url may be costly, so it need not be provided each time an object is requested
Object metadata might be publicly accessible, but the access URL might require some auth

Use Case 1

DRS contains an object that is stored in a private-access storage system
Client doesn't have direct access to storage system; action required to get access (sign, get credentials/access token etc)
A URL alone is enough to provide byte-access

Step 1: Client requests the object metadata via /objects/<object_id>:

# ... 
"access_methods": [
    {
        "type": "s3",
        "access_id": "s3-1",
        "region": "us-east-1"     
    }, # ...
]

Step 2: Client requests an access url via /objects/<object_id>/access/<access_id> providing auth token in the request header:

{
    "access_url": "<byte-access-url>"
}

Use Case 2

DRS contains an object that is stored in a private-access storage system
Client has direct access to the storage system
A URL alone is enough to provide byte-access

Step 1: Client requests the object metadata via /objects/<object_id> and consumes an access_url as is:

# ... 
"access_methods": [
    {
        "type": "s3",
        "access_url": "s3://foo/bar.bam",
        "region": "us-east-1"     
    }, # ...
]

Use Case 3

DRS contains an object that is stored in a private-access storage system
Client doesn't have direct access to storage system; action required to get access (sign, get credentials/access token etc)
A URL is NOT enough to provide byte-access

Step 1: Client requests the object metadata via /objects/<object_id>:

# ... 
"access_methods": [
    {
        "type": "aspera",
        "access_id": "aspera-1"
    }, # ...
]

Step 2: Client requests access url and additional information via /objects/<object_id>/access/<access_id> providing auth token in the request header:

{
    "access_url": "<byte-access-url>",
    "access_credentials": {
        "foo": "bar"
    }
}

We can provide a structured vocabulary for the response in the schema.
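
As a rough illustration of what a structured response for use case 3 could look like on the client side, here is a minimal Python sketch. The class name AccessResponse and the helper parse_access_response are hypothetical and not part of any DRS schema; the field names access_url and access_credentials come from the example above, and the contents of access_credentials are left as free-form name/value pairs because no vocabulary has been agreed on.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class AccessResponse:
    """Hypothetical client-side view of the access response from use case 3."""
    access_url: str
    # Free-form name/value pairs; the real vocabulary is still undecided.
    access_credentials: Dict[str, str] = field(default_factory=dict)


def parse_access_response(payload: Dict[str, Any]) -> AccessResponse:
    """Validate a decoded JSON payload and wrap it in an AccessResponse."""
    if "access_url" not in payload:
        raise ValueError("access response is missing access_url")
    # access_credentials is optional: use cases 1 and 3 share one response shape.
    credentials = payload.get("access_credentials") or {}
    return AccessResponse(
        access_url=payload["access_url"],
        access_credentials=dict(credentials),
    )
```

With this shape, a use case 1 response (URL only) and a use case 3 response (URL plus credentials) parse through the same code path; the credentials mapping is simply empty in the first case.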

Use Case 4

DRS contains an object that is stored in a public-access storage system
Client can access the object from any access_url from the metadata

Step 1: Client requests the object metadata via /objects/<object_id> and consumes an access_url as is:

# ... 
"access_methods": [
    {
        "type": "s3",
        "access_url": "s3://foo/bar.bam",
        "region": "us-east-1"     
    }, # ...
]

How does this sound? Are those use cases accurate?

dglazer · 2019-03-14T13:01:58Z

Nice breakdown @sarpera. A few thoughts:

- I don't understand the difference between use case 2 and 4. Maybe one of them was meant to be an http: URL? Or maybe one of them implies some out-of-band auth knowledge?
- I think there's one more use case (which may be the intent of your use case 4), where the client is expected to have out-of-band auth knowledge (e.g. "I will always use this pet service account to fetch data from this GCP bucket"). As far as the DRS protocol is concerned it looks like a public access_url; the specific DRS server documentation would include more details.
- I think #236 fully addresses use cases 1 and 2/4, as long as our documentation is clear that for some types (e.g. type: s3), the access_url is a storage-provider-native URL that clients are expected to know how to handle (not an HTTP GETtable URL).
- I think use case 3 captures @briandoconnor's specific request well, and can be addressed in a simple PR that adds auth info to the response to /objects/<object_id>/access/<access_id>. We can hash out the specific syntax there -- my first thought is to keep it very generic, and just have 0 or more optional name:value access_headers that the client should include when doing an HTTP GET of the access_url.
- Nit: if we want to include this breakdown in future documentation or slides, I'd probably reorder it so the simpler one-step use cases (2/4) come before the more complicated two-step use cases (1 and 3).
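
To make the generic access_headers idea concrete, here is a minimal Python sketch of a client preparing the HTTP GET. The function name build_fetch_request is hypothetical, and it assumes the headers arrive as a plain name-to-value mapping, which is only the first-thought syntax floated above and not a settled schema. It builds the request without sending it.

```python
import urllib.request
from typing import Mapping, Optional


def build_fetch_request(
    access_url: str,
    access_headers: Optional[Mapping[str, str]] = None,
) -> urllib.request.Request:
    """Prepare an HTTP GET for access_url, attaching any server-supplied headers.

    access_headers is assumed to be a name -> value mapping returned next to
    access_url; zero headers is valid and yields a plain GET.
    """
    return urllib.request.Request(
        access_url,
        headers=dict(access_headers or {}),
        method="GET",
    )
```

A caller would pass the result to urllib.request.urlopen to actually fetch the bytes. Keeping the headers opaque means the client never needs to understand what they mean, only to echo them back.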

sarpera · 2019-03-14T16:47:02Z

@dglazer

I don't understand the difference between use case 2 and 4. Maybe one of them was meant to be an http: URL? Or maybe one of them implies some out-of-band auth knowledge?

Use cases 2 & 4 are not different at all when it comes to using DRS to access bytes. I added the scenarios for each use case to point out that #213 can cover them, and how we can extend it to cover use case 3.

I think there's one more use case (which may be the intent of your use case 4), where the client is expected to have out-of-band auth knowledge (e.g. "I will always use this pet service account to fetch data from this GCP bucket"). As far as the DRS protocol is concerned it looks like a public access_url; the specific DRS server documentation would include more details.

There is no auth in use case 4; the scenario is "DRS contains an object that is stored in a public-access storage system". I guess what you mentioned aligns better with use case 2, where the client has direct access to the underlying storage (out-of-band auth knowledge) and does not need more info from DRS, besides the URL, to access the private/controlled-access data; e.g. direct bucket access to s3. There might be more use cases, as you implied; adding them would help us see the bigger picture.

I think #236 fully addresses use cases 1 and 2/4, as long as our documentation is clear that for some types (e.g. type: s3), the access_url is a storage-provider-native URL that clients are expected to know how to handle (not an HTTP GETtable URL).

Yeah, exactly. Hope more drivers can also confirm this against their use cases.

I think use case 3 captures @briandoconnor 's specific request well, and can be addressed in a simple PR that adds auth info to the response to /objects/<object_id>/access/<access_id>. We can hash out the specific syntax there -- my first thought is to keep it very generic, and just have 0 or more optional name:value access_headers that the client should include when doing an HTTP GET of the access_url.

Yes, I agree, though I would prefer eventually having a documented syntax, even if it means we end up with different responses depending on the access method. More input from drivers who have this use case (Aspera, Globus, etc.) will help.

Nit: if we want to include this breakdown in future documentation or slides, I'd probably reorder it so the simpler one-step use cases (2/4) come before the more complicated two-step use cases (1 and 3).

+1

dglazer · 2019-03-16T01:11:50Z

Got it -- I agree, your use case 2 is the "out-of-band auth" scenario I was thinking of. Echoing them back to make sure we're in sync:

use cases already covered by PR #236
- use case 4: public content; fetch using a one-step access_url
- use case 2: private content; fetch using a one-step access_url plus oob auth knowledge
- use case 1: private content; fetch using a two-step access_url
use cases covered by this issue (#239)
- use case 3: private content; fetch using a two-step access_url plus DRS-returned auth knowledge

If that's right, I like it, because the rules for building a client are straightforward:

use /objects/<id> to get the list of access_methods
pick the one you want to use (using whatever "scoring" function makes sense in your environment)
if it has an access_url you know how to use (either as is, or because you have special oob auth knowledge), fetch the object directly
if not, use /objects/<object_id>/access/<access_id> to get more details, and use the returned access_url and headers (if any) to fetch the object
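
The client rules above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: resolve_access and its callable parameters (score, can_use_directly, get_access) are hypothetical names, the access_methods list is assumed to have already been fetched from /objects/<id>, and get_access stands in for a call to /objects/<object_id>/access/<access_id> that returns a dict with access_url and optional headers.

```python
from typing import Any, Callable, Dict, List, Tuple

AccessMethod = Dict[str, Any]


def resolve_access(
    access_methods: List[AccessMethod],
    score: Callable[[AccessMethod], float],
    can_use_directly: Callable[[AccessMethod], bool],
    get_access: Callable[[str], Dict[str, Any]],
) -> Tuple[str, Any]:
    """Return (access_url, headers) for the preferred access method.

    headers is None when the URL from the object metadata is used as is.
    """
    if not access_methods:
        raise ValueError("object has no access_methods")

    # Pick the method this environment prefers.
    chosen = max(access_methods, key=score)

    # One-step case: the metadata already carries a URL we know how to use,
    # either as is or thanks to out-of-band auth knowledge.
    if "access_url" in chosen and can_use_directly(chosen):
        return chosen["access_url"], None

    # Two-step case: ask the server for details via the access_id.
    if "access_id" not in chosen:
        raise ValueError("chosen method has neither a usable access_url nor an access_id")
    details = get_access(chosen["access_id"])
    return details["access_url"], details.get("headers")
```

Passing the scoring function and the fetch call in as parameters keeps the decision logic free of any network code, so the same function covers use cases 1 through 4.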

sarpera · 2019-03-18T14:59:20Z

@dglazer yes, your summary nails it! Let's see whether #236 and this (#239) can cover the use cases for the rest of the drivers.

* small doc cleanup: added a few method descriptions (and shortened method summaries)
* add AccessURL object (to allow headers)
* reconcile AccessURL with #243
* switch headers from map to array
* small documentation cleanup
* flesh out the auth section

dglazer · 2019-04-19T04:15:07Z

PR #248 is now merged.

briandoconnor added Status: Help Wanted Type: Enhancement Type: Schema Priority: Medium Project: DRS Due: Apr Due: Mar labels Mar 13, 2019

sarpera mentioned this issue Mar 13, 2019

Issue #213: Access methods and access path for an object #236

Merged

dglazer mentioned this issue Mar 20, 2019

Update schema based on Aaron and Phillis' auth doc #229

Closed

rishidev removed the Due: Mar label Mar 25, 2019

briandoconnor added the Priority: Critical label Mar 25, 2019

briandoconnor assigned dglazer Apr 1, 2019

dglazer mentioned this issue Apr 3, 2019

allow optional headers on access_url (resolves issue #239) #248

Merged

dglazer closed this as completed Apr 19, 2019
