Object metadata and download methods #213
Comments
@bwalsh @dglazer we have written up, in this issue, the proposal we discussed yesterday at the hackathon. @susheel, it would be great to know more about the FTP side, because we are not encountering that in our case. @philloooo, could you double check that this captures what we wrote on the whiteboard? |
@mattions - all above looks reasonable. Can you provide a link to document where |
Hi @bwalsh, I was aiming to tag @briandoconnor, who was present at the hackathon. My mistake. The token will be dealt with by the DRS server, so they may offer a way to log in, and it's a third-party service. As for links, my guess is that they will be available when the DRS is up. |
@mattions Just to make sure it's clear - as per this proposal, the use of signed URLs will be codified as the manner in which one obtains the bytes for an object? |
Curious why the plan is to name the header Also, I wonder if it might be too limiting to say that the |
@brucehoff My assumption was that it was to discern between "I have authZ to access the endpoint" (i.e. That ties in w/ your second question as I picture the use of the |
@geoffjentry, yes. @brucehoff, it could very well be named To clarify, we also pointed out in the hackathon that @geoffjentry does the paragraph above align with your thoughts? |
just fixing my tagging, wants @philloooo to chime in as well :) |
|
@mattions I think so. I can totally see the case where access to the API and access to the file aren't the same and wouldn't be against setting up a structure for that (but perhaps w/ simplifying defaults, e.g. if just NB I'm not advocating for that structure, so if others disagree so be it. But at least it doesn't bug me :) |
@mattions when you said "yes" was that about the presigned URL question or the authz part? If the presigned URLs, I didn't feel like we had reached consensus that presigned URLs should be required. If I'm wrong, ignore me :) |
Can I suggest for the response to:
That we do something like this:
Specifically, I would leave the aliases array out of this response and make that a different endpoint. I'm not sure there's a use case for it here and it adds a potentially expensive join. |
@zflamig Part of the idea here was to specifically separate out the download URL from the metadata, as per discussions at the face to face earlier this week |
That's only the metadata for the URI @geoffjentry to help in resolving it. The original proposal includes this but it isn't clean or extensible. If a new URI appears we would have to amend the spec to add a new class for it... |
For the other cases, if any, where the returned @zflamig the structure of @geoffjentry I think @zflamig referred to @zflamig about the aliases part, do you support the use cases where aliases can be used for querying? E.g |
I second on having a |
Thanks for starting this discussion @sarpera . A few comments, some of which were partially raised by others, but I'm not sure I understand where they landed:
a) the first method ( b) the second method ( c) As an optional shortcut, if the DRS implementor chooses, they can include an URL_TO_BYTES uri in individual records returned by the first method, which lets callers skip a step. Implementors would be likely to do that for public content, and for content that can be accessed with a previously-obtained auth token (e.g. gs: and s3:), and unlikely to do that for signed URLs. But we don't have to bake that knowledge into the protocol or the calling code -- the rule for callers is "always use the URI to get bytes; if you don't get one from the first method, ask for one using the second method".
Minor points -- these can wait until we get the big picture settled:
|
@sarpera Ahhhh. I see what you were trying to accomplish now. I am okay with your method if we break it up and use the protocol as the high level container. So like
This way we can make it easier for future additions... if you have a new protocol you are free to require whatever metadata you want. @dglazer re: #7 on your list: the use case there is for data bundles, so you can have one DRS url that dereferences to a group of DRS urls. For example, during the cohort creation process a GUID/DRS entry may be minted to represent the data that the user selected. |
@zflamig thanks, it gets better with every iteration. +1 for your suggestion.
I haven't created a PR for exactly the reason you pointed out. And you're totally right, the details won't be clear without the code changes. I just wanted to discuss this with the group some more, as the suggestions are really helping so far.
Yes, the callers would pass an auth token when calling the URL_TO_BYTES. Nothing really changes for those who don't have this use-case. Ideally, the token would be the same across all the endpoints on a DRS server, even more ideally all DRS implementations would have the same means of obtaining it. There is a related issue.
I guess what I was trying to say was along these lines:
1- If some of the urls/access-points are ready to consume for getting bytes (public urls), there is no need to call another endpoint. 2- For cloud URIs, the URIs returned from 3- Having 4- To sum up the above, I really do agree with you that:
and
sounds more reasonable. That way, the implementors MAY choose not to include the URIs in
+1 for this. There could be separate auth policies for both cases, and implementors MAY choose to have the same policy for both if it fits them. I guess this is not against your point.
+1 for standard HTTP auth token or a bearer token. One of the possible cases would be that your authz privileges might have been revoked or expired, but you already obtained a token. In that case, the flow is solid anyway, you would get a 403 on
During the hackathon we were made aware that some implementors will be using DRS as a data-registry service, without necessarily providing direct access to bytes, but instead pointing to another DRS server (via a DRS url) where the data can be accessed. Sort of like "linked DRS"es or DRS of DRSes. @susheel could you please provide those use cases?
Oh no, it was just a placeholder since I didn't want to type every other metadata fields, hence the
Same here, open for any ideas. The most difficult part of building anything is to name it =) |
From the GA4GH call today: @sarpera and @susheel discussed what happens with a DRS entry for an object when you call GET id/download. The consensus seems to be that it is OK not to implement it. It seems we need to clarify how "/download" works for the various URI types. @dglazer proposed get bytes URI, fetch bytes ID... for passing to the download method, so the ID -> URI. @sarpera is going to take this ticket and make a PR that explores what he and David talked about today, sorting out the URL and the download in a single PR. @dglazer, @rishidev and I will work out a process to bring this and other PRs up to vote via the active drivers. |
Need to clarify
|
Wrapping up so far
Good to see that there is a general consensus on the main idea: accessing "object metadata" and "bytes to object" may be separate calls to DRS for the cases where an "action" must be performed to get access to bytes, e.g. passing an auth token to sign a URL, generate a url-to-bytes with credentials, etc. By doing so, I guess we all agree that the schema should remain generic, flexible and understandable, while providing programmatically parsable responses for clients with different needs and use-cases. With that in mind, I tried to combine our ideas together and here's the outcome:
Object metadata:
Object bytes:
Examples
GET Response: {
"object": {
"id": "foo",
"name": "bar.bam",
"size": "1234",
"urls": {
"s3": [
{
"uri": "s3://foo/bar.bam",
"region": "us-east-1",
"<access-method-id>": "s3-1"
}
],
"gs": [
{
"uri": "gs://<foo>/<bar>.bam",
"region": "us-west1",
"<access-method-id>": "gs-1"
}
],
"ftp": [
{
"uri": "ftp://foo.com/bar.bam"
}
],
"drs": [
{
"uri": "drs://foo.com/objects/<id>"
}
]
    }
  }
}
GET Response: { "uri": "<uri-to-bytes>" }
Let's break apart the suggested "urls": {
"<access-method>": [
{
"uri": "<string>",
"<access-method-specific-attr>*": "<value>"
}
]
} where
Questions Why have a key-value paired access methods? E.g: in the cloud scenarios, there is a huge added value of having the region information for a URI whereas for other Why the value of In the example below, "urls": {
"s3": [
{
"uri": "s3://foo/bar.bam",
"region": "us-east-1",
"<access-method-id>": "s3-us"
},
{
"uri": "s3://baz/bar.bam",
"region": "eu-central-1",
"<access-method-id>": "s3-eu"
}
]
} What if all the Example: "urls": {
"ftp": [
{
"uri": "ftp://foo/bar.bam"
}
]
} How to mint an Help needed with naming things! How to name Naming the suggested new path, currently "download" TODOs Will make a PR with the suggested changes reflected on the swagger schema. |
On the naming side I propose:
so the URL will look like: So something like this:
will have the following allowed calls:
with If we have consensus, we can move this next with the PR |
[updated to rename
Incorporating my proposals, your example would look like: "access_methods": [
  {
    "s3": { # there's no uri, meaning the caller has to call /access before fetching bytes
      "region": "us-east-1",
      "access_id": "s3-us"
    }
  },
  {
    "s3": {
      "region": "eu-central-1",
      "access_id": "s3-eu"
    }
  },
  {
    "gs": { # callers can either fetch bytes directly from the access_uri or use the access_id to get a direct uri
      "region": "us-west1",
      "access_uri": "gs://foo/bar.bam",
      "access_id": "gs-1"
    }
  },
  {
    "ftp": { # there's no access_id, meaning the caller has to fetch the bytes directly
      "access_uri": "ftp://foo.com/bar.bam"
    }
  }
] |
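The three cases in the example above (only an `access_id`, both, only an `access_uri`) could be handled by a caller roughly as follows. This is a minimal sketch, not part of the proposal: the function name `resolve_access_uri`, the `prefer_direct` flag, and the `fetch_access` callback (standing in for a call to the `/access` path) are all assumptions made for illustration.

```python
from typing import Callable, Optional


def resolve_access_uri(
    method: dict,
    fetch_access: Callable[[str], str],
    prefer_direct: bool = True,
) -> str:
    """Return a URI to fetch bytes from, given the attributes of one access method.

    `fetch_access` is a hypothetical callback that exchanges an access_id
    for a URI (e.g. by calling the /access path on the DRS server).
    """
    access_uri: Optional[str] = method.get("access_uri")
    access_id: Optional[str] = method.get("access_id")

    # Use the direct URI when it is the only option, or when the caller prefers it.
    if access_uri and (prefer_direct or not access_id):
        return access_uri
    # Otherwise exchange the access_id for a URI.
    if access_id:
        return fetch_access(access_id)
    raise ValueError("access method has neither access_uri nor access_id")
```

The caller never needs to know whether the server hands out signed URLs; it only checks which of the two fields is present.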
Thanks @dglazer, I'm also happy to see that the PR I'm setting up was actually pretty close to your input.
I agree. Already used
I agree. In my local git changes I already set it to be
After diving into the yaml code, I also figured having an array will make things a bit easier to describe via swagger v2.0. Also, opens future possibilities to make the items in the array more searchable in a uniform way. I'm swaying away from
I agree. Already used
As you mentioned, there are use cases (perhaps rare) that you might have both GET Retrieve a URL to access bytes of an (controlled-access) Object Response: {
"url": "string"
} GET Response: {
"object": {
"id*": "string",
"name*": "string",
# ... rest of the properties
"access_level*": "open | controlled",
"access_methods*": [
{
"uri*": "string",
"access_id*": "string",
"cloud_metadata": {
"region*": "string",
"provider*": "string"
},
"protocol*": "string"
}
]
}
} Some notes:
|
Great discussion. It is rewarding to see this work move forward. Consumers who need to answer 'what data is closest to me?' or 'where should I execute this pipeline?' can leverage the provider/region properties to answer these and other auction use cases. Long term, I'm convinced these use cases will lower cost. How can we encourage implementors to populate these fields? As I understand the schema, an implementor could conform to the spec and never populate them. i.e. Should 'cloud_metadata' be mandatory for certain access methods [s3, gs,...]? Also, I'm assuming the checksums object is part of '# ... rest of the properties' ? Forgive me if I've missed it, but are there formal dependencies to a (probably separate) Search Service to query this data? BTW, I always thought that urls was misnamed, nice to see it morph to access_methods. |
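As a rough illustration of the "what data is closest to me?" use case, a client could rank access methods by how well their `cloud_metadata` matches its own location. This is only a sketch under assumptions: the function name `pick_closest` and the provider value `"aws"` used in the usage note are made up, and the field names follow the response sketch earlier in the thread.

```python
from typing import Optional


def pick_closest(access_methods: list, provider: str, region: str) -> Optional[dict]:
    """Pick the access method whose cloud_metadata best matches the caller's location.

    Scoring (an assumption for this sketch): same provider and region beats
    same provider only, which beats everything else.
    """
    def score(method: dict) -> int:
        meta = method.get("cloud_metadata") or {}
        same_provider = meta.get("provider") == provider
        same_region = meta.get("region") == region
        if same_provider and same_region:
            return 2
        if same_provider:
            return 1
        return 0

    if not access_methods:
        return None
    # max() keeps the first entry on ties, so server ordering acts as a tie-breaker.
    return max(access_methods, key=score)
```

For example, `pick_closest(methods, "aws", "eu-central-1")` would prefer an entry whose `cloud_metadata` names that provider and region. An entry without `cloud_metadata` always scores 0, which is one reason populating these fields matters.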
Thanks for all the feedback!
Exactly, this is what drove us initially to use a defined language to describe the access methods.
One way to go about it is to have a strongly typed schema model per access method and enforce the required params thereof. The schema models for those individual access methods should evolve organically as we iterate over more use-cases in time.
Yes. Wanted to skip details since there is an issue for that already. @zflamig @philloooo @susheel The strong case, at least for us, to push for
It would help greatly if you could provide a complete use case for non-cloud private data URLs.
Agreed. Since we seem to go for strongly typed access methods, we could make cases for cloud-related methods to provide this information. Seven Bridges and @zflamig also have use-cases for explicit Schema
"access_methods": <AccessMethod>[] where an {
"<x>": <xAccessMethod>
} DRS defines the values for Example {
"s3": {
"uri*": "string",
"access_id": "string",
"region*": "string",
"provider": "string",
"allowed_regions": [
"string"
]
}
} {
"drs": {
"uri*": "string"
}
} {
"ftp": {
"uri*": "string"
}
} Example response of an object: {
"object": {
"id": "1234",
"name*": "bar.bam",
# ... rest of the properties
"access_methods": [
{
"s3": {
"uri": "s3://foo/bar.bam",
"access_id": "s3-1",
"region": "us-west-1",
"provider": "s3.amazonaws.com",
"allowed_regions": [
"us-west-1", "us-east-1"
]
}
},
{
"gs": {
"uri": "gs://foobaz/bar.bam",
"access_id": "gs-1",
"region": "us-central1",
"allowed_regions": [
"us-central1"
]
}
},
{
"ftp": {
"uri": "ftp://foo.org/baz/bar.bam"
}
},
{
"drs": {
"uri": "drs://some-other-drs.org/9876"
}
}
    ]
  }
} The initial idea of having strongly typed access methods seems to be favoured by most of us. Please note that with this approach, in order to add a new access method we'd need to define it and update the schema. It is of course expected to have more properties in said Thoughts? |
What is Why What is the difference between For the I am thinking about the client's decision matrix. I think we want a tuple of For the case where the DRS server can hand out a signed URL, it should indicate that (by filling in access_id?) For the private access case, the client can have a table of credentials that correspond to various combinations of |
@sarpera For the {
"method*": "ftp",
  "provider*": "string",
"uri*": "string",
"region": "string",
"contact": "string"
} Fully realised example: {
"method": "ftp",
  "provider": "ftp.ebi.ac.uk",
"uri": "ftp://anonymous:[email protected]/dataset/path/file",
"region": "null",
"contact": "Contact John Doe <[email protected]>"
},
{
"method": "ftp",
  "provider": "ftp-private.ebi.ac.uk",
"uri": "ftp://ftp-private.ebi.ac.uk/dataset/path/file",
"region": "ebi-hh",
"contact": "Contact Jane Doe <[email protected]>"
} I'm guessing it would be the same for |
@sarpera Do you see the possibility of having a {
"method*": "local",
  "provider*": "string",
"uri*": "string",
"region": "string",
"contact": "string"
} Fully realised example: {
"method": "local",
  "provider": "ebi-cluster.ebi.ac.uk",
"uri": "file://public/path/file",
"region": "ebi-hx",
"contact": "Contact John Doe <[email protected]>"
},
{
"method": "local",
  "provider": "ebi-yoda.ebi.ac.uk",
"uri": "file://private/path/file",
"region": "ebi-hh",
"contact": "Contact Jane Doe <[email protected]>"
} |
Buckets can be set to incur outbound (egress) costs outside of its region in the same cloud provider. This provides more information in the decision making process to pick the most appropriate mirror of the file. Perhaps not the best name for the attribute though.
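To make the mirror-selection idea concrete, here is a small sketch against the keyed example response earlier in the thread (entries like `{"s3": {...}}`). It assumes, purely for illustration, that `allowed_regions` lists the regions from which access is expected to avoid the extra egress cost, and that an entry without the attribute has no such restriction. The helper names `unwrap` and `usable_from` are hypothetical.

```python
def unwrap(entry: dict) -> tuple:
    """Split a single-key entry like {"s3": {...}} into (method_name, attributes)."""
    if len(entry) != 1:
        raise ValueError("expected exactly one access-method key per entry")
    (name, attrs), = entry.items()
    return name, attrs


def usable_from(access_methods: list, client_region: str) -> list:
    """Return (method_name, uri) pairs usable from client_region.

    Entries without allowed_regions are kept; entries with it are kept
    only when the client's region is listed.
    """
    result = []
    for entry in access_methods:
        name, attrs = unwrap(entry)
        allowed = attrs.get("allowed_regions")
        if allowed is None or client_region in allowed:
            result.append((name, attrs["uri"]))
    return result
```

With the example response above, a client in `us-east-1` would keep the `s3`, `ftp` and `drs` entries and drop the `gs` one.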
The former allows a schema model to be defined per access method, so that method-specific attributes can be defined and enforced for consistency. Happy to discuss if the same goal can be achieved in a different way.
@dglazer also made some points about it. There may be cases where for a specific access method URI may not give any means of access e.g a file residing in a VPC and the only means of providing access to third-parties is signing a URL via Please also note that the cloud data owners may not want to (or be allowed to) expose their bucket names in the URIs, but may provide access via We can pursue some additional capabilities where; while keeping
This is a very important point and setting the individual access method attributes by aiming that goal would help us achieve that. I hope this aligns with your second question and answer I tried to provide.
Yes, exactly. Having that dedicated path
Could you please explain this a bit more with examples? Are you talking about discoverability of the available access methods based on existing client conditions? @susheel thanks for the examples.
Based on the previous schema definitions I provided, each defined access method would have its own attributes based on its needs. So region would be null for ftp cases. Similarly if the access_methods: [
{
ftp: {
"uri": "ftp://anonymous:[email protected]/dataset/path/file",
"contact": "Contact John Doe <[email protected]>"
}
},
  {
    ftp: {
"uri": "ftp://ftp-private.ebi.ac.uk/dataset/path/file",
"contact": "Contact John Doe <[email protected]>"
}
}
]
|
@sarpera For
Having a
|
@susheel making filtering easier by using So updated example would be: access_methods: [
{
ftp: {
"uri": "ftp://anonymous:[email protected]/dataset/path/file",
"contact": "Contact John Doe <[email protected]>",
"provider": "ftp.ebi.ac.uk"
}
},
  {
    ftp: {
"uri": "ftp://ftp-private.ebi.ac.uk/dataset/path/file",
"contact": "Contact John Doe <[email protected]>",
"provider": "ftp-private.ebi.ac.uk"
}
}
] I feel like
Is |
Seems like the current version of OpenAPI doesn't allow patternProperties as of 3.0.2. Unless we want to hardcode all available access methods (ftp, http, s3 etc.) in the schema and pair them with a schema model (array of objects), the above approach won't work in practice.
Going back to this approach, seems like with open api v3 this can be achieved while still enforcing certain properties per access method (region for cloud methods etc) by making use of I'd like to recap our requirements so far about the access methods before adjusting them to v3.0.
And so far our use cases for an access method are:
Required / Optional bare minimum properties are different:
Based on that, examples would look like: Open access data: "access_methods": [
{
"uri": "ftp://foo.example.com/file.name",
"method": "ftp",
"provider": "foo.example.com"
},
{
"uri": "s3://foo-open-bucket/file.name",
"method": "s3",
"provider": "s3.amazonaws.com",
"region": "us-east-1"
}
], Controlled/private access data: "access_methods": [
{
"uri": "ftp://bar.example.com/file.name",
"method": "ftp",
"provider": "foo.example.com",
"contact": "[email protected]"
},
{
"access_id": "s3-1",
"method": "s3",
"region": "us-east-1"
},
{
"uri": "drs://foo.example.com/123",
"method": "drs"
}
], Enforcing required/optional properties can be done via AccessMethods:
type: array
description: The list of access methods that can be used to access the Data Object.
minItems: 1
items:
anyOf:
- $ref: '#/components/schemas/StaticAccessMethod'
- $ref: '#/components/schemas/CloudAccessMethod'
- $ref: '#/components/schemas/ActionableStaticAccessMethod'
- $ref: '#/components/schemas/ActionableCloudAccessMethod'
discriminator:
propertyName: method
ActionableAccessMethod:
type: object
required:
- access_id
properties:
access_id:
type: string
ActionableCloudAccessMethod:
type: object
allOf:
- $ref: "#/components/schemas/ActionableAccessMethod"
- type: object
required:
- region
- method
- access_id
properties:
uri:
type: string
method:
type: string
enum:
- s3
- gs
region:
type: string
description: >-
Name of the region in the cloud service provider that the object belongs to.
example:
us-east-1
provider:
type: string
CloudAccessMethod:
type: object
required:
- uri
- region
- method
properties:
uri:
type: string
provider:
type: string
method:
type: string
enum:
- s3
- gs
region:
type: string
description: >-
Name of the region in the cloud service provider that the object belongs to.
example:
us-east-1
ActionableStaticAccessMethod:
type: object
allOf:
- $ref: "#/components/schemas/ActionableAccessMethod"
- $ref: "#/components/schemas/StaticAccessMethod"
StaticAccessMethod:
type: object
required:
- uri
- method
properties:
method:
type: string
enum:
- ftp
- sftp
- http
- https
- nfs
- globus
- aspera
- gsiftp
- local
uri:
type: string
provider:
type: string
contact:
type: string |
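The required/optional rules encoded in the four schemas above can be paraphrased in a few lines of code, which may help when checking example payloads by hand. This is only a sketch of my reading of the schema, not a replacement for OpenAPI validation; the function name `validate_access_method` and the error strings are made up.

```python
# Method names copied from the enums in the schema above.
CLOUD_METHODS = {"s3", "gs"}
STATIC_METHODS = {"ftp", "sftp", "http", "https", "nfs", "globus", "aspera", "gsiftp", "local"}


def validate_access_method(m: dict) -> list:
    """Return a list of problems; an empty list means the entry fits one of the schemas."""
    problems = []
    method = m.get("method")
    if method in CLOUD_METHODS:
        # CloudAccessMethod and ActionableCloudAccessMethod both require region.
        if "region" not in m:
            problems.append("cloud methods require region")
        # Cloud entries need a uri, or an access_id to exchange for one.
        if "uri" not in m and "access_id" not in m:
            problems.append("cloud methods require uri or access_id")
    elif method in STATIC_METHODS:
        # StaticAccessMethod (and its actionable variant) requires uri.
        if "uri" not in m:
            problems.append("static methods require uri")
    else:
        problems.append(f"unknown or missing method: {method!r}")
    return problems
```

Note that the `drs` entry from the controlled-access example would be reported as unknown here, because `drs` does not appear in either enum of the schema as written.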
@sarpera Thanks for investigating the OpenAPI spec compatibility. I agree with @tetron it would have been cleaner, but I guess we will have to live within our means! :) I thought we'd discussed (maybe not agreed) that we will be more explicit with the Controlled/private access data: "access_methods": [
{
"access_id": "drs://server.com/access/s3-1",
"method": "s3",
"region": "us-east-1"
  },
{
"access_id": "http://server.com/get-object/s3-1",
"method": "s3",
"region": "us-east-1"
}
], Which I hope will work for your use case when it is provided by the DRS service, and when it may be provided by a third-party service. P.S. If this is acceptable, why have it called |
@susheel In your example, {
"access_id": "drs://server.com/access/s3-1",
"method": "s3",
"region": "us-east-1"
} Please note that Keeping that in mind, Also it's ambiguous what token value for If with DRS of DRSes we are aiming to redirect the client to another, this is indirect but not ambiguous: {
"uri": "drs://server.com/<object_id>",
"method": "drs"
} Alternatively, we could utilise the alias property of an object, to link/mirror another DRS URLs. {
"id": 123,
"name": "foo",
"checksums": ["# list here"],
"access_methods": ["# list here"],
# rest of the props
"alias": ["drs://server.com/<object_id>"]
} |
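For the "DRS of DRSes" case, where an object's only access method points at another DRS server, a client would need to follow the link until it finds something it can fetch bytes from. The sketch below assumes the `"method": "drs"` form shown above; `follow_drs_links`, the `fetch_object` callback (standing in for resolving a `drs://` URI to an object record) and the hop limit are all hypothetical.

```python
from typing import Callable


def follow_drs_links(
    object_uri: str,
    fetch_object: Callable[[str], dict],
    max_hops: int = 5,
) -> list:
    """Follow access methods with method "drs" until non-drs methods are found.

    Returns the non-drs access methods of the first object that has any,
    or an empty list if the chain ends without one.
    """
    seen = set()
    current = object_uri
    for _ in range(max_hops):
        if current in seen:
            raise ValueError("cycle detected while following drs links")
        seen.add(current)
        methods = fetch_object(current).get("access_methods", [])
        direct = [m for m in methods if m.get("method") != "drs"]
        if direct:
            return direct
        links = [m["uri"] for m in methods if m.get("method") == "drs"]
        if not links:
            return []
        # This sketch only follows the first link; a real client might try each.
        current = links[0]
    raise ValueError("too many drs hops")
```

The cycle check and hop limit are there because nothing in the proposal so far prevents two registries from pointing at each other.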
|
As discussed in #230, updating to OpenAPI 3.0 may take longer than we'd like, and I'm eager to get the changes discussed here into a PR. So it may make sense to decouple the issues, open up a PR for this issue now doing the best we can using v2, and then revisit whenever #230 is resolved. I think that will be fine -- we can still use the I picture something like (without having tested it): AccessMethods:
type: array
description: The list of access methods that can be used to access the Data Object.
minItems: 1
items:
        $ref: '#/definitions/AccessMethod'
AccessMethod:
type: object
required:
- method
properties:
method:
type: string
enum:
- s3
- gs
- ftp
- sftp
- http
- https
- nfs
- globus
- aspera
- gsiftp
- local
access_url:
type: string
description: >-
A fully resolvable HTTP address that can be used to GET the actual object bytes.
Note that at least one of access_url and access_id must be provided.
access_id:
type: string
description: >-
An arbitrary string to be passed to the /access method to fetch an access_url
region:
type: string
description: >-
Name of the region in the cloud service provider that the object belongs to.
example:
          us-east-1
@sarpera -- wdyt? Are you up for creating a PR using OpenAPI v2 now, and confirming it's not too ugly? |
@dglazer Yes, almost. I do agree that DRS must be able to support the two-phase access mechanism. If the DRS server only provides an Either way, I agree with @sarpera we need to also iron out how AUTH tokens are specified and passed to |
@susheel , it sounds like we largely agree on the two options; good. @sarpera , I suggest you pick one (you know my vote), put it into the PR, and then we can discuss and finalize there. A few comments on the details
|
@dglazer If there isn't a use-case for only providing an Happy to commit to whichever way the community would like to proceed. |
PR is made. Kept it as simple as possible for the initial merge. Points not covered in PR:
|
Issue #213: Access methods and access path for an object
Background
Following the discussion we had at the GA4GH hackathon in January
we would like to propose to have a method to get the metadata of an object,
and then have an additional method which will provide the download of the
object.
The rationale for having two methods instead of one is the need
to sign the object using the authorisation token provider (right now this
is based on the OIDC specs), which is computationally expensive.
Moreover, with regions and providers present, a DRS client will
be able to decide which provider and which region would be best to obtain
the file from, among all the possible URIs.
The formats we propose are:
- objects/<id>/ for getting the object metadata
- objects/<id>/download for getting the object bytes
and we propose to pass the authorisation token in the Request Header
to get access to the object.
This is the flow, from a DRS client point of view:
GET /objects/<id>
GET /objects/<id>/download with Request Header X-DRS-TOKEN: <TOKEN>
The token is obtained by the client from the DRS server, and it is up to the DRS Server
implementer to decide how a user will obtain that.
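The two-step flow above can be sketched from the client side as follows. This is a minimal illustration, not a reference client: the base URL `https://drs.example.org` in the usage note and the helper names are assumptions, and the requests are only built here, not sent.

```python
from urllib.request import Request


def build_metadata_request(base_url: str, object_id: str) -> Request:
    """Step 1: request the object metadata (no token in this sketch)."""
    return Request(f"{base_url}/objects/{object_id}", method="GET")


def build_download_request(base_url: str, object_id: str, token: str) -> Request:
    """Step 2: request the download URI, passing the token in the X-DRS-TOKEN header."""
    req = Request(f"{base_url}/objects/{object_id}/download", method="GET")
    # urllib normalises the stored header name to "X-drs-token";
    # HTTP header names are case-insensitive on the wire.
    req.add_header("X-DRS-TOKEN", token)
    return req
```

A caller would pass each request to `urllib.request.urlopen`, read the metadata from the first response, and then issue a plain GET on the URI returned by the second.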
Object metadata Request
This will return the object metadata:
HTTP Request
HTTP Response
The client will be able to pick one of the cloud URIs and request
the download URI, passing the token.
Object download Request
HTTP REQUEST
HTTP Response
The return value is a URI where a GET request will give you the bytes:
a GET <URL_TO_BYTES> will start the download of the file.