Overall DRS scaling concerns [EPIC] #342

Open
briandoconnor opened this issue Jan 25, 2021 · 5 comments

Comments

@briandoconnor
Contributor

briandoconnor commented Jan 25, 2021

This is a parent epic for gathering multiple conversations/threads/tickets about how to scale DRS and what dependencies this has on things like Passport and token expiration.

Related tickets:

@ianfore

ianfore commented Feb 7, 2021

The key here is the question of whether logical structures should be handled with DRS as raised in #337. So dealing with that first.

This is explored in some detail in the SRA_IDs_and_bundling notebook.
Besides #337, it seems apparent that questions in #260 are covered.

That does not take away the need for handling lists of objects (#286, #334) and for pagination (#325), but suggests those should work simply as low-level/fundamental operations for lists. The conclusion from the notebook above, that domain-level meaning should not be placed on DRS as a base-level protocol, applies to other use cases and models.
#334 - ability to request a number of DRS ids is required
#286 - probably handled well enough if #337 is dealt with
#325 - pagination is required, both for bundles and bulk requests. Approach should be a standard one

File/Object type is required (see #319) both for bundles and simple objects.

Note that pagination into complex structures should involve proper consideration of the use cases. The idea that linear navigation through an image or a genomic data structure reflects user need should be thought through. Random access into complex nested structures, determined by some user/application need, will be common.
Illustration:
As an illustrative use case, the ImagePyramids notebook begins to explore an image display use case where panning and zooming requires a directed "paging" through DRS objects. In this case a higher (domain) level model of image pyramids drives the use case. Low-level paging can support that without the need for the lower level to become complicated by the domain model. "Pagination" through DICOM images would require a different domain model but could equally be served by the same low-level protocols for the binary objects.
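A minimal sketch of what "directed paging" could look like at the domain level, assuming a hypothetical tile index that maps (zoom level, column, row) to a DRS id. The names `tiles_for_viewport` and `drs_ids_for_viewport`, the 256-pixel tile size, and the index layout are all assumptions for illustration and are not taken from the ImagePyramids notebook or the DRS spec.

```python
from typing import Dict, List, Tuple

# Hypothetical key for one tile in an image pyramid: (zoom level, column, row).
TileKey = Tuple[int, int, int]


def tiles_for_viewport(level: int, x0: int, y0: int, width: int, height: int,
                       tile_size: int = 256) -> List[TileKey]:
    """Return the tile keys that cover a pixel viewport at one zoom level."""
    first_col = x0 // tile_size
    last_col = (x0 + width - 1) // tile_size
    first_row = y0 // tile_size
    last_row = (y0 + height - 1) // tile_size
    return [(level, col, row)
            for row in range(first_row, last_row + 1)
            for col in range(first_col, last_col + 1)]


def drs_ids_for_viewport(index: Dict[TileKey, str], level: int, x0: int, y0: int,
                         width: int, height: int, tile_size: int = 256) -> List[str]:
    """Look up the DRS ids for the tiles in view, skipping tiles not in the index."""
    keys = tiles_for_viewport(level, x0, y0, width, height, tile_size)
    return [index[key] for key in keys if key in index]
```

The point of the sketch is that the random-access logic lives entirely in the domain model (the tile index); the lower level only needs to resolve a plain list of ids.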

Further discussion:
Authors under #286 cited a wish to keep the clients simple. Two considerations seem relevant. First, the clients are higher-level applications which must deal with the specific complexities of use cases at the application level; i.e. over-simplification of clients would lose important application functionality. Second, the use of higher-level logical model standards, which include DRS ids in a context that gives them meaning, allows the clients/applications to make use of those higher-level models. The second point provides some standardization, restoring some level of simplification to the application-level client. The client can use the higher-level standards and not have to invent them.

#323 may become moot. That a bundle is "like a directory" seems reasonable, but only as an analogy. Expecting that bundles would literally be directories is too limiting. What is most important is that meaningful structure exists for any aggregation of objects. Directory structures may or may not provide that, depending on how the directories are used or maintained (usually fickle). The separation of the concern of physical-level structure from that of domain meaning is one that has proven to be useful.

@dglazer
Member

dglazer commented Mar 29, 2021

(edited 10-april to incorporate @ianfore's feedback from 31-march)
Following up on the discussion here and at last week's Cloud WS meeting (notes) -- I think we can address the vast majority of pressing issues with 3 PRs, one each for bulk data requests, logical collections of objects, and bundle clarification. First thoughts on each below -- once we agree on the framing, we can divide-and-conquer into separate PRs to nail down the details.

1) Bulk Data Requests

I believe this is the most semantically straightforward PR, and should be defined to be a purely optional optimization for clients that want to retrieve large amounts of info via DRS. It should provide more efficient ways to access existing functionality, not provide net new functionality. This PR should include:

  • request: support for requesting info on multiple objects at once, starting with (and maybe ending with) a way to pass multiple ids to GET /objects/{object_id}.
  • return: support for the (to be defined together with the CSO and the TASC team) GA4GH-wide pagination syntax, for any API calls that can return large lists of responses
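A rough client-side sketch of how the two pieces above might combine: ids are sent in batches, and each batch is read page by page until the server stops returning a continuation token. Everything here is an assumption for illustration, since neither the multi-id request syntax nor the GA4GH-wide pagination syntax is defined yet: the `fetch_page` callable, the `objects` and `next_page_token` field names, and the batch size are all hypothetical. The transport is injected so the sketch runs without a real DRS server.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

# Hypothetical page shape: {"objects": [...], "next_page_token": "<token>" or None}.
Page = Dict[str, Any]
# Hypothetical transport: takes a batch of ids and an optional page token.
FetchPage = Callable[[List[str], Optional[str]], Page]


def iter_bulk_objects(object_ids: List[str], fetch_page: FetchPage,
                      batch_size: int = 100) -> Iterator[Dict[str, Any]]:
    """Yield object info for many ids, batching requests and following pages."""
    for start in range(0, len(object_ids), batch_size):
        batch = object_ids[start:start + batch_size]
        token: Optional[str] = None
        while True:
            page = fetch_page(batch, token)
            yield from page["objects"]
            token = page.get("next_page_token")
            if not token:
                break


def make_fake_server(page_size: int = 2) -> FetchPage:
    """In-memory stand-in for a server, using the offset as the page token."""
    def fetch_page(ids: List[str], token: Optional[str]) -> Page:
        offset = int(token) if token else 0
        next_offset = offset + page_size
        return {
            "objects": [{"id": object_id} for object_id in ids[offset:next_offset]],
            "next_page_token": str(next_offset) if next_offset < len(ids) else None,
        }
    return fetch_page
```

A caller would then iterate, for example, `iter_bulk_objects(ids, make_fake_server(), batch_size=3)` and receive one entry per id, in request order, without needing to know how many round trips were made.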

2) Logical collections of objects

I believe, per the WS meeting discussion, we can resolve this one with documentation changes only along the lines of:

  • assume you have a compound object made up of a logically related set of files (e.g. DICOM, image pyramid, BAM/BAI pair)
  • define a manifest (probably in JSON) that encodes that logical structure, with pointers to all the ‘leaf’ files
  • store each leaf file as a DRS blob, and store the DRS id of each leaf in the manifest
  • store the manifest as a DRS blob, and use the DRS id of the manifest to represent the object as a whole
  • do not (typically) pass around DRS ids to the leaf files -- they’re typically only useful in the context of the manifest
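The steps above can be sketched as follows, purely as an illustration. The manifest layout (`type`, `leaves`, `role`, `drs_id`) and the example ids are hypothetical; the thread only says the manifest would "probably" be JSON and does not define a schema.

```python
import json
from typing import Dict, List


def build_manifest(kind: str, leaves: Dict[str, str]) -> str:
    """Encode a compound object as JSON: one entry per leaf, keyed by its role.

    `leaves` maps a role name (e.g. "alignments", "index") to the DRS id of
    the blob that holds that leaf file. The schema is an assumption.
    """
    manifest = {
        "type": kind,
        "leaves": [{"role": role, "drs_id": drs_id}
                   for role, drs_id in sorted(leaves.items())],
    }
    return json.dumps(manifest, indent=2, sort_keys=True)


def leaf_ids(manifest_json: str) -> List[str]:
    """Recover the leaf DRS ids from a manifest, in manifest order."""
    return [leaf["drs_id"] for leaf in json.loads(manifest_json)["leaves"]]


# Example: a BAM/BAI pair. The manifest text would itself be stored as a DRS
# blob, and the id of that blob would stand for the pair as a whole.
example = build_manifest("bam_with_index", {
    "alignments": "drs://example.org/aaa",
    "index": "drs://example.org/bbb",
})
```

Under this framing a client resolves one id (the manifest), reads the structure, and only then fetches the leaves it needs.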

As part of this PR, we may want to also:

3) Bundle clarification

I believe this one will take more discussion, since I'm not sure I understand all the use cases for bundles. I suggest we get the other two PRs rolling before trying to nail this one down, since many of the current bundle use cases will no longer be needed.

@ianfore

ianfore commented Mar 30, 2021

These sound like good buckets. Would add a couple of suggestions.

The first would be to shift the focus of the first PR to "support for requesting info on multiple objects at once". Pagination then becomes a need that derives from that.

The second is a clarification of, or changes to, type (#319) as discussed above. This fits within the scope of the second PR - collections of objects. Suggest any work towards a PR on this one should include thorough familiarization with Research Objects RO-Crate. RO-Crate is used within the Elixir Driver Project.

@dglazer
Member

dglazer commented Apr 10, 2021

Thanks @ianfore -- I incorporated both of your suggestions into my proposal above.

@briandoconnor
Contributor Author

@briandoconnor briandoconnor changed the title Overall DRS scaling concerns Overall DRS scaling concerns [EPIC] Dec 13, 2021