RFC: DSS Bundle Enumeration #101

43 changes: 43 additions & 0 deletions rfcs/text/0000-dss-bundle-enumeration.md
@@ -0,0 +1,43 @@
### DCP PR:

***Leave this blank until the RFC is approved** then the **Author(s)** must create a link between the assigned RFC number and this pull request in the format:*

`[dcp-community/rfc#](https://github.com/HumanCellAtlas/dcp-community/pull/<PR#>)`

# RFC: Bundle Enumeration of the DCP Data Store (DSS)

## Summary

This RFC proposes bundle enumeration endpoint(s) for the DSS.

## Author(s)

* [Brian Hannafious](mailto:[email protected])

## Shepherd
***Leave this blank.** This role is assigned by DCP PM to guide the **Author(s)** through the RFC process.*

*Recommended format for Shepherds:*

`[Name](mailto:[email protected])`

## Motivation

Currently, the only means of listing bundles stored in the DSS is the internal Elasticsearch (ES) metadata index, which
must be kept current with object storage. The DSS should provide bundle enumeration independently of the ES metadata index,
emphasizing consistency and scalability.

### User Stories

* As a downstream service developer, I would like to enumerate the bundle contents of the DSS so I can create my own
index.

* As a downstream service developer, I would like to check if my index contains all the bundles in the DSS.

## Detailed Design
Collaborator


I was expecting something a bit more swagger-y rather than a narrative description in Detailed Design for an API.

Member Author


@brianraymor It's not clear to me how to address your comment. Perhaps you have something in mind similar to the API endpoint descriptions found in the Deletion RFC.

However, those additions will not necessarily improve the clarity and actionability of this document, which defines a simple extension to the DSS API in language that I believe is clear to the developers who will implement it. Do you feel there are technical details missing, or that there is ambiguity that needs to be cleared up?

Collaborator


The Deletion RFC more closely meets my expectations for an RFC as a design document. Another approach is how Azure defines their REST APIs. Any reviewer or developer, not just the implementers, should be able to read this RFC and understand the DSS API in detail. This currently reads more like what we called one-pagers at Microsoft.


A new bundle enumeration endpoint, `GET /bundles`, will be introduced, taking replica and prefix parameters. These
parameters will be used to return a paginated listing of bundles directly from object storage. Pagination semantics
and all other semantics of this route will be in line with the established conventions of the DSS API.
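
As a rough illustration of how a downstream indexer might consume such a paginated listing, here is a minimal client-side sketch. The RFC only states that `GET /bundles` takes replica and prefix parameters and returns paginated results; everything else below is an assumption made for the example: the token-style pagination, the `(bundles, next_token)` page shape, the `uuid` and `version` field names, the replica value `"aws"`, and the `fetch_page` callable standing in for the HTTP request.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# Hypothetical page shape: (bundles on this page, token for the next page or None).
Bundle = Dict[str, str]
Page = Tuple[List[Bundle], Optional[str]]


def enumerate_bundles(
    fetch_page: Callable[[str, str, Optional[str]], Page],
    replica: str,
    prefix: str = "",
) -> Iterator[Bundle]:
    """Yield every bundle under `prefix` in `replica`, following pages to the end."""
    token: Optional[str] = None
    while True:
        bundles, token = fetch_page(replica, prefix, token)
        yield from bundles
        if token is None:
            # No further pages were advertised, so the listing is complete.
            return


def fake_fetch_page(replica: str, prefix: str, token: Optional[str]) -> Page:
    """In-memory stand-in for `GET /bundles` (assumed behavior), two bundles per page."""
    store = [
        {"uuid": "0a11", "version": "v1"},
        {"uuid": "0a22", "version": "v1"},
        {"uuid": "0a33", "version": "v2"},
        {"uuid": "ff44", "version": "v1"},
    ]
    matching = [b for b in store if b["uuid"].startswith(prefix)]
    start = int(token) if token is not None else 0
    page = matching[start:start + 2]
    next_start = start + 2
    next_token = str(next_start) if next_start < len(matching) else None
    return page, next_token


if __name__ == "__main__":
    for bundle in enumerate_bundles(fake_fetch_page, replica="aws", prefix="0a"):
        print(bundle["uuid"], bundle["version"])
```

Passing the page fetcher in as a callable keeps the pagination loop independent of whatever transport and response format the DSS ultimately adopts; a real client would replace `fake_fetch_page` with a function that issues the HTTP request and extracts the next-page indicator according to the DSS API conventions.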

Contributor


Without some kind of filtering, this seems like a very heavyweight endpoint to use. A couple of things come to mind:

  • Filter by update date/time. This would allow obtaining only the bundles that have changed since the last time the query was run.
  • Filter by bundle type. Well, first we have to have bundle types.

Member Author


This endpoint is intended for heavyweight use by downstream indexers.

Also, an incremental approach seems preferable: if filtering becomes desirable in the future, it can be added to the endpoint.

Contributor


Agreed that this is easy to add downstream. Assuming a full dump is what existing users want, then my speculative use is not a real use case ;-)

Member


@diekhans you raise a good point, but as @xbrianh pointed out, the use case here is an unfiltered bulk pull of all bundle IDs for external indexing. We did look for a way to use a "lightweight" database to do filtering with our established filter language (JMESPath), but didn't find any suitable database or indexing engine for such a task.

### Unresolved Questions