RFC: Bundle Types #86

Closed
90 changes: 90 additions & 0 deletions rfcs/text/0000-bundle-types.md
@@ -0,0 +1,90 @@
### DCP PR:

***Leave this blank until the RFC is approved** then the **Author(s)** must create a link between the assigned RFC number and this pull request in the format:*

`[dcp-community/rfc#](https://github.com/HumanCellAtlas/dcp-community/pull/<PR#>)`

# RFC: Bundle Types

## Summary

Formalizes data bundle types, definitions, target users/consumers, and specific use cases to improve clarity and
confidence among DCP developers when working with the DCP data model.

## Author(s)

* [Mallory Freeberg](mailto:[email protected])
* [Brian Hannafious](mailto:[email protected])
* [Hannes Schmidt](mailto:[email protected])
* [Andrey Kislyuk](mailto:[email protected])

## Shepherd
***Leave this blank.** This role is assigned by DCP PM to guide the **Author(s)** through the RFC process.*

*Recommended format for Shepherds:*

`[Name](mailto:[email protected])`

## Motivation

There is currently no documented definition of what a "bundle" is in the HCA DCP: what types of bundles exist, what
they contain, or whom they support. This lack of clarity causes confusion for DCP developers and data consumers
about how the data and metadata in the DCP are structured and organized.

### User Stories

1. As a tool developer, I would like to identify and get raw data in the DSS so that I can run my data processing and
Contributor

Just a syntax note, numbering is not incrementing.

Member Author

@TimothyTickle that is intentional. Markdown supports automatic numbered list incrementing, so that items can be re-arranged more easily. The rendered version increments the list properly.

analysis tools.

1. As a tool developer, I would like to identify and get alignment results and count matrices in order to produce
expression matrices.

1. As a computational biologist, I would like to get expression matrices for the Immune Cell Atlas project so that I can
do my research.

1. As an external user of the DSS who is developing analysis tools, I would like to browse, query, and get raw and
processed data based on some rough criteria so that I can understand what the data are and develop my tool.

1. As an Azul or Query Service developer I want to be able to distinguish between different types of bundles in order to
know which bundle to index, link, etc. or to provide bundle types to users as a search criterion.

1. As a computational biologist, I would like to know where the reference data is.

## Detailed Design

A new field, `bundle-type`, will be introduced to the DSS bundle manifest (the response to `GET bundle` in DSS). The
field value is a string formatted in a similar manner
to [DCP Media Types](https://docs.google.com/document/d/1TqihrgXjct9aDmTJO52_gE2WlpFysB1OkG9C8exmWTw) and consistent
with the syntax defined in [RFC 7231](https://tools.ietf.org/html/rfc7231#section-3.1.1.1). DSS must require the bundle
type to be specified when creating a new bundle, but should not enforce the vocabulary. DSS must allow subscriptions
and search results to be predicated on bundle type using the same string predicates available with other metadata and
manifest string fields.
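
The creation-time rule above (the bundle type is mandatory, but its vocabulary is not enforced) could be sketched roughly as follows. This is an illustration only, not DSS code: the helper name `check_bundle_type`, the dict-shaped manifest, and the decision to check nothing beyond a leading `type/subtype` are all assumptions.

```python
import re

# RFC 7231 "token" characters, used here for the type and subtype parts.
_TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
_TYPE_PREFIX = re.compile(rf"^{_TOKEN}/{_TOKEN}(\s*;.*)?$")


def check_bundle_type(manifest: dict) -> str:
    """Return the bundle type; raise ValueError if it is missing or malformed.

    Hypothetical helper. The vocabulary is deliberately NOT checked, so any
    value that starts with a plausible `type/subtype` is accepted.
    """
    value = manifest.get("bundle-type")
    if not isinstance(value, str) or not value.strip():
        raise ValueError("bundle-type is required")
    value = value.strip()
    if not _TYPE_PREFIX.match(value):
        raise ValueError(f"bundle-type is not of the form type/subtype: {value!r}")
    return value
```

Under these assumptions, a value that is absent from the registry, such as `lab/custom`, is still accepted, while a manifest without a `bundle-type` field is rejected.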

### Registry of Bundle Types

Introduce a registry of bundle types to be maintained by the Metadata team. The initial contents of the registry are:

| Bundle Type | Example `bundle-type` value |
|--------------------------|-----------------------------------------------------------|
| Project Metadata Bundle | `hca/project` |
Contributor

It's my understanding that "Project Metadata Bundle" isn't just classifying an existing type of bundle, but introducing new bundling, right?

Yes. A bundle idea that has been kicked around for a long time. Nobody has tried to justify, tightly define or implement it yet to my knowledge.

| Primary Sequence Bundle | `hca/primary-data; hca-data-type:SmartSeq2` |

Is hca-data-type here equivalent to the library_construction_method ontology defined in the submission?

Never mind, I just saw the discussion about this field here: #86 (comment)

| Primary Imaging Bundle | `hca/primary-data; hca-data-type:sptx` |
| Secondary Analysis Bundle| `hca/analysis-output; hca-pipeline:snap-atac` |
| Reference Resource Bundle| `hca/analysis-support; hca-resource-type:reference-genome`|

What was the reasoning behind naming this `bundle-type` value `hca/analysis-support`? Something like `hca/analysis-reference` seems clearer to me.

Member Author

@samanehsan your suggestion sounds good to me, adopting. Thanks!

| Expression Matrix Bundle | `hca/analysis-output; hca-resource-type:expression-matrix`|

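The example values in the registry table can be split into a type, a subtype, and parameters. The sketch below is one possible way a consumer might do that; `BundleType` and `parse_bundle_type` are hypothetical names. Note that the table separates parameter names from values with `:`, whereas RFC 7231 media-type parameters use `=`; accepting both is an assumption made here, not something the RFC text specifies.

```python
import re
from typing import Dict, NamedTuple


class BundleType(NamedTuple):
    type: str
    subtype: str
    params: Dict[str, str]


def parse_bundle_type(value: str) -> BundleType:
    """Split e.g. 'hca/primary-data; hca-data-type:SmartSeq2' into its parts."""
    head, *raw_params = [part.strip() for part in value.split(";")]
    type_, sep, subtype = head.partition("/")
    if not (type_ and sep and subtype):
        raise ValueError(f"expected type/subtype, got {head!r}")
    params: Dict[str, str] = {}
    for raw in raw_params:
        if not raw:
            continue  # tolerate a trailing semicolon
        # Assumption: accept ':' (as in the table) as well as RFC 7231's '='.
        pieces = re.split(r"\s*[:=]\s*", raw, maxsplit=1)
        if len(pieces) != 2 or not all(pieces):
            raise ValueError(f"malformed parameter: {raw!r}")
        params[pieces[0]] = pieces[1]
    return BundleType(type_, subtype, params)
```

For instance, `parse_bundle_type("hca/primary-data; hca-data-type:SmartSeq2")` yields type `hca`, subtype `primary-data`, and the single parameter `hca-data-type` with value `SmartSeq2`. A consumer such as an indexer could then branch on the subtype without string matching on the whole value.
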
### Unresolved Questions

- Bundle degrees: Useful? Who assigns them? Who uses them?
- Is there actual value in not having to update all bundles when, e.g., the project is updated?
- Versioned/unversioned references?

### Prior Art

- [Submission to bundles SOP](https://docs.google.com/document/d/1x8mYLU8ubpZtTrzkwJrft1heqReX-pJLjJfb2X0fd2w) (Apr 2018)
- [Bundle metadata schema update ticket](https://github.com/HumanCellAtlas/metadata-schema/issues/986)

### Alternatives

- Status Quo: no well-defined bundle types. Existing DCP components continue to infer bundle types via ad hoc methods
and included metadata.