RFC: Bundle Types #86

kislyuk · 2019-07-15T23:36:40Z

This RFC depends on #93 and further defines the details of how bundle types will be represented in the DSS.

Last call for community review: Sept 6

brianraymor · 2019-08-30T01:19:52Z

I'm confused by "This RFC depends on #93 and further defines the details of how bundle types will be represented in the DSS." when #93 is not approved. That's like taking a dependency in one standard on another which is in draft. It also states "Last call for community review: August 30". Was this announced on dcp slack?

kislyuk · 2019-08-30T14:48:13Z

The author of #93 has left the project, so I'm going to take over authorship of that RFC.

Announced on the #dcp channel and extended the community review by a week.

jahilton · 2019-09-03T17:19:51Z

rfcs/text/0000-bundle-types.md

+
+| Bundle Type              | Example `bundle-type` value                               |
+|--------------------------|-----------------------------------------------------------|
+| Project Metadata Bundle  | `hca/project`                                             |


It's my understanding that "Project Metadata Bundle" isn't just classifying an existing type of bundle, but introducing new bundling, right?

justincc · 2019-09-03

Yes, it is a bundle idea that has been kicked around for a long time. To my knowledge, nobody has yet tried to justify, tightly define, or implement it.

samanehsan · 2019-09-06T19:41:38Z

rfcs/text/0000-bundle-types.md

+| Primary Sequence Bundle  | `hca/primary-data; hca-data-type:SmartSeq2`               |
+| Primary Imaging Bundle   | `hca/primary-data; hca-data-type:sptx`                    |
+| Secondary Analysis Bundle| `hca/analysis-output; hca-pipeline:snap-atac`             |
+| Reference Resource Bundle| `hca/analysis-support; hca-resource-type:reference-genome`|


What was the reasoning behind naming this bundle-type value `hca/analysis-support`? Something like `hca/analysis-reference` seems clearer to me.

kislyuk · 2019-09-11

@samanehsan your suggestion sounds good to me, adopting. Thanks!

samanehsan · 2019-09-06T19:46:41Z

rfcs/text/0000-bundle-types.md

+| Bundle Type              | Example `bundle-type` value                               |
+|--------------------------|-----------------------------------------------------------|
+| Project Metadata Bundle  | `hca/project`                                             |
+| Primary Sequence Bundle  | `hca/primary-data; hca-data-type:SmartSeq2`               |


Is `hca-data-type` here equivalent to the `library_construction_method` ontology defined in the submission?

Never mind, I just saw the discussion about this field here: #86 (comment)
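
The example `bundle-type` values quoted in the threads above appear to follow a media-type-like shape: a base type such as `hca/primary-data`, optionally followed by `;`-separated `key:value` parameters. The sketch below shows one way such a value could be split in Python. The function name `parse_bundle_type` and the exact grammar are assumptions made for illustration; the RFC text quoted here does not define a parser.

```python
from typing import Dict, Tuple


def parse_bundle_type(value: str) -> Tuple[str, Dict[str, str]]:
    """Split a bundle-type string into (base_type, parameters).

    Hypothetical helper. Assumes the shape seen in the example table:
    "<base>; <key>:<value>; <key>:<value> ...".
    """
    base, *raw_params = [part.strip() for part in value.split(";")]
    params: Dict[str, str] = {}
    for raw in raw_params:
        if not raw:
            continue  # tolerate a trailing ";"
        key, sep, val = raw.partition(":")
        if not sep:
            raise ValueError(f"parameter without ':' separator: {raw!r}")
        params[key.strip()] = val.strip()
    return base, params


print(parse_bundle_type("hca/primary-data; hca-data-type:SmartSeq2"))
print(parse_bundle_type("hca/project"))
```

Keeping the base type separate from the parameters would let a consumer match on `hca/primary-data` alone and inspect `hca-data-type` only when it matters.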

NoopDog · 2019-09-11T08:22:15Z

It seems to me that Mallory raises legitimate concerns about duplicating, in the bundle type schema, information that is already in the metadata. I understand that it can be useful to denormalize/duplicate for efficiency and ease of querying, but perhaps we should instead try to make the metadata in the bundles more queryable rather than duplicating metadata and the metadata schema in this bundle type schema.

I feel we might have more flexibility for the future if we kept to a model based on the elegant process/protocol concepts that already exist in the metadata schema.

Rather than explicitly defining new bundle types, we could, for example, have a single bundle type, "process", that is an instance of a protocol defined in the metadata standard, with all the benefits of the metadata type system. Then everything becomes a DAG of processes, with the outputs of one process being the inputs of another, downstream process.

We would have to create some additional processes to reflect some DCP-specific activities, such as project creation, and to more accurately model the library prep/run relationship, for example. In the end, though, we would be able to query by process types specified by the metadata schema instead of querying by a second type system (the bundle type system) and having to map between the two systems.

The main idea here is that a "primary sequence bundle" is the output of a sequencing process (an instance of a sequencing protocol).

A "secondary analysis bundle" is the output of a secondary analysis process (instance of a secondary analysis protocol with a specific subtype "Optimus Prime Version x.xx").

For example, if you have a process defined (roughly) as:

PROCESS BUNDLE

process_instance_fqid
protocol_fqid
input_files[] (references, if any)
output_files[] (the actual files)
process_parameters{}
biomaterial_metadata (donor, collection, handling, prep)

This is similar to what we have now and what is proposed, but it avoids creating a bundle type schema and lets one subscribe to notifications on protocol_fqid (or modify it to include the protocol name for humans).

This also allows arbitrary "data groups" or "data sets" by defining a "grouping" protocol and creating a "Process Bundle" with a grouping protocol fqid and a list of input file references.

TLDR: Keep bundles semantics free, model analysis as a process/protocol, query by process type instead of bundle type.
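
The process bundle outlined above could be sketched in Python roughly as follows. The field names follow the outline (with the first spelled `process_instance_fqid`); the types, the `ProcessBundle` class, the helpers `select_by_protocol` and `downstream_of`, and every identifier in the sample data are hypothetical. The sketch only illustrates selecting bundles by `protocol_fqid` and following one edge of the process DAG.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass
class ProcessBundle:
    """Hypothetical container mirroring the outline; all types are assumed."""
    process_instance_fqid: str
    protocol_fqid: str
    input_files: List[str] = field(default_factory=list)   # references, if any
    output_files: List[str] = field(default_factory=list)  # the actual files
    process_parameters: Dict[str, Any] = field(default_factory=dict)
    biomaterial_metadata: Dict[str, Any] = field(default_factory=dict)


def select_by_protocol(bundles: Iterable[ProcessBundle], protocol_fqid: str) -> List[ProcessBundle]:
    """Return the bundles produced by instances of the given protocol."""
    return [b for b in bundles if b.protocol_fqid == protocol_fqid]


def downstream_of(bundles: Iterable[ProcessBundle], upstream: ProcessBundle) -> List[ProcessBundle]:
    """Return bundles whose inputs reference at least one output of `upstream`."""
    produced = set(upstream.output_files)
    return [b for b in bundles if produced.intersection(b.input_files)]


# Made-up identifiers, for illustration only.
sequencing = ProcessBundle("proc-1", "protocol/sequencing", output_files=["reads.fastq"])
analysis = ProcessBundle(
    "proc-2",
    "protocol/secondary-analysis",
    input_files=["reads.fastq"],
    output_files=["matrix.loom"],
)
bundles = [sequencing, analysis]

print([b.process_instance_fqid for b in select_by_protocol(bundles, "protocol/sequencing")])
print([b.process_instance_fqid for b in downstream_of(bundles, sequencing)])
```

Selecting by protocol is a flat filter, while anything that spans more than one process requires walking the input/output references.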

kislyuk · 2019-09-11T16:25:02Z

@NoopDog in an abstract DSS operating in a vacuum, what you propose makes sense. The problem is that operating without bundle types makes it eternally unclear what types of bundles applications should care about. In your example, how does an application implementer figure out which bundles to get?

In my opinion, after designing and operating Query Service, which attempts to make raw metadata directly queryable, making process/protocol metadata the sole source of information without allowing services to de-normalize it is a detriment to the usability of the system. The process/protocol graph requires sophisticated graph querying techniques to extract information from. It is an over-complexified information representation method for our experimental data.

diekhans · 2019-09-11T17:41:44Z

It is unclear why there are two RFCs related to bundle types (#86 and #93).

They are not clearly delineated into covering different aspects of bundle types. I believe we would be better off merging these two.

barkasn

I second the above: the two RFCs should be merged.

kislyuk · 2019-09-11T18:15:58Z

@diekhans @barkasn #93 addresses the design while #86 addresses implementation details. In my estimation, they can proceed separately, but I don't mind them being merged/unified.

I don't have the bandwidth to unify the two RFCs right now. Can someone else help me with the unification process?

diekhans · 2019-09-11T19:42:16Z

The authors of the two RFCs are the same, none of the authors have the bandwidth to work on them, and they are interdependent. If #86 is the implementation of #93, then #86 will be blocked by #93.

Really, I think the fastest short term approach also has the best long term value, which is to move the relevant part of #93 into #86 and then drop #93.

kislyuk · 2019-09-11T20:46:10Z

Great, thanks @diekhans. When you have some time, please go ahead and merge #93 into this RFC and close #93.

diekhans · 2019-09-12T16:27:10Z

I am not suggesting I have the bandwidth, only that the two RFCs have the same authors and they could save themselves and the reviewers time by combining them.

barkasn · 2019-09-12T18:47:22Z

@kislyuk I won't be able to help with the unification process either. I don't want to block you from proceeding with your work on this RFC, so I'll mark as approved. However, I see a number of problems here that I think should be resolved, and I would suggest you approach the respective POs and TL:

- It is not clear how the bundle types will be defined; the process to do so needs to be defined in this bundle.
- The bundle types listed in the two RFCs are nearly orthogonal to each other (one defining level of processing and the other data type); there is no clear scope for what the bundle types ought to be.
- It would be highly advantageous if the bundle types re-used an existing ontology or at least mapped several existing ontologies into bundle type ids.
- A clear way for hierarchical extensibility of the bundle types should be presented, and a consideration of how this would impact DSS search should be made.

Finally, I would like to point out that the rfc approval level tag needs to be updated

lauraclarke

I would like more discussion and a clear understanding of what is being requested with the bundle type definition registry. Charter responsibility and what is proposed in this RFC are in conflict.

TimothyTickle · 2019-09-14T00:14:08Z

rfcs/text/0000-bundle-types.md

+
+### User Stories
+
+1. As a tool developer, I would like to identify and get raw data in the DSS so that I can run my data processing and


Just a syntax note, numbering is not incrementing.

kislyuk · 2019-09-15

@TimothyTickle that is intentional. Markdown supports automatic numbered list incrementing, so that items can be re-arranged more easily. The rendered version increments the list properly.

TimothyTickle

It feels like this RFC simply formalizes de facto uses of bundles as formal bundles. If that is the intent, the summary should be updated.

Currently, the goal from the summary is "Formalizes data bundle types, definitions, target users/consumers, and specific use cases to improve clarity and confidence among DCP developers when working with the DCP data model.", and it feels like there is more information needed to hit that mark. What is missing for me is why are the bundles in the RFC correct? What criteria define a bundle type, how do we know when we need a new bundle, and when that occurs how is it defined? I do not understand from this document, for instance, why we would have a secondary analysis bundle and an expression matrix bundle (not to argue they should not exist, but the "why" is missing). If these are all correct, what specific need does each address? As I read the summary, it alludes to target users and use cases potentially being the key to understanding bundles, but that reasoning is not shown. If this is the belief of the author, those driving factors would hopefully be distilled into rules that help better define bundles and help us understand when we need more or when we reuse what we have. Without this, I do not believe I would have confidence in applying them to new problems (beyond, of course, just maintaining the status quo, which does not seem to be the intent).

I am not interested in dictating the scope of an RFC; I think it is healthy to be able to scope work in this space as the authors and implementation team need. But I am concerned that the dependent RFC does not seem complete, so relying on those definitions here seems premature.

This is crucial work and the effort of the authors is much appreciated.

diekhans

This is a very important RFC and deserves much more attention. This and RFC PR 93 need to be combined, as they are overlapping and have contradictions. Unfortunately, the authors don't have the bandwidth to polish this work.

Requested changes:

This RFC should start by defining what a bundle is, both functionally and as an architectural concept. This is something that is confusing to both the DCP developers and consumers. We have approved RFC 0010 that states "... the meaning of bundle is not stable or clearly defined at the moment." This RFC is the perfect opportunity to remedy this ongoing issue.

Several entries in the initial contents of the registry are speculative. We should include entries in the actual registry only when they are defined by another RFC. These are good as examples.

The actual implementation of the registry should either be defined or indicated as an unresolved issue.

The process for updating the registry should be defined. Does it require an RFC or simply a pull request adding a new type?

This RFC is dependent on a document that appears to be a proposed RFC: "Request for Comments: Media Type usage in the Data Coordination Platform" and seems to be stalled. This needs to be made into an RFC. It seems well written and this should be straightforward. This can be made non-blocking to this RFC by making it an unresolved issue and creating a ticket.

The relationship of the "Unresolved Questions" items to this RFC is not clear.

kislyuk · 2019-09-16T15:38:52Z

> It feels like this RFC simply formalizes de facto uses of bundles as formal bundles. If that is the intent, the summary should be updated.

Updated the summary to include "de facto".

> Currently, the goal from the summary is "Formalizes data bundle types, definitions, target users/consumers, and specific use cases to improve clarity and confidence among DCP developers when working with the DCP data model.", and it feels like there is more information needed to hit that mark. What is missing for me is why are the bundles in the RFC correct? What criteria define a bundle type, how do we know when we need a new bundle, and when that occurs how is it defined? I do not understand from this document, for instance, why we would have a secondary analysis bundle and an expression matrix bundle (not to argue they should not exist, but the "why" is missing). If these are all correct, what specific need does each address? As I read the summary, it alludes to target users and use cases potentially being the key to understanding bundles, but that reasoning is not shown. If this is the belief of the author, those driving factors would hopefully be distilled into rules that help better define bundles and help us understand when we need more or when we reuse what we have. Without this, I do not believe I would have confidence in applying them to new problems (beyond, of course, just maintaining the status quo, which does not seem to be the intent).

Added sections on this under Detailed Design, please take a look.

I must say I don't know how to specifically/precisely answer your question about why the bundle types are "correct". In my estimation, and as you mentioned, the contents of this RFC reflect the de facto data type hierarchy that we're handling today. I tried to convey an operational justification (applications need to pattern match data for their input) and a heuristic for when new types are needed.
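
As a rough illustration of that operational justification (applications pattern-matching data for their input), the sketch below filters bundle-type values with shell-style patterns using Python's standard `fnmatch` module. The `wanted_by` helper and the patterns are hypothetical, and the sample values are taken from the example table discussed in this PR; nothing here is a matching syntax prescribed by the RFC.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def wanted_by(patterns: Iterable[str], bundle_types: Iterable[str]) -> List[str]:
    """Return the bundle types that match at least one shell-style pattern.

    Hypothetical helper; the RFC does not define how applications match types.
    """
    pats = list(patterns)
    return [bt for bt in bundle_types if any(fnmatchcase(bt, p) for p in pats)]


# Example values from the table discussed in this PR.
bundle_types = [
    "hca/project",
    "hca/primary-data; hca-data-type:SmartSeq2",
    "hca/primary-data; hca-data-type:sptx",
    "hca/analysis-output; hca-pipeline:snap-atac",
]

# An application that consumes any primary data, whatever its data type.
print(wanted_by(["hca/primary-data*"], bundle_types))
```

With a type string on each bundle, an application can state its inputs as a short list of patterns instead of traversing process/protocol metadata.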

kislyuk · 2019-09-16T16:11:56Z

> This RFC should start by defining what a bundle is, both functionally and as an architectural concept. This is something that is confusing to both the DCP developers and consumers. We have approved RFC 0010 that states "... the meaning of bundle is not stable or clearly defined at the moment." This RFC is the perfect opportunity to remedy this ongoing issue.

I added a section under Detailed Design, "What is a bundle", which attempts to provide an operational definition. Please take a look.

> Several entries in the initial contents of the registry are speculative. We should include entries in the actual registry only when they are defined by another RFC. These are good as examples.

To my knowledge, all of the entries in the table either reflect data types in real bundles or reflect data types in outputs of applications whose operators would like to deposit data into the DCP, but currently cannot because the DCP hasn't given them specific guidance on how to do so. If you have specific objections or corrections, can you please point them out and suggest alternatives?

> The actual implementation of the registry should either be defined or indicated as an unresolved issue.
>
> The process for updating the registry should be defined. Does it require an RFC or simply a pull request adding a new type?

Added a paragraph under "Registry of bundle types". Does that level of detail look sufficient? I went with the simpler option that you propose.

> This RFC is dependent on a document that appears to be a proposed RFC: "Request for Comments: Media Type usage in the Data Coordination Platform" and seems to be stalled. This needs to be made into an RFC. It seems well written and this should be straightforward. This can be made non-blocking to this RFC by making it an unresolved issue and creating a ticket.

That document predates the formal DCP RFC process and as such is not subject to the requirements of our RFC process (just like other external documents cited by our other RFCs). It is not "stalled" - instead, it was agreed upon and implemented.

However, for completeness and to avoid confusion, I imported it into Markdown and opened a PR here: #113.

> The relationship of the "Unresolved Questions" items to this RFC is not clear.

Thank you, that was an editing oversight. Removed those questions and added a couple to reflect the discussion above.

kislyuk · 2019-09-16T16:32:25Z

Everyone, thank you for the productive comments here. Due to numerous requests, I carved out some time and merged the contents of #86 into #93. I'm closing this PR, let's pick up the discussion in #93. Please quote relevant snippets of comments when discussing. Thank you!

kislyuk added 3 commits July 11, 2019 14:50

WIP (82b1302)

Wording fixup (4734480)

Add Query Service use case (c53a3ce)

mweiden reviewed Jul 19, 2019

kislyuk added the rfc-community-review label Jul 30, 2019

kislyuk marked this pull request as ready for review July 30, 2019 21:34

kislyuk added the Architecture label Aug 13, 2019

justincc self-requested a review August 15, 2019 17:50

justincc approved these changes Aug 15, 2019

xbrianh approved these changes Aug 15, 2019

jahilton reviewed Sep 3, 2019

samanehsan reviewed Sep 6, 2019

NoopDog self-requested a review September 11, 2019 08:22

NoopDog mentioned this pull request Sep 11, 2019 in "RFC: HCA DCP Bundle Types and Definitions #93" (Closed)

barkasn self-requested a review September 11, 2019 17:47

barkasn suggested changes Sep 11, 2019

kislyuk added 3 commits September 11, 2019 11:11

Add shepherd (b4d5f5e)

Expand reference bundle user story to clarify (43e0d86)

Rename analysis-support to analysis-reference (4d0a588)

barkasn approved these changes Sep 12, 2019

lauraclarke requested changes Sep 13, 2019

TimothyTickle reviewed Sep 14, 2019

TimothyTickle requested changes Sep 14, 2019

diekhans requested changes Sep 14, 2019

Clarify bundle type registry ownership (2c4a65e)

kislyuk requested a review from lauraclarke September 15, 2019 20:42

Address comments in review (6f3c135)

kislyuk requested a review from TimothyTickle September 16, 2019 15:39

WIP (f6d2cad)

kislyuk requested a review from diekhans September 16, 2019 16:12

kislyuk added a commit that referenced this pull request Sep 16, 2019: Merge #86 into #93 (d968507)

kislyuk closed this Sep 16, 2019

kislyuk deleted the akislyuk-rfc-bundle-types branch September 16, 2019 16:32

kislyuk removed the rfc-community-review label Sep 16, 2019

brianraymor added the rfc-withdrawn label (RFC is withdrawn from consideration) Sep 17, 2019

brianraymor mentioned this pull request Sep 17, 2019 in "Processing Datasets that Span Multiple Data Collection Runs #88" (Merged)
