Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Bundle Types #86

Closed
wants to merge 9 commits into from
Closed

RFC: Bundle Types #86

wants to merge 9 commits into from

Conversation

kislyuk
Copy link
Member

@kislyuk kislyuk commented Jul 15, 2019

This RFC depends on #93 and further defines the details of how bundle types will be represented in the DSS.

Last call for community review: Sept 6

@kislyuk kislyuk marked this pull request as ready for review July 30, 2019 21:34
@justincc justincc self-requested a review August 15, 2019 17:50
@brianraymor
Copy link
Collaborator

I'm confused by This RFC depends on #93 and further defines the details of how bundle types will be represented in the DSS. when #93 is not approved. That's like taking a dependency in one standard on another which is in draft. It also states Last call for community review: August 30 - was this announced on dcp slack?

@kislyuk
Copy link
Member Author

kislyuk commented Aug 30, 2019

The author of #93 has left the project, so I'm going to take over authorship of that RFC.

Announced on the #dcp channel and extended the community review by a week.

rfcs/text/0000-bundle-types.md Outdated Show resolved Hide resolved

| Bundle Type | Example `bundle-type` value |
|--------------------------|-----------------------------------------------------------|
| Project Metadata Bundle | `hca/project` |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's my understanding that "Project Metadata Bundle" isn't just classifying an existing type of bundle, but introducing new bundling, right?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. A bundle idea that has been kicked around for a long time. Nobody has tried to justify, tightly define or implement it yet to my knowledge.

| Primary Sequence Bundle | `hca/primary-data; hca-data-type:SmartSeq2` |
| Primary Imaging Bundle | `hca/primary-data; hca-data-type:sptx` |
| Secondary Analysis Bundle| `hca/analysis-output; hca-pipeline:snap-atac` |
| Reference Resource Bundle| `hca/analysis-support; hca-resource-type:reference-genome`|

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the reasoning behind naming this bundle-type value hca/analysis-support? Something like hca/analysis-reference seems clearer to me

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@samanehsan your suggestion sounds good to me, adopting. Thanks!

| Bundle Type | Example `bundle-type` value |
|--------------------------|-----------------------------------------------------------|
| Project Metadata Bundle | `hca/project` |
| Primary Sequence Bundle | `hca/primary-data; hca-data-type:SmartSeq2` |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is hca-data-type here equivalent to the library_construction_method ontology defined in the submission?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, I just saw the discussion about this field here: #86 (comment)

@NoopDog
Copy link

NoopDog commented Sep 11, 2019

It seems to me that Mallory raises legitimate concerns of duplicating information that is in the metadata in the bundle type schema. I understand that it can be useful to denormalize/duplicate for efficiency and ease of querying but perhaps we should instead try to make the metadata in the bundles more queryable rather than duplicating metadata and the metadata schema in the this bundle type schema.

I feel we might have more flexibility for the future if we kept to a model based on the elegant process/protocol concepts that already exist in the metadata schema.

Rather than explicitly defining new bundle types we could, for example, have a single bundle type, "process" that is an instance of a protocol that is defined in the metadata standard with all the benefits of the metadata type system. Then everything becomes a DAG of processes with the outputs of one process being the inputs of another, downstream process.

We would have to create some additional processes to reflect some DCP specific activities such as project creation, and more accurately model the library prep/run relationship for example but in the end we would be able to query by process types specified by the metadata schema instead of querying by a 2nd type system (the bundle type system) and having to map between the two systems.

The main idea here is that an "primary sequence bundle" is the output of a sequencing process (an instance of a sequencing protocol).

A "secondary analysis bundle" is the output of a secondary analysis process (instance of a secondary analysis protocol with a specific subtype "Optimus Prime Version x.xx").

For example if you have a process defined (roughly ) as:

PROCESS BUNDLE

process_instance_fquid
protocol_fqid
input_files[] (references , if any)
output_files[] (the actual files)
process_parameters{}
biomaterial_metadata (donor, collection, handling, prep)

This is similar to what we have now and what is proposed, but avoids creating a bundle type schema and lets one subscribe to notifications on protocol_fqid. (or modify to include the protocol name for humans)

This also allows arbitrary "data groups" or "data sets" by defining a "grouping" protocol and creating a "Process Bundle" with a grouping protocol fqid and a list of input file references.

TLDR: Keep bundles semantics free, model analysis as a process/protocol, query by process type instead of bundle type.

@NoopDog NoopDog self-requested a review September 11, 2019 08:22
@kislyuk
Copy link
Member Author

kislyuk commented Sep 11, 2019

@NoopDog in an abstract DSS operating in a vacuum, what you propose makes sense. The problem is that operating without bundle types makes it eternally unclear what types of bundles applications should care about. In your example, how does an application implementer figure out which bundles to get?

In my opinion, after designing and operating Query Service, which attempts to make raw metadata directly queryable, making process/protocol metadata the sole source of information without allowing services to de-normalize it is a detriment to the usability of the system. The process/protocol graph requires sophisticated graph querying techniques to extract information from. It is an over-complexified information representation method for our experimental data.

@diekhans
Copy link
Contributor

It is unclear why there are two RFCs related to bundle (#86 and #93).

They are not clearly delineated into covering different aspects of bundle types. I believe we would be better off merging these two.

@barkasn barkasn self-requested a review September 11, 2019 17:47
Copy link
Contributor

@barkasn barkasn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I second the above the two RFCs should be merged

@kislyuk
Copy link
Member Author

kislyuk commented Sep 11, 2019

@diekhans @barkasn #93 addresses the design while #86 addresses implementation details. In my estimation, they can proceed separately, but I don't mind them being merged/unified.

I don't have the bandwidth to unify the two RFCs right now. Can someone else help me with the unification process?

@diekhans
Copy link
Contributor

The authors of the two RFCs are the same, so if none of the authors have the bandwidth to work on them and they are interdependent. If #86 is the implementation of #93, then it #86 will be blocked by #93.

Really, I think the fastest short term approach also has the best long term value, which is to move the relevant part of #93 into #86 and then drop #93.

@kislyuk
Copy link
Member Author

kislyuk commented Sep 11, 2019

Great, thanks @diekhans. When you have some time, please go ahead and merge #93 into this RFC and close #93.

@diekhans
Copy link
Contributor

I am not suggesting I have the bandwidth. Only that the two RFCs have the same authors and they could save themselves and the reviews time by combining them.

@barkasn
Copy link
Contributor

barkasn commented Sep 12, 2019

@kislyuk I won't be able to help with the unification process either. I don't want to block you from proceeding with your work on this RFC so I'll mark as approved, however I see a number of problems here that I think should be resolved and I would suggest you approach the respective POs and TL:

  • It is not clear how the bundle types will be defined, the process to do so needs to be defined in this bundle
  • The bundle types listed in the two RFCs are nearly orthogonal to each other (one defining level of processing and the other data type), there is no clear scope for what the bundle types ought to be
  • It would be highly advantageous if the bundle types re-used an existing ontology or at least mapped several existing ontologies into bundle type ids.
  • A clear way for hierarchical extensibility of the bundle types should be presented and a consideration of how this would impact DSS search should be made

Finally, I would like to point out that the rfc approval level tag needs to be updated

Copy link
Member

@lauraclarke lauraclarke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like more discussion and a clear understanding of what is being requested with the bundle type definition registry. Charter responsibility and what is proposed in this RFC are in conflict

rfcs/text/0000-bundle-types.md Outdated Show resolved Hide resolved

### User Stories

1. As a tool developer, I would like to identify and get raw data in the DSS so that I can run my data processing and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a syntax note, numbering is not incrementing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TimothyTickle that is intentional. Markdown supports automatic numbered list incrementing, so that items can be re-arranged more easily. The rendered version increments the list properly.

Copy link
Contributor

@TimothyTickle TimothyTickle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like this RFC simply formalizes de facto uses of bundles as formal bundles. If that is the intent, the summary should be updated.

Currently, the goal from the summary is "Formalizes data bundle types, definitions, target users/consumers, and specific use cases to improve clarity and confidence among DCP developers when working with the DCP data model.", and it feels like there is more information needed to hit that mark. What is missing for me is why are the bundles in the RFC correct? What criteria define a bundle type, how do we know when we need a new bundle, and when that occurs how is it defined? I do not understand from this document, for instance, why we would have a secondary analysis bundle and an expression matrix bundle (not to argue they should not exist but "why" is missing). If these are all correct, what specific need do each address? As I read the summary, it alludes to potentially target users and use cases may be the key to understanding bundles but that reasoning is not shown. If this is the belief of the author, those driving factors would hopefully be distilled into rules that help better define bundles and help us understand when we need more or when we reuse what we have. Without this, I do not believe I would have confidence in applying them to new problems (beyond of course just doing status quota, which does not seem to be the intent).

I am not interested in dictating the scope of an RFC, I think it is healthy to be able to scope work in this space as the authors and implementation team need but I am concerned that the dependent RFC does not seem complete, relying on those definitions here seems premature.

This is crucial work and the effort of the authors is much appreciated.

Copy link
Contributor

@diekhans diekhans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very important RFC and deserves much more attention. This and RFC PR 93 need be combined, as they are overlapping and have contradictions. Unfortunately, the authors don't have the bandwidth to polish this work.

Requested changes:

This RFC should start by defining what a bundle is, both functionally and as an architectural concept. This is something that is confusing to both the DCP developers and consumers. We have approved RFC 0010 that states "... the meaning of bundle is not stable or clearly defined at the moment." This RFC is the perfect opportunity to remedy this ongoing issue.

Several entries in initial contents of the registry are speculative. We should only include in the actual registry when they are defined by other RFC. These are good as examples.

The actual implementation of the registry should either be defined or indicated as an unresolved issue.

The process for updating the registry should be defined. Does it require an RFC or simply a pull request adding a new type?

This RFC is dependent on a document that appears to be a proposed RFC: "Request for Comments: Media Type usage in the Data Coordination Platform" and seems to be stalled. This needs to be made into an RFC. It seems well written and this should be straight-forward. This can be made non-blocking to this RFC by making it an unresolved issue and creating a ticket.

The relationship of the "Unresolved Questions" items to this RFC is not clear.

@kislyuk kislyuk requested a review from lauraclarke September 15, 2019 20:42
@kislyuk
Copy link
Member Author

kislyuk commented Sep 16, 2019

It feels like this RFC simply formalizes de facto uses of bundles as formal bundles. If that is the intent, the summary should be updated.

Updated the summary to include "de facto".

Currently, the goal from the summary is "Formalizes data bundle types, definitions, target users/consumers, and specific use cases to improve clarity and confidence among DCP developers when working with the DCP data model.", and it feels like there is more information needed to hit that mark. What is missing for me is why are the bundles in the RFC correct? What criteria define a bundle type, how do we know when we need a new bundle, and when that occurs how is it defined? I do not understand from this document, for instance, why we would have a secondary analysis bundle and an expression matrix bundle (not to argue they should not exist but "why" is missing). If these are all correct, what specific need do each address? As I read the summary, it alludes to potentially target users and use cases may be the key to understanding bundles but that reasoning is not shown. If this is the belief of the author, those driving factors would hopefully be distilled into rules that help better define bundles and help us understand when we need more or when we reuse what we have. Without this, I do not believe I would have confidence in applying them to new problems (beyond of course just doing status quota, which does not seem to be the intent).

Added sections on this under Detailed Design, please take a look.

I must say I don't know how to specifically/precisely answer your question about why the bundle types are "correct". To my estimation and as you mentioned, the contents of this RFC reflect the de facto data type hierarchy that we're handling today. I tried to convey an operational justification (applications need to pattern match data for their input) and a heuristic for when new types are needed.

@kislyuk
Copy link
Member Author

kislyuk commented Sep 16, 2019

This RFC should start by defining what a bundle is, both functionally and as an architectural concept. This is something that is confusing to both the DCP developers and consumers. We have approved RFC 0010 that states "... the meaning of bundle is not stable or clearly defined at the moment." This RFC is the perfect opportunity to remedy this ongoing issue.

I added a section under Detailed Design, What is a bundle, which attempts to provide an operational definition. Please take a look.

Several entries in initial contents of the registry are speculative. We should only include in the actual registry when they are defined by other RFC. These are good as examples.

To my knowledge, all of the entries in the table either reflect data types in real bundles or reflect data types in outputs of applications whose operators would like to deposit data into the DCP, but currently cannot because the DCP hasn't given them specific guidance on how to do so. If you have specific objections or corrections, can you please point them out and suggest alternatives?

The actual implementation of the registry should either be defined or indicated as an unresolved issue.
The process for updating the registry should be defined. Does it require an RFC or simply a pull request adding a new type?

Added a paragraph under Registry of bundle types, does that level of detail look sufficient? I went with the simpler option that you propose.

This RFC is dependent on a document that appears to be a proposed RFC: "Request for Comments: Media Type usage in the Data Coordination Platform" and seems to be stalled. This needs to be made into an RFC. It seems well written and this should be straight-forward. This can be made non-blocking to this RFC by making it an unresolved issue and creating a ticket.

That document predates the formal DCP RFC process and as such is not subject to the requirements of our RFC process (just like other external documents cited by our other RFCs). It is not "stalled" - instead, it was agreed upon and implemented.

However, for completeness and to avoid confusion, I imported it into Markdown and opened a PR here: #113.

The relationship of the "Unresolved Questions" items to this RFC is not clear.

Thank you, that was an editing oversight. Removed those questions and added a couple to reflect the discussion above.

@kislyuk kislyuk requested a review from diekhans September 16, 2019 16:12
kislyuk added a commit that referenced this pull request Sep 16, 2019
@kislyuk
Copy link
Member Author

kislyuk commented Sep 16, 2019

Everyone, thank you for the productive comments here. Due to numerous requests, I carved out some time and merged the contents of #86 into #93. I'm closing this PR, let's pick up the discussion in #93. Please quote relevant snippets of comments when discussing. Thank you!

@kislyuk kislyuk closed this Sep 16, 2019
@kislyuk kislyuk deleted the akislyuk-rfc-bundle-types branch September 16, 2019 16:32
@brianraymor brianraymor added the rfc-withdrawn RFC is withdrawn from consideration label Sep 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Architecture rfc-withdrawn RFC is withdrawn from consideration
Projects
None yet
Development

Successfully merging this pull request may close these issues.