-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Bundle Types #86
Conversation
I'm confused by This RFC depends on #93 and further defines the details of how bundle types will be represented in the DSS. when #93 is not approved. That's like taking a dependency in one standard on another which is in draft. It also states Last call for community review: August 30 - was this announced on dcp slack? |
The author of #93 has left the project, so I'm going to take over authorship of that RFC. Announced on the #dcp channel and extended the community review by a week. |
rfcs/text/0000-bundle-types.md
Outdated
|
||
| Bundle Type | Example `bundle-type` value | | ||
|--------------------------|-----------------------------------------------------------| | ||
| Project Metadata Bundle | `hca/project` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's my understanding that "Project Metadata Bundle" isn't just classifying an existing type of bundle, but introducing new bundling, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. A bundle idea that has been kicked around for a long time. Nobody has tried to justify, tightly define or implement it yet to my knowledge.
rfcs/text/0000-bundle-types.md
Outdated
| Primary Sequence Bundle | `hca/primary-data; hca-data-type:SmartSeq2` | | ||
| Primary Imaging Bundle | `hca/primary-data; hca-data-type:sptx` | | ||
| Secondary Analysis Bundle| `hca/analysis-output; hca-pipeline:snap-atac` | | ||
| Reference Resource Bundle| `hca/analysis-support; hca-resource-type:reference-genome`| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the reasoning behind naming this bundle-type value hca/analysis-support
? Something like hca/analysis-reference
seems clearer to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@samanehsan your suggestion sounds good to me, adopting. Thanks!
rfcs/text/0000-bundle-types.md
Outdated
| Bundle Type | Example `bundle-type` value | | ||
|--------------------------|-----------------------------------------------------------| | ||
| Project Metadata Bundle | `hca/project` | | ||
| Primary Sequence Bundle | `hca/primary-data; hca-data-type:SmartSeq2` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is hca-data-type
here equivalent to the library_construction_method
ontology defined in the submission?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, I just saw the discussion about this field here: #86 (comment)
It seems to me that Mallory raises legitimate concerns of duplicating information that is in the metadata in the bundle type schema. I understand that it can be useful to denormalize/duplicate for efficiency and ease of querying but perhaps we should instead try to make the metadata in the bundles more queryable rather than duplicating metadata and the metadata schema in the this bundle type schema. I feel we might have more flexibility for the future if we kept to a model based on the elegant process/protocol concepts that already exist in the metadata schema. Rather than explicitly defining new bundle types we could, for example, have a single bundle type, "process" that is an instance of a protocol that is defined in the metadata standard with all the benefits of the metadata type system. Then everything becomes a DAG of processes with the outputs of one process being the inputs of another, downstream process. We would have to create some additional processes to reflect some DCP specific activities such as project creation, and more accurately model the library prep/run relationship for example but in the end we would be able to query by process types specified by the metadata schema instead of querying by a 2nd type system (the bundle type system) and having to map between the two systems. The main idea here is that an "primary sequence bundle" is the output of a sequencing process (an instance of a sequencing protocol). A "secondary analysis bundle" is the output of a secondary analysis process (instance of a secondary analysis protocol with a specific subtype "Optimus Prime Version x.xx"). For example if you have a process defined (roughly ) as:
This is similar to what we have now and what is proposed, but avoids creating a bundle type schema and lets one subscribe to notifications on protocol_fqid. (or modify to include the protocol name for humans) This also allows arbitrary "data groups" or "data sets" by defining a "grouping" protocol and creating a "Process Bundle" with a grouping protocol fqid and a list of input file references. TLDR: Keep bundles semantics free, model analysis as a process/protocol, query by process type instead of bundle type. |
@NoopDog in an abstract DSS operating in a vacuum, what you propose makes sense. The problem is that operating without bundle types makes it eternally unclear what types of bundles applications should care about. In your example, how does an application implementer figure out which bundles to get? In my opinion, after designing and operating Query Service, which attempts to make raw metadata directly queryable, making process/protocol metadata the sole source of information without allowing services to de-normalize it is a detriment to the usability of the system. The process/protocol graph requires sophisticated graph querying techniques to extract information from. It is an over-complexified information representation method for our experimental data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I second the above the two RFCs should be merged
The authors of the two RFCs are the same, so if none of the authors have the bandwidth to work on them and they are interdependent. If #86 is the implementation of #93, then it #86 will be blocked by #93. Really, I think the fastest short term approach also has the best long term value, which is to move the relevant part of #93 into #86 and then drop #93. |
I am not suggesting I have the bandwidth. Only that the two RFCs have the same authors and they could save themselves and the reviews time by combining them. |
@kislyuk I won't be able to help with the unification process either. I don't want to block you from proceeding with your work on this RFC so I'll mark as approved, however I see a number of problems here that I think should be resolved and I would suggest you approach the respective POs and TL:
Finally, I would like to point out that the rfc approval level tag needs to be updated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like more discussion and a clear understanding of what is being requested with the bundle type definition registry. Charter responsibility and what is proposed in this RFC are in conflict
|
||
### User Stories | ||
|
||
1. As a tool developer, I would like to identify and get raw data in the DSS so that I can run my data processing and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a syntax note, numbering is not incrementing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TimothyTickle that is intentional. Markdown supports automatic numbered list incrementing, so that items can be re-arranged more easily. The rendered version increments the list properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like this RFC simply formalizes de facto uses of bundles as formal bundles. If that is the intent, the summary should be updated.
Currently, the goal from the summary is "Formalizes data bundle types, definitions, target users/consumers, and specific use cases to improve clarity and confidence among DCP developers when working with the DCP data model.", and it feels like there is more information needed to hit that mark. What is missing for me is why are the bundles in the RFC correct? What criteria define a bundle type, how do we know when we need a new bundle, and when that occurs how is it defined? I do not understand from this document, for instance, why we would have a secondary analysis bundle and an expression matrix bundle (not to argue they should not exist but "why" is missing). If these are all correct, what specific need do each address? As I read the summary, it alludes to potentially target users and use cases may be the key to understanding bundles but that reasoning is not shown. If this is the belief of the author, those driving factors would hopefully be distilled into rules that help better define bundles and help us understand when we need more or when we reuse what we have. Without this, I do not believe I would have confidence in applying them to new problems (beyond of course just doing status quota, which does not seem to be the intent).
I am not interested in dictating the scope of an RFC, I think it is healthy to be able to scope work in this space as the authors and implementation team need but I am concerned that the dependent RFC does not seem complete, relying on those definitions here seems premature.
This is crucial work and the effort of the authors is much appreciated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very important RFC and deserves much more attention. This and RFC PR 93 need be combined, as they are overlapping and have contradictions. Unfortunately, the authors don't have the bandwidth to polish this work.
Requested changes:
This RFC should start by defining what a bundle is, both functionally and as an architectural concept. This is something that is confusing to both the DCP developers and consumers. We have approved RFC 0010 that states "... the meaning of bundle is not stable or clearly defined at the moment." This RFC is the perfect opportunity to remedy this ongoing issue.
Several entries in initial contents of the registry are speculative. We should only include in the actual registry when they are defined by other RFC. These are good as examples.
The actual implementation of the registry should either be defined or indicated as an unresolved issue.
The process for updating the registry should be defined. Does it require an RFC or simply a pull request adding a new type?
This RFC is dependent on a document that appears to be a proposed RFC: "Request for Comments: Media Type usage in the Data Coordination Platform" and seems to be stalled. This needs to be made into an RFC. It seems well written and this should be straight-forward. This can be made non-blocking to this RFC by making it an unresolved issue and creating a ticket.
The relationship of the "Unresolved Questions" items to this RFC is not clear.
Updated the summary to include "de facto".
Added sections on this under Detailed Design, please take a look. I must say I don't know how to specifically/precisely answer your question about why the bundle types are "correct". To my estimation and as you mentioned, the contents of this RFC reflect the de facto data type hierarchy that we're handling today. I tried to convey an operational justification (applications need to pattern match data for their input) and a heuristic for when new types are needed. |
I added a section under Detailed Design, What is a bundle, which attempts to provide an operational definition. Please take a look.
To my knowledge, all of the entries in the table either reflect data types in real bundles or reflect data types in outputs of applications whose operators would like to deposit data into the DCP, but currently cannot because the DCP hasn't given them specific guidance on how to do so. If you have specific objections or corrections, can you please point them out and suggest alternatives?
Added a paragraph under Registry of bundle types, does that level of detail look sufficient? I went with the simpler option that you propose.
That document predates the formal DCP RFC process and as such is not subject to the requirements of our RFC process (just like other external documents cited by our other RFCs). It is not "stalled" - instead, it was agreed upon and implemented. However, for completeness and to avoid confusion, I imported it into Markdown and opened a PR here: #113.
Thank you, that was an editing oversight. Removed those questions and added a couple to reflect the discussion above. |
This RFC depends on #93 and further defines the details of how bundle types will be represented in the DSS.
Last call for community review: Sept 6