Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: HCA DCP Bundle Types and Definitions #93
RFC: HCA DCP Bundle Types and Definitions #93
Changes from 22 commits
14c0b4b
f58d817
f9daf4a
6ce0cc0
c4f3735
ed916a1
e4c5e15
e44d743
044a3bd
4f8e7bc
5028e14
055b9ff
67df958
f898c18
f05242e
8244ca9
fa0a66e
ff6921c
efbcc3a
b6c8828
bb551e7
439cb91
0db9fd2
d968507
7ab8aa6
8a448cb
f4d0722
a7b7953
b14ccb5
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we adopt more and more scientific use cases this definition (1) will become less and less useful. With more use cases, files will need to be dynamically aggregated to fulfill specific use cases (as illustrated in the multlibrary RFA). If this is a hard grouping (a transaction of files) then would you group files only by run or only by release or biological network? None of these would be the right answer, but all would be needed groupings for scientific use cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TimothyTickle thanks for pointing this out. I updated and expanded the bundle type definition while merging the contents of #86 into this PR - please take another look.
I disagree that the "transactional unit" definition will become less useful. Operationally, a bundle is the output of an application run. It has to be grouped transactionally for data integrity puproses, and decorated with metadata for discoverability purposes. Whether the grouping is done by release, biological network, etc. is up to the application developers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed with @kislyuk ; transactions are key to data integrity and it is necessary for DCP components to think deeply about data integrity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "tertiary (expression)" is unnecessarily confusing. Tertiary analysis can be any meta-analysis, it should not be constrained to be "expression", which itself is a confusing shorthand for "expression matrix".
I propose renaming:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with renaming secondary and tertiary (in fact, that's the point of this RFC!). If we are renaming those to be nouns, we should also rename the primary bundle, perhaps:
To clarify, "Secondary/Analysis bundles" also contain expression data (gene x cell count matrices) in addition to alignment results, so I wonder if it's confusing to have these results not in the "Expression Matrix bundle".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Raw" doesn't seem very meaningful. Someone looking at gene expression (most likely the majority of users, might think of unnormalized counts as "raw".
"Matrix" is a type of data structured, so it seems odd to include it in the name. Kind of like calling sequence experiments "fastaq bundles". How about "gene expression bundle"
Analysis doesn't have much meaning either. I was kind of thinking gene expression is the result of analysis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gene expression matrices are a result of both pipelines and the matrix service (pipelines make them, matrix service acts on them for users).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure why the above bundles were defined in this manner other than this is de facto, but I am not sure that is the intent of this RFC, or at least does not feel like a formal understanding of bundles but more of a documentation of what we are using (without comment or systematic definitions).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TimothyTickle I believe that is the intent. I don't think we need to "formally understand bundles". We need to define bundle types and annotate them with enough information for application developers to easily reason about them, so they can develop applications and present data to users with reasonable certainty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These new bundle types and definitions are not complete but I would love to review them when they are. Feel free to reach out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about internal users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, great point. Added "and DCP services" to the above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this RFC is still in draft status, but I wanted to note here that this AC doesn't seem to fit in with the rest of the document, which focuses on defining bundle types rather than hiding bundle details from data consumers. Maybe it can be re-worded or removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @samanehsan, I agree it's not addressed by the RFC in its current form. I'll remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to me the AC above is a valid constraint on the solution, that it does not further bake in the bundle concept to the DCP interface. Users care about releases, the files in them, and the processes used to create the files.
We should be careful not to force hundereds/thousands of users to also learn our "bundle types" when we already have metadata describing the processes that creates the files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NoopDog That may be true, but the content of this RFC as it stands does not change the status quo in that respect nor pass judgment on how data is presented to the user.
More generally, any advocate for removing bundles from the presentation layer should come up with a concrete, actionable proposal (perhaps in another RFC) for what to replace them with.