Contributed Pipelines RFC #116

Open · wants to merge 3 commits into master

Conversation

@kbergin (Contributor) commented Sep 17, 2019

This RFC outlines a new process for pipelines coming into the DCP. It includes proposals for technical and scientific requirements of those pipelines and a draft of user facing contribution guidelines.

Status: Community Review
Last call for community review: October 8th

cc @ambrosejcarr @TimothyTickle

@jahilton (Contributor) left a comment

This sounds very much in line with Aim 1 from the first draft of the DCP Strategy, but the PLs revised that to a Strategy that drastically de-emphasizes contributed pipelines. In general, this is a LOT of work that adds value for a very limited number of users, and without clear plans laid out to prevent confusion and uncertainty, it actually risks reducing value for many users.

This RFC discusses (1) technical and scientific standards to determine when pipelines contributed by community members are ready for inclusion in the DCP, (2) a potential contribution process, (3) the prioritization of such pipelines, and (4) off-boarding pipelines when they lose value or cease to fulfill standard requirements.

## User Stories
- As a user of the DCP (both user archetypes, but researcher with a pipette especially), I am looking for all raw data in the HCA DCP to be analyzed. I trust the scientific community to write high quality pipelines.
Contributor:

This user story can be accomplished by taking contributed pipelines only for data types that don't have an existing pipeline available, which I am interpreting to be a much more limited scope than the current RFC.

Contributor (author):

The current RFC doesn't explicitly state that we won't accept pipelines for data types that already have other pipelines, but it does state that the choice to prioritize adding a contributed pipeline would be based on many factors including "Whether there is an existing pipeline in the HCA that serves the same data as the contributed pipeline."

If we had multiple pipelines contributed for the same data type we also stated that we could deprecate one of them because "Pipeline is one of several competing pipelines for an assay and consensus is reached that one of the other pipelines is preferred."

I agree that having many pipelines for one assay could be confusing to users if not done carefully. I personally would be comfortable with trying to limit it, at least at first, but I know that adds the risk of choosing favorites, or of declaring the pipeline that got there first the winner, an idea that some folks who helped with this draft didn't love.

Contributor:

I agree that 'winner based on who got there first' should be avoided. I like Ambrose's idea below of "The DCP team sources contributions by opening issues for specific assays". This may mitigate that concern by 'opening a call', setting deadlines, and allowing valuable scientific discussions weighing different pipelines.
So I do find this an incredibly compelling User Story (and far more compelling than the other two), but I envisioned a less risky proposal with tighter bounds that still fulfills this User Story.

- Provide pipeline outputs using standard formats (if such standards exist)
- Utilize public, open source tools.

Data produced by Contributed Pipelines will be clearly marked as non-release data, to distinguish it from data eligible for release that was generated by AWG-vetted pipelines. Not all DCP services will be available for non-release data.
Contributor:

Can you clarify this? The way I read it is that the DCP will take pipelines but never release any results that were generated by them.

Contributor (author):

You've interpreted this correctly, that the analysis results would be available in the DCP but not in any HCA releases.

Do you have a suggestion on how to write it more clearly? Thanks!

Contributor:

I think, based on the definitions in the DataOps charter (not approved yet, but we're trying to set some terms), this is intended to be read something like...

Data produced by Contributed Pipelines will be accessible to the community, but will be clearly distinguished from data produced by AWG-vetted pipelines and will not be eligible for inclusion in HCA Snapshots or other data distributions. There may be other DCP services that also include only data from AWG-vetted pipelines.

Contributor:

I think I'm still confused about the "clearly marked" portion of this sentence. Where is the marking ideally located?

Contributor:

Also, we should probably be clear on which DCP services will surface this data from non-AWG-vetted pipelines to the user... how would we best designate that?
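
One hypothetical shape such a designation could take — the field names are invented for illustration, since the RFC leaves the exact location of the marking open:

```python
# Hypothetical metadata record for one analysis result. This only sketches the
# idea that a provenance flag travels with the data and that downstream DCP
# services filter on it; none of these field names come from the RFC.
RESULT_METADATA = {
    "analysis_id": "example-analysis",                  # illustrative identifier
    "pipeline_name": "contributed/example-pipeline",    # hypothetical name
    "pipeline_provenance": "contributed",               # vs. "awg_vetted"
    "release_eligible": False,                          # excluded from HCA releases
}

def include_in_release(record: dict) -> bool:
    """A release-building service would keep only AWG-vetted, eligible results."""
    return (record.get("pipeline_provenance") == "awg_vetted"
            and record.get("release_eligible", False))
```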

Although the characteristics of contributed pipelines are not as well understood as those of standard HCA pipelines, we should take steps to avoid installing a pipeline into production that could produce misleading or bad scientific data. To accomplish this, contributed pipelines should demonstrate that they have been vetted by members of the scientific community, in addition to meeting the technical requirements for consideration. Contributed pipelines must be shown to produce meaningful scientific results, a requirement that is met by any of the following:
- Has produced data that is used extensively in analysis found in a published, peer-reviewed manuscript.
- Produces data that is shown to replicate analysis found in a published, peer-reviewed paper.
- Is a known pipeline to an AWG member who is willing to vouch for the pipeline.
Contributor:

This puts the statement above about avoiding misleading/bad scientific data at great risk. If that is the goal, then requirements should be purely scientific to reflect that.

- Has produced data that is used extensively in analysis found in a published, peer-reviewed manuscript.
- Produces data that is shown to replicate analysis found in a published, peer-reviewed paper.
- Is a known pipeline to an AWG member who is willing to vouch for the pipeline.
- Is a pipeline that is used by 3 or more experts in the field, all of whom confirm they have used it to successfully analyze data and can point to that analysis.
Contributor:

Again, this is drifting from scientific validation.

Contributor (author):

In this proposal, we don't suggest a requirement for scientific validation per se, or at least not in the same way we do validations on the pipelines team. We tried to identify ways that science is validated in the scientific community as reasonable markers of scientific validation for this process.

In your experience, do you think this and the bullet above wouldn't be sufficient to trust the scientific validity of a pipeline? Do you have ideas on other requirements we could add that could help us trust the scientific validity of the pipelines? Thanks!

Contributor:

I want to trust this bullet point, I really do. But I also drift towards worst-case scenarios. I do recognize that the idea is to lower the bar for data to get processed and it's a tough balance between 'users want more data' and 'users want reliable data'. Maybe instead of just 'used', some specification of 'used in X high profile peer-reviewed studies'?

Contributor:

...and a lot of this ultimately comes down to how it is messaged to the data consumer

Contributing pipelines via Terra has a few benefits:
1. Ensures the pipeline can be run in the DCP pipeline infrastructure
2. Provides support for contributors developing and testing the pipelines
3. Makes the pipelines immediately available to the community, where users can run the pipelines on HCA data, in cases where they may be useful but cannot immediately be accepted to the DCP, for example if they haven't met an acceptance criterion, or aren't prioritized into production for a reason in section 3.3
Contributor:

Makes the pipelines immediately available to the community...of Terra users. Right?

Contributor (author):

If the pipeline is public in Terra it's accessible to anyone, and anyone could make an account if they wanted to clone and run it. If the person making the pipeline is savvy enough, they could submit it through Dockstore and then show it running in Terra. If you'd like, we could add that information to this process?

Member:

There was some contention among the authors on this point. An alternative proposal for contribution took the following form:

  1. The DCP team sources contributions by opening issues for specific assays. Community members can also open issues to discuss potential contributions that are not sought by the team or on the roadmap.
  2. Contribution occurs via PR to humancellatlas/skylab
  3. Contributions require users to write tests that match and pass skylab CI, which requires (i) construction of docker containers, (ii) specification of test inputs, and (iii) specification of expected results (a sketch of such a test follows below)

Some considerations:

  1. This requires contributors to understand the github issue>fork>pr open source contribution process, which may have both positive and negative impacts
  2. Users need not have access to google cloud resources, because they are provided by skylab CI.
  3. Contributors are rewarded for their efforts with commits recorded directly against the DCP pipelines code, motivating direct scientific collaboration on our pipelines product.
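
A hedged sketch of what requirement 3 could look like in practice, assuming a Cromwell-based test harness; the repository layout, file names, and checksum convention are invented for illustration and are not skylab's actual CI contract:

```python
# Hedged sketch: run a contributed WDL workflow on small checked-in test inputs
# and compare its outputs against pinned expectations. Paths and file names are
# hypothetical; only the Cromwell "run" invocation reflects a real CLI.
import hashlib
import json
import subprocess
from pathlib import Path

TEST_DIR = Path("pipelines/contributed_example/test")  # hypothetical layout

def sha256(path: Path) -> str:
    """Checksum an output file so expected results can be pinned in the repo."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def test_outputs_match_expected():
    # (ii) test inputs and (iii) expected results live next to the workflow
    expected = json.loads((TEST_DIR / "expected_checksums.json").read_text())

    # (i) the WDL tasks pin docker images, so the run is reproducible in CI
    subprocess.run(
        ["java", "-jar", "cromwell.jar", "run",
         str(TEST_DIR.parent / "pipeline.wdl"),
         "--inputs", str(TEST_DIR / "test_inputs.json"),
         "--metadata-output", "metadata.json"],
        check=True,
    )

    # Cromwell's run metadata maps workflow output names to produced file paths
    outputs = json.loads(Path("metadata.json").read_text())["outputs"]
    for name, path in outputs.items():
        assert sha256(Path(path)) == expected[name], f"{name} drifted"
```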

Contributor:

I really like the more targeted approach of "DCP...opening issues for specific assays". I think this also will invite more community input & discussion, which will result in fruitful scientific discussions around the pipelines, especially when 2+ are proposed, and ultimately community buy-in.

- No new data produced in 12 months
- Pipeline is one of several competing pipelines for an assay and consensus is reached that one of the other pipelines is preferred.
- A requirement for inclusion ceases to be met (see “Standards for consideration” section)
- Operational failure rate in a quarter surpasses 2%.
Contributor:

For reference, what's the current failure rate for DCP pipelines & what's the quarterly history since we started using them?

Contributor (author):

I can look into those numbers, and add them here.

One thing to keep in mind is that even if the failure rate is higher than 2%, with the standard pipelines we have made in the DCP we have a thorough understanding of the code, analysis, and data, so they are much easier to debug and update. Contributed pipelines, in comparison, we may only minimally understand, and we would have to do more time-consuming communication with the contributor to get them debugged.

Contributor:

A 2% failure rate is very high for any high volume pipeline, but acceptable for low volume pipelines. I would suggest tiers of acceptable failure rates based on the number of pipeline invocations and the length of time the pipelines have been running. With respect to the length of time, I would suggest a burn-in period during which a higher failure rate is acceptable. After that, the failure rate should be below 1% for all pipelines, and much lower for high volume pipelines.
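
To make the tiering concrete, a minimal sketch under assumed thresholds — the numbers and the 90-day burn-in are illustrative, not values agreed in this thread:

```python
# A minimal sketch of tiered failure-rate SLOs. Thresholds, the volume cutoff,
# and the burn-in length are illustrative assumptions only.

def acceptable_failure_rate(invocations_per_quarter: int, days_in_production: int) -> float:
    """Maximum acceptable quarterly failure rate for a pipeline."""
    if days_in_production < 90:           # burn-in: tolerate a higher rate
        return 0.05
    if invocations_per_quarter < 1_000:   # low volume: failures are rare, cheap to triage
        return 0.01                       # "below 1%" after burn-in
    return 0.001                          # high volume: failures add up quickly

def meets_slo(failures: int, invocations: int, days_in_production: int) -> bool:
    rate = failures / invocations if invocations else 0.0
    return rate <= acceptable_failure_rate(invocations, days_in_production)

# e.g. a year-old, high-volume pipeline failing 8 times in 10,000 runs (0.08%) passes...
assert meets_slo(failures=8, invocations=10_000, days_in_production=365)
# ...while 30 failures in 10,000 runs (0.3%) would trigger an off-boarding review
assert not meets_slo(failures=30, invocations=10_000, days_in_production=365)
```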

## Prerequisite work to enable pipeline contribution
The DCP needs to fulfill the following capabilities before we can begin to accept contributed pipelines. These systems could be very lightweight to start (e.g. we could manually email users failure logs).

- The DCP must distinguish between contributed pipeline data and releasable data
Contributor:

Again, I need some clarity here. I think this is touching on another significant point where confusion is easily introduced for our users, with no clear plan for how to avoid it: results from many pipelines available for one dataset.
If one set of data is run 6 different ways, how are those results presented to the user? Some users will just want 1 set of results and won't care about what software was used, so how straightforward will it be for them to grab the "preferred" results? Who determines what the "preferred" or "default" results are?

Contributor (author):

Great point! This same problem comes up with just having new versions of analyses in the DCP even with standard pipelines. We'll have to determine how we want to present that information to users as well.

I'd love to see us not spend too much time on having many pipelines for one data type. Possibly 2 but once we get to 3 I'd recommend we look into deprecating one of them.

Although in that situation, given we don't delete old analyses, we'll still have many analysis results around in the DCP, even when the pipelines are deprecated.

We would definitely ensure that the user experience of having results from contributed pipelines and results from multiple pipelines is thought through and addressed if this RFC is approved.

Releases of HCA data would not have more than one set of analysis results for each set of data.

Contributor:

If we have more than one analysis available for a dataset, we will need an interface that allows users to browse the alternative analyses. Once we have that, there is no reason not to keep multiple versions of the processing results. So we should either stick with 1 or allow many; I don't think keeping the numbers low helps.

Contributor:

@jahilton I appreciate your perspective here and in other comments reviewing this RFC, specifically highlighting the confusion that would be caused by having more than 1 pipeline for an assay. I am very much aligned with you and think there would need to be very dramatic value before such a scenario would be considered. The one scenario that was convincing to me was sunsetting a pipeline. Given a current pipeline in production that is going to be changed over to a drastically new pipeline, it may be helpful to run both pipelines so those using the data can confirm they are comfortable with the new pipeline. (Of course, we will also have reports to point to.) This may also allow the completion of processing of data for a snapshot or a distribution with the soon-to-be-legacy pipeline as the new pipeline is brought in.
This RFC keeps the possibility open, but if it would make you more comfortable, we can communicate that this is expected to be a rare event.

Contributor:

@TimothyTickle Addressing the specific case of sunsetting a pipeline in favor of a drastically different pipeline, I think the majority of users are going to expect the DCP to do the confirmation for them (the reports that you mentioned). If we expect some portion of the community to assist in the validation of a new pipeline, then that process should be explicitly laid out. But if the DCP is switching to a new pipeline, then I expect it will be for good reason, and we won't cling to the old pipeline.
I would expect the DCP to handle sunsetting a pipeline in favor of a new one much as it will handle sunsetting a pipeline version in favor of a new version of the same pipeline (and thus, there will be multiple sets of analysis products, but only one produced by a current pipeline). Results from all pipeline versions would be accessible, but only results from a single in-use pipeline would be 'preferred'. I recognize that this scenario is inevitable, and that the specifics of how to communicate results to the user have not been decided. I do not think this RFC is the venue to make those decisions.
So I am specifically concerned about data that has more than one in-use pipeline at a given time. I'd like the RFC either to specify that such a case will not occur, or to give more thought to how the DCP would handle that case through the users' eyes.

Contributor:

@jahilton We are very aligned here and I have no strong concerns in removing the possibility of multiple pipelines for an assay in rare events. Operating 1 pipeline for 1 assay is the way I envision the DCP working. Unless there are other strong reactions to this, happy to have that language removed. Pinging a couple of people historically involved in this conversation to see if they still have strong feelings here. @jonahcool @ambrosejcarr @mckinsel

## User Stories
- As a user of the DCP (both user archetypes, but researcher with a pipette especially), I am looking for all raw data in the HCA DCP to be analyzed. I trust the scientific community to write high quality pipelines.
- As a computational biologist or methodologist, I have a pipeline that I would like to see leveraged in the HCA DCP.
- As the HCA DCP, I want to develop pipelines iteratively based on user feedback. To accomplish this, I need new pipelines to be leveraged in the DCP as quickly as possible.
Contributor:

I took this a step further and connected it to the data user, and I think frequently-updating analysis results do not add value for most of them. This is one significant point where confusion can easily be introduced to the user, and there is no clear plan for how to handle results from many versions of a pipeline for 1 dataset.

Contributor (author):

I agree that even with multiple versions of the same HCA standard pipelines, we don't currently have great ways to show users the many versions of analysis results available in the DCP. I would imagine that the latest would always be shown in the browser, that a dropdown would let users filter to older versions of analysis results, that users could pin to specific analysis results by a collection manifest, and that releases could also help users pin to specific versions. But those are just ideas and it's not up to me :)
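
A sketch of how those ideas could look in data terms; the record shape and field names are hypothetical, not the DCP's actual model:

```python
# Sketch of the ideas above: show the latest result by default, and let users
# pin exact versions via a manifest. All names here are invented for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class AnalysisResult:
    dataset_id: str
    pipeline: str
    version: int
    uri: str

def latest_per_pipeline(results: list[AnalysisResult]) -> dict[tuple[str, str], AnalysisResult]:
    """Default browser view: the newest result for each (dataset, pipeline) pair;
    a dropdown could expose the older versions filtered out here."""
    best: dict[tuple[str, str], AnalysisResult] = {}
    for r in results:
        key = (r.dataset_id, r.pipeline)
        if key not in best or r.version > best[key].version:
            best[key] = r
    return best

def collection_manifest(view: dict[tuple[str, str], AnalysisResult]) -> dict[str, int]:
    """Pin the exact versions a user worked with, so a release or a later reader
    can reproduce the same view."""
    return {f"{d}/{p}": r.version for (d, p), r in view.items()}
```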

Contributor:

Part of the modeling is proposed in this RFC, but I think there's still an access solution missing. But your solution sounds beautifully reasonable to me!


@barkasn (Contributor) commented Sep 19, 2019

Please add a section on code quality standards. There should be a requirement for code to adhere to best practices for the language it is written in, including WDL, for which the required level of granularity for individual tasks (if any) should be specified.

This should also include code documentation standards.

@TimothyTickle self-assigned this Sep 26, 2019
`[Name](mailto:[email protected])`

## Motivation
As the DCP accepts data from assays, it takes on responsibility for eventually processing that data and returning analysis products to our contributors. The core mission of the HCA DCP is to process and make available the diverse data types comprising the reference atlas.
Contributor:

"returning analysis products to our contributors" -> mostly a nit but we have two sets of users, the contributors and the consumers. In this particular statement, you mention the consumers as being the original contributors but I think the value add is primarily for new data consumers, right?

This RFC discusses (1) technical and scientific standards to determine when pipelines contributed by community members are ready for inclusion in the DCP, (2) a potential contribution process, (3) the prioritization of such pipelines, and (4) off-boarding pipelines when they lose value or cease to fulfill standard requirements.

## User Stories
- As a user of the DCP (both user archetypes, but researcher with a pipette especially), I am looking for all raw data in the HCA DCP to be analyzed. I trust the scientific community to write high quality pipelines.
Contributor:

"both user archetype" -> Could you be more specific please?

"I trust the scientific community to write high quality pipelines" is, in my mind, a whole new user story around the expectations of quality by the user so I'd recommend to break it out to be more explicit.


## User Stories
- As a user of the DCP (both user archetypes, but researcher with a pipette especially), I am looking for all raw data in the HCA DCP to be analyzed. I trust the scientific community to write high quality pipelines.
- As a computational biologist or methodologist, I have a pipeline that I would like to see leveraged in the HCA DCP.
Contributor:

Sorry, what does it mean to "be leveraged"? Just looking for specificity here (i.e. they'd like more data than their own to be analyzed? they'd like their data to be used by other data consumers? etc.)

## User Stories
- As a user of the DCP (both user archetypes, but researcher with a pipette especially), I am looking for all raw data in the HCA DCP to be analyzed. I trust the scientific community to write high quality pipelines.
- As a computational biologist or methodologist, I have a pipeline that I would like to see leveraged in the HCA DCP.
- As the HCA DCP, I want to develop pipelines iteratively based on user feedback. To accomplish this, I need new pipelines to be leveraged in the DCP as quickly as possible.
Contributor:

Sorry to nitpick, but I'd like some clarity on what "leverage" means in this sentence as well. Is it just simply making it accessible? Does it involve other work?

- As the HCA DCP, I want to develop pipelines iteratively based on user feedback. To accomplish this, I need new pipelines to be leveraged in the DCP as quickly as possible.

## Definitions
**Assay:** A biological and technological process that transforms a biological specimen into interpretable raw data, ideally of a known and standardized format. Generates output which must adhere to a specific raw data format.
Contributor:

Are "standardized format" and "a specific raw data format" referring to the same thing here? I wasn't sure if there's a difference.

- Not have any restrictions on the use of pipeline output(s).
- Have multiple test datasets available for use by the Data Processing Service to evaluate the validity of the results and serve as a benchmark for future improvements
- Provide acceptable ranges and a method for validation of any required run-specific parameters (example: starfish)
- Provide pipeline outputs using standard formats (if such standards exist)
Contributor:

If "such standards" don't exist, do we still accept the pipeline?



Data produced by Contributed Pipelines will be clearly marked as non-release data, to distinguish it from data eligible for release that was generated by AWG-vetted pipelines. Not all DCP services will be available for non-release data.

These requirements, and in particular the maintenance requirements, should be clearly communicated to contributors, and the DCP should make an effort to verify that contributors understand them.
Contributor:

What's the method of communication?


## Operating pipelines
### Responding to Failures
Operating any DCP pipeline on relevant data may result in occasional pipeline failures. The pipelines team will use the following procedure to resolve these failures:
Contributor:

I'm a little nervous that the wording around this section implies that pipeline failures should be dealt with "as soon as possible." The reality is that as you grow the number of pipelines, you are going to get more failures, and each failure is going to take longer to debug because there is an increasing number of code paths that could have caused the issue.

We need some notion of SLOs that sets expectations on when a user might expect data processed by a broken pipeline (hopefully pretty generous) and that prioritizes AWG-vetted pipeline fixes above anything else. I'm not sure if there's a particular team that will be delegated to pipeline maintenance, but the eng team will likely have to juggle maintenance with other priorities in the DCP, which is something to be mindful of.
