Contributed Pipelines RFC #116

Open · wants to merge 3 commits into base: master
rfcs/text/0000-contributed-pipelines.md (161 additions, 0 deletions)

### DCP PR:

***Leave this blank until the RFC is approved** then the **Author(s)** must create a link between the assigned RFC number and this pull request in the format:*

`[dcp-community/rfc#](https://github.com/HumanCellAtlas/dcp-community/pull/<PR#>)`

# Contributed Pipelines
## Summary
We would like to provide a greater variety of analyzed data to users and engage with the scientific community by accepting community contributed pipelines into the HCA DCP. This RFC proposes technical and scientific requirements for these pipelines as well as draft guidelines for the contribution process.

## Author(s)
[Kylee Degatano](mailto:[email protected])

[Ambrose Carr](mailto:[email protected])

In partnership with Kathleen Tibbetts, Tim Tickle, Clare Bernard, Marcus Kinsella, and the DCP Data Pipelines team.

## Shepherd
***Leave this blank.** This role is assigned by DCP PM to guide the **Author(s)** through the RFC process.*

*Recommended format for Shepherds:*

`[Name](mailto:[email protected])`

## Motivation
As the DCP accepts data from assays, it takes on responsibility for eventually processing that data and returning analysis products to our contributors. The core mission of the HCA DCP is to process and make available the diverse data types comprising the reference atlas.

**Comment (Contributor):**

"returning analysis products to our contributors" -> mostly a nit but we have two sets of users, the contributors and the consumers. In this particular statement, you mention the consumers as being the original contributors but I think the value add is primarily for new data consumers, right?


This is a complex task best accomplished by learning from the scientific community. By rapidly inducting existing pipelines into the DCP, and then improving them based on user feedback and demand, we can quickly build capacity to process diverse data types. By focusing on adding breadth across data types, we provide a platform to advance computational and assay diversity in support of the HCA, improving its quality and accelerating its construction.

This RFC discusses (1) technical and scientific standards to determine when pipelines contributed by community members are ready for inclusion in the DCP, (2) a potential contribution process, (3) the prioritization of such pipelines, and (4) off-boarding pipelines when they lose value or cease to meet the standards for consideration.

## User Stories
- As a user of the DCP (both user archetypes, but researcher with a pipette especially), I am looking for all raw data in the HCA DCP to be analyzed. I trust the scientific community to write high quality pipelines.

**Comment (Contributor):**

This user story can be accomplished by taking contributed pipelines only for data types that don't have an existing pipeline available, which I am interpreting to be a much more limited scope than the current RFC.


**Comment (Contributor Author):**

The current RFC doesn't explicitly state that we won't accept pipelines for data types that already have other pipelines, but it does state that the choice to prioritize adding a contributed pipeline would be based on many factors including "Whether there is an existing pipeline in the HCA that serves the same data as the contributed pipeline."

If we had multiple pipelines contributed for the same data type we also stated that we could deprecate one of them because "Pipeline is one of several competing pipelines for an assay and consensus is reached that one of the other pipelines is preferred."

I agree that having many pipelines for one assay could be confusing to users if not done carefully. I personally would be comfortable with trying to limit it, at least at first, but I know that adds risks of choosing favorites, or declaring the pipeline that got there first the winner, which some of the folks who helped with this draft didn't love.


**Comment (Contributor):**

I agree that 'winner based on who got there first' should be avoided. I like Ambrose's idea below of "The DCP team sources contributions by opening issues for specific assays". This may mitigate that concern by 'opening a call', setting deadlines, and allowing valuable scientific discussions weighing different pipelines.
So I do find this an incredibly compelling User Story (and far more compelling than the other two), but I envisioned a less risky proposal with tighter bounds that still fulfills this User Story.


**Comment (Contributor):**

"both user archetype" -> Could you be more specific please?

"I trust the scientific community to write high quality pipelines" is, in my mind, a whole new user story around the expectations of quality by the user so I'd recommend to break it out to be more explicit.

- As a computational biologist or methodologist, I have a pipeline that I would like to see leveraged in the HCA DCP.

**Comment (Contributor):**

Sorry, what does it mean to "be leveraged"? Just looking for some specificity here (i.e. they'd like more data than their own to be analyzed? they'd like their data to be used by other data consumers? etc.)

- As the HCA DCP, I want to develop pipelines iteratively based on user feedback. To accomplish this, I need new pipelines to be leveraged in the DCP as quickly as possible.

**Comment (Contributor):**

I took this a step further and connected it to the data user, and I think frequently-updating analysis results do not add value for most of them. This is one significant point where confusion can easily be introduced to the user, and there is no clear plan for how to handle results from many versions of a pipeline for 1 dataset.


**Comment (Contributor Author):**

I agree that even with multiple versions of the same HCA standard pipelines we don't currently have great ways to demonstrate that users have access to many versions of analysis results in the DCP. I would imagine that the latest would always be shown in the browser, that a dropdown would let users filter to older versions of analysis results, that users could pin to specific analysis results by a collection manifest, and that releases could also help users pin to specific versions. But those are just ideas and it's not up to me :)


**Comment (Contributor):**

Part of the modeling is proposed in this RFC, but I think there's still an access solution missing. But your solution sounds beautifully reasonable to me!


**Comment (Contributor):**

Sorry to nitpick, but I'd like some clarity on what "leverage" means in this sentence as well. Is it just simply making it accessible? Does it involve other work?


## Definitions
**Assay:** A biological and technological process that transforms a biological specimen into interpretable raw data, ideally of a known and standardized format. Generates output which must adhere to a specific raw data format.

**Comment (Contributor):**

Are "standardized format" and "a specific raw data format" referring to the same thing here? I wasn't sure if there's a difference.


**Pipeline:** A collection of one or more functional tasks that operate on input data and, from that, transform the input data, or derive features often used to interpret the input data. In a high throughput setting, these tasks are often automated to be performed in a batch setting.

## Detailed Design
### Criteria for Consideration
#### Technical standards for consideration
The DCP commits to processing included assay types with high quality pipelines that can operate at scale. Thus, there are some basic technical standards that a pipeline must deliver to be eligible for the DCP. A pipeline must:

**Comment (Contributor):**

@barkasn made a general comment about this as well but I'd like to understand better what "high quality" means. There are mentions of testing in the bullet points below but are we requiring a minimum test coverage for example? How many datasets should be available for use? How are "acceptable ranges and a method of validation" determined?


- Be open source (with MIT, ISC, Simplified BSD, or Apache 2.0 license), and on GitHub or similar code versioning and sharing platform.
- Be under active development (e.g. recent commits/releases, responsiveness to bug reports, known pipeline maintainer(s)), with docs, bug fixes, and testing.
- Not have any restrictions on the use of pipeline output(s).
- Have multiple test datasets available for use by the Data Processing Service to evaluate the validity of the results and serve as a benchmark for future improvements.
- Provide acceptable ranges and a method for validation of any required run-specific parameters (example: starfish)
- Provide pipeline outputs using standard formats (if such standards exist)

**Comment (Contributor):**

If "such standards" don't exist, do we still accept the pipeline?

- Utilize public, open source tools.

**Comment (Contributor):**

What does this requirement entail?


Data produced by Contributed Pipelines will be clearly marked as non-release data, to distinguish it from data eligible for release that was generated by AWG-vetted pipelines. Not all DCP services will be available for non-release data.

**Comment (Contributor):**

Can you clarify this? The way I read it is that the DCP will take pipelines but never release any results that were generated by them.


**Comment (Contributor Author):**

You've interpreted this correctly, that the analysis results would be available in the DCP but not in any HCA releases.

Do you have a suggestion on how to write it more clearly? Thanks!


**Comment (Contributor):**

I think, based on the definitions in the DataOps charter (not approved yet, but we're trying to set some terms), this is intended to be read something like...

Data produced by Contributed Pipelines will be accessible to the community, but will be clearly distinguished from data produced by AWG-vetted pipelines and will not be eligible for inclusion in HCA Snapshots or other data distributions. There may be other DCP services that also include only data from AWG-vetted pipelines.


**Comment (Contributor):**

I think I'm still confused about the "clearly marked" portion of this sentence. Where is the marking ideally located?


**Comment (Contributor):**

Also, we should probably be clear on which DCP services will have this data from non-AWG-vetted pipelines to the user... how would we best designate that?


These requirements, and in particular the maintenance requirements, should be clearly communicated to contributors, and the DCP should make an effort to verify that the contributors understand.

**Comment (Contributor):**

What's the method of communication?


#### Scientific standards for consideration
Although the characteristics of contributed pipelines are not as well understood as those of standard HCA pipelines, we should take steps to avoid installing a pipeline into production that could produce misleading or bad scientific data. To accomplish this, contributed pipelines should demonstrate that they have been vetted by members of the scientific community, in addition to meeting the technical requirements for consideration. Contributed pipelines must be shown to produce meaningful scientific results, a requirement that is met by any of the following:
- Has produced data that is used extensively in analysis found in a published, peer-reviewed manuscript.
- Produces data that is shown to replicate analysis found in a published, peer-reviewed paper.
- Is a known pipeline to an AWG member who is willing to vouch for the pipeline.

**Comment (Contributor):**

This puts the statement above about avoiding misleading/bad scientific data at great risk. If that is the goal, then requirements should be purely scientific to reflect that.

- Is a pipeline that is used by 3 or more experts in the field, all of whom confirm they have used it to successfully analyze data and can point to that analysis.

**Comment (Contributor):**

Again, this is drifting from scientific validation.


**Comment (Contributor Author):**

In this proposal, we don't suggest a requirement for scientific validation per se, or at least not in the same way we perform validations on the pipelines team. We tried to identify the ways science is validated in the scientific community and use those as reasonable markers of scientific validity for this process.

In your experience, do you think this and the bullet above wouldn't be sufficient to trust the scientific validity of a pipeline? Do you have ideas on other requirements we could add that could help us trust the scientific validity of the pipelines? Thanks!


**Comment (Contributor):**

I want to trust this bullet point, I really do. But I also drift towards worst-case scenarios. I do recognize that the idea is to lower the bar for data to get processed and it's a tough balance between 'users want more data' and 'users want reliable data'. Maybe instead of just 'used', some specification of 'used in X high profile peer-reviewed studies'?


**Comment (Contributor):**

...and a lot of this ultimately comes down to how it is messaged to the data consumer

- Is actively being used in a scientific consortium and has the endorsement of their Analysis Working Group or equivalent scientific leadership.

### Contribution process
#### Draft of user-facing contribution guidelines
To contribute a pipeline, you can create a workspace in Terra, a cloud platform for batch and interactive analysis. The creation of this workspace should provide as much information as possible to enable the pipelines team to hook the pipeline up to the DCP. To contribute, you will need to follow the steps outlined below. If you have questions about the contribution process, please contact [email protected].

**Comment (Contributor):**

The user story "As a computational biologist or methodologist, I have a pipeline that I would like to see leveraged in the HCA DCP" reaches significantly fewer users when the scope is narrowed to Terra users. Terra is also not mentioned in the Technical standards section, but it seems like it should be.


**Comment (Contributor Author):**

Thanks for your feedback. We debated this greatly when writing the RFC and concluded that the simplest way to support contribution to the DCP as soon as possible would be to confirm the pipeline can run in the DCP by showing it running in Terra. There are a lot of pros and cons to this choice and we're open to revisiting it/expanding the contribution process options if this program is implemented and successful. We're generally hesitant / cautious surrounding this entire contributed pipelines idea.

1. Write a WDL 1.0 workflow that encapsulates the pipeline (a minimal sketch appears after this list). The tasks of this pipeline will need to be containerized in order to run in the cloud and in Terra. The containers must be public to be accepted into the HCA.
2. Upload the WDL to the public Terra tools repository, with configurations, data, and descriptions for each mode the pipeline can be run in, and import the tool into a workspace. Alternatively, put the tool into Dockstore and then link it to Terra.
3. Upload small testing data, necessary references, and any benchmarking datasets to the workspace. Ensure these data are eligible for public, open use.
4. Write a markdown-formatted workspace description that summarizes:
    1. the data being analyzed,
    2. the way the input data is generated,
    3. the computational stages of the pipeline,
    4. the output data,
    5. and how the pipeline meets the scientific and technical contribution standards.
5. Run the pipeline in the workspace, in each mode, with at least the test data, so that the outputs can be verified. If there are data that work with your pipeline in the HCA DCP, demonstrate analyzing this data with your pipeline in the workspace.
6. Using another WDL tool, write a checker test that verifies that the outputs of the pipeline meet the technical and scientific expectations for each run mode of the pipeline (a minimal checker sketch also appears after this list).
7. Share the workspace with write access to [email protected] for internal review by the DCP pipelines team. The workspace will then be announced to the DCP and HCA community for review.
    1. During review, we may request instructions and code to read the output file into a common scientific computing language as a sparse or dense array.
8. Respond to the community and DCP concerns, and update the submission as requested.
9. When the pipeline has met the criteria for consideration and has been approved by the DCP and community, the DCP pipelines team will tag the workspace "HCA-contributed-pipeline". The DCP will then determine when it can be prioritized to be pulled into production, and will communicate the timeline and expectations with the contributor and community (see Section 3.3, Prioritizing Contributions, below).
10. Running pipelines in the cloud and in Terra requires a Google Cloud billing project. If you are interested in contributing but have difficulty obtaining a Google Cloud billing project, please reach out via the email above.
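
Below is a minimal sketch of what steps 1 and 6 could look like. Everything in it is hypothetical: the workflow name `ContributedPipeline`, the container image, the `run_analysis` command line, and the `detection_threshold` parameter and its range are placeholders for illustration, not an actual HCA or DCP pipeline. The parameter comment also shows one place where the acceptable range required by the technical standards could be documented.

```wdl
version 1.0

# Hypothetical contributed pipeline (step 1). The task is containerized via
# the `docker` runtime attribute; the container must be publicly accessible.
workflow ContributedPipeline {
  input {
    File raw_data
    File reference
    # Run-specific parameter. Per the technical standards, an acceptable
    # range should be documented; here we assume 0 < detection_threshold <= 1.
    Float detection_threshold = 0.5
  }

  call Analyze {
    input:
      raw_data = raw_data,
      reference = reference,
      detection_threshold = detection_threshold
  }

  output {
    File count_matrix = Analyze.count_matrix
  }
}

task Analyze {
  input {
    File raw_data
    File reference
    Float detection_threshold
  }

  command <<<
    set -euo pipefail
    # Placeholder for the pipeline's real command line.
    run_analysis \
      --input ~{raw_data} \
      --reference ~{reference} \
      --threshold ~{detection_threshold} \
      --out counts.mtx
  >>>

  runtime {
    # Placeholder public image.
    docker: "quay.io/example-lab/example-pipeline:1.0.0"
    cpu: 4
    memory: "16 GB"
  }

  output {
    File count_matrix = "counts.mtx"
  }
}
```

A checker (step 6) could then be a second WDL that imports the pipeline, runs it on the small test data, and fails if the output drifts from the expected result. The byte-for-byte comparison below is the simplest possible check; a real checker would verify whatever technical and scientific expectations are appropriate for each run mode (e.g. tolerances on metrics rather than exact equality).

```wdl
version 1.0

import "ContributedPipeline.wdl" as pipeline

# Hypothetical checker workflow: run the pipeline on test data and compare
# its output against an expected result shipped with the workspace.
workflow ContributedPipelineChecker {
  input {
    File test_raw_data
    File test_reference
    File expected_count_matrix
  }

  call pipeline.ContributedPipeline as test_run {
    input:
      raw_data = test_raw_data,
      reference = test_reference
  }

  call CompareOutputs {
    input:
      produced = test_run.count_matrix,
      expected = expected_count_matrix
  }
}

task CompareOutputs {
  input {
    File produced
    File expected
  }

  command <<<
    set -euo pipefail
    # Simplest possible check: byte-for-byte equality with the expected output.
    diff ~{produced} ~{expected}
  >>>

  runtime {
    docker: "ubuntu:20.04"
  }

  output {
    Boolean passed = true
  }
}
```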

#### Why Terra?
Contributing pipelines via Terra has a few benefits:
1. Ensures the pipeline can be run in the DCP pipeline infrastructure
2. Provides support for contributors developing and testing the pipelines
3. Makes the pipelines immediately available to the community, where users can run them on HCA data, even in cases where they are useful but cannot yet be accepted into the DCP, for example because they haven't met an acceptance criterion or aren't prioritized into production for one of the reasons in section 3.3.

**Comment (Contributor):**

Makes the pipelines immediately available to the community...of Terra users. Right?


**Comment (Contributor Author):**

If the pipeline is public in Terra it's accessible to anyone, and anyone could make an account if they wanted to clone and run it. If the person making the pipeline is savvy enough, they could submit it through Dockstore and then show it running in Terra. If you'd like, we could add that information to this process?


**Comment (Member):**

There was some contention among the authors on this point. An alternative proposal for contribution took the following form:

  1. The DCP team sources contributions by opening issues for specific assays. Community members can also open issues to discuss potential contributions that are not sought by the team or on the roadmap.
  2. Contribution occurs via PR to humancellatlas/skylab
  3. Contributions require users to write tests that match and pass skylab CI, which requires (i) construction of docker containers, (ii) specification of test inputs, and (iii) specification of expected results

Some considerations:

  1. This requires contributors to understand the github issue>fork>pr open source contribution process, which may have both positive and negative impacts
  2. Users need not have access to google cloud resources, because they are provided by skylab CI.
  3. Contributors are rewarded for their efforts with commits recorded directly against the DCP pipelines code, motivating direct scientific collaboration on our pipelines product.


**Comment (Contributor):**

I really like the more targeted approach of "DCP...opening issues for specific assays". I think this also will invite more community input & discussion, which will result in fruitful scientific discussions around the pipelines, especially when 2+ are proposed, and ultimately community buy-in.

4. Enables the contributor to test the pipeline on HCA data and demonstrate functionality and performance

The guidelines drafted above describe the contribution process as it would be presented to users. When we publish them officially, we will include links to tutorials and more background on the contribution requirements outlined in this document.

### Prioritizing contributions
It will take some effort on the part of the DCP's engineers and computational biologists to assist external developers in adapting their pipelines for use in the DCP. As a result, the order of incorporation for contributed pipelines will consider factors like:

- The amount of data in the DCP for that assay.
- The rate at which new data for that assay is being added, based on the HCA Data Roadmap.
- Relative difficulty of adapting the pipeline to run in the DCP.
- Value to the community in exposing this data to users (can it be made available in a different service?)
- Risk associated with inaction (will we lose the ability to include data in the reference atlas?)
- The amount of external developer bandwidth to support pipeline incorporation.
- HCA Community feedback (for example, polls of the HCA community that ask them to rank the 5 assays they think will be most important in the coming year).
- Whether there is an existing pipeline in the HCA that serves the same data as the contributed pipeline.
- Support for this data type in HCA metadata.
- Inclusion of the pipeline in the HCA Scientific Roadmap or communication from HCA scientific stakeholders that the assay is a priority.

When a pipeline is prioritized for incorporation into the DCP, the contributor will be contacted and reminded of the requirements outlined here prior to running in production. The pipeline will be documented on the HCA Data Portal, where the author will be cited for their contribution. Additionally, analysis data produced by the pipeline will be labeled in the HCA analysis metadata with, at minimum, the pipeline author and contact info and a permanent link to the pipeline reference workspace.

## Operating pipelines
### Responding to Failures
Operating any DCP pipeline on relevant data may result in occasional pipeline failures. The pipelines team will use the following procedure to resolve these failures:

**Comment (Contributor):**

I'm a little nervous that the wording around this section implies that pipeline failures should be dealt with "as soon as possible." The reality is that as you grow the number of pipelines, you are going to get more failures, and each failure is going to take longer to debug because there is an increasing number of code paths that could have caused the issue.

We need some notion of SLOs that sets expectations on when a user might be able to expect data processed by a broken pipeline (that is hopefully pretty generous) and prioritize AWG-vetted pipeline fixes above anything else. I'm not sure if there's a particular team that will be delegated to pipeline maintenance but the eng team will likely have to juggle maintenance with other priorities in the DCP which is something to be mindful of.

1. The pipelines team will take action to debug the workflow, determining if the workflow has failed due to the infrastructure, the input data, or a bug in the pipeline. The pipeline will be restarted in production as soon as possible.
    1. An appropriate timebox for debugging issues in the pipeline or inputs will be established.
2. Should the team be unable to debug the failures, the team will reach out to the contributor via email to assist in debugging.
3. The contributor will be expected to work with the team to debug and resolve the failure.
4. If the pipeline is failing suddenly and/or regularly (>2% of workflows/quarter failing) due to qualities of the pipeline itself, the pipeline will be paused in production until the issue is resolved or decommissioning is considered.

### Decommissioning Pipelines
Supporting pipelines that no longer provide value to the DCP represents an unnecessary cost for the Data Processing Service. The following events can trigger an evaluation of whether a pipeline should be decommissioned:

- No new data produced in 12 months
- Pipeline is one of several competing pipelines for an assay and consensus is reached that one of the other pipelines is preferred.
- A requirement for inclusion ceases to be met (see “Standards for consideration” section)
- Operational failure rate in a quarter surpasses 2%.

**Comment (Contributor):**

For reference, what's the current failure rate for DCP pipelines & what's the quarterly history since we started using them?


**Comment (Contributor Author):**

I can look into those numbers, and add them here.

One thing to keep in mind is that even if the failure rate is higher than 2%, with the standard pipelines we have made in the DCP we have a thorough understanding of the code, analysis, and data, so it is much easier to debug and update. In comparison, we may only minimally understand contributed pipelines and may have to do more time-consuming communication with the contributor to get them debugged.


**Comment (Contributor):**

2% failure rate is very high for any high volume pipeline, but acceptable for low volume pipelines. I would suggest tiers for acceptable failure rates based on the number of pipeline invocations and length of time they have been running for. With respect to the length of time, I would suggest a burn-in period during which a higher fail rate is acceptable. After that failure rate should be below 1% for all pipelines and much lower for high volume pipelines.

- Contributor is not responsive to requests for help debugging operational failures.
- A standard pipeline is instantiated and supports the same use cases.
- A contributor asks for the pipeline to be decommissioned.

The DCP reserves the right to decline to update a pipeline when a new version is released (e.g. if the update is not a priority, or the update does not support a current DCP user need).

## Prerequisite work to enable pipeline contribution
The DCP needs to fulfill the following capabilities before we can begin to accept contributed pipelines. These systems could be very lightweight to start (e.g. we could manually email users failure logs).

- The DCP must distinguish between contributed pipeline data and releasable data

**Comment (Contributor):**

Again, I need some clarity here. Though I think this is touching on another significant point where confusion is easily introduced to our users and there's no clear plan for how to avoid confusion: results from many pipelines available for one dataset.
If one set of data is run 6 different ways, how are those results presented to the user? Some users will just want 1 set of results and won't care about what software was used, so how straightforward will it be for them to grab the "preferred" results? Who determines what are "preferred" or "default" results?


**Comment (Contributor Author):**

Great point! This same problem comes up with just having new versions of analyses in the DCP even with standard pipelines. We'll have to determine how we want to present that information to users as well.

I'd love to see us not spend too much time on having many pipelines for one data type. Possibly 2 but once we get to 3 I'd recommend we look into deprecating one of them.

Although in that situation, given we don't delete old analyses, we'll still have many analysis results around in the DCP, even when the pipelines are deprecated.

We would definitely ensure that the user experience of having results from contributed pipelines and results from multiple pipelines is thought through and addressed if this RFC is approved.

Releases of HCA data would not have more than one set of analysis results for each set of data.


**Comment (Contributor):**

If we have more than one analysis available for a dataset we will need an interface that allows users to browse the alternative analyses. Once we have that there is no reason not to keep multiple versions of the processing results. So we should either stick with 1 or allow many, I don't think keeping the numbers low helps.


**Comment (Contributor):**

@jahilton I appreciate your perspective here in this and other comments reviewing this RFC, specifically highlighting the confusion that would be caused by having more than 1 pipeline for an assay. I am very much aligned with you and think there would need to be very dramatic value before such a scenario would be considered. The one scenario that was convincing to me was sunsetting a pipeline. Given a current pipeline in production that is going to be changed over to a drastically new pipeline, it may be helpful to run both pipelines so those using the data can confirm they are comfortable with the new pipeline. (Of course, we will also have reports to point to as well.) This may also allow the completion of processing of data for a snapshot or a distribution with the soon-to-be-legacy pipeline as the new pipeline is brought in. This RFC keeps the possibility open, but if it would make you more comfortable, we can communicate that this is expected to be a rare event.


**Comment (Contributor):**

@TimothyTickle Addressing the specific case of sunsetting a pipeline in favor of a drastically different pipeline, I think the majority of users are going to expect the DCP to do the confirmation for them (the reports that you mentioned). If we expect some portion of the community to assist in the validation of a new pipeline, then that process should be explicitly laid out. But if the DCP is switching to a new pipeline, then I expect it will be for good reason, and we won't cling on to the old pipeline.
I would expect the DCP to handle sunsetting a pipeline in favor of a new one similar to how the DCP will handle sunsetting a pipeline version in favor of a new version of the same pipeline (and thus, there will be multiple sets of analysis products, but only one produced by a current pipeline). Results from all pipeline versions would be accessible, but only results from a single in-use pipeline would be 'preferred'. I recognize that this scenario is inevitable, and that the specifics of how to communicate results to the user have not been decided. I do not think this RFC is the venue to make those decisions.
So I am specifically concerned about data that has more than one in-use pipeline at a given time. Either specify that such a case will not occur, or give more thought to how the DCP would handle that case through the users' eyes.


**Comment (Contributor):**

@jahilton We are very aligned here and I have no strong concerns about removing the possibility of multiple pipelines for an assay in rare events. Operating 1 pipeline for 1 assay is the way I envision the DCP working. Unless there are other strong reactions to this, I'm happy to have that language removed. Pinging a couple of people historically involved in this conversation to see if they still have strong feelings here. @jonahcool @ambrosejcarr @mckinsel

- The DCP must confirm that users understand that the DCP has not validated contributed pipelines or the resulting analysis data.
- Create a system to provide pipeline failure logs to pipeline contributors to make them aware of failures and enable them to debug the problems.

## Ongoing DCP work needed to support pipeline contribution
The following deliverables may be needed to support integrating contributed pipelines into the DCP.
- Translate the description of the data that should be analyzed into an appropriate query.
- Connect pipeline to HCA infrastructure to run in production and confirm that the pipeline runs as expected.
- Confirm that a completed pipeline execution contains a record of pipeline provenance.
- Communicate with contributors in a timely manner about operational troubleshooting.
- Create documentation describing pipelines.
- Review contribution workspaces.
- Ensure metadata in the HCA describes this data sufficiently.
- Ensure the outputs of these data can be served to users by the matrix service, data portal, DCP CLI, and other DCP services as appropriate.
- Ensure ingest can validate the input data.
- Ensure the data store can support the data.
- Decommission pipelines quarterly as needed.

## Productionizing pipelines
This document describes the characteristics that a pipeline must meet to be considered eligible for inclusion in the DCP. Contributed Pipelines may be further developed into "Standard Pipelines", which are engineered for efficiency by DCP engineers and vetted for scientific excellence with the Analysis Working Group. The pipelines team and AWG are responsible for Standard Pipelines. The contributor will be cited for the pipeline and consulted for feedback on the benchmarking. The contributor will no longer be responsible for debugging failures.

### Unresolved Questions
- Are any other DCP components concerned with this proposal / have ideas on how to protect themselves from undue operational burden?
- As we implement this process, we expect to iterate on it based on user and DCP feedback