Contributed Pipelines RFC #116

Open · wants to merge 3 commits into master

Conversation

@kbergin (Contributor) commented Sep 17, 2019

This RFC outlines a new process for pipelines coming into the DCP. It includes proposals for technical and scientific requirements of those pipelines and a draft of user facing contribution guidelines.

Status: Community Review
Last call for community review: October 8th

cc @ambrosejcarr @TimothyTickle

@jahilton (Contributor) left a comment

This sounds very much in line with Aim 1 from the first draft of the DCP Strategy, but the PLs revised that to a Strategy that drastically de-emphasizes contributed pipelines. In general, this is a LOT of work that adds value for a very limited number of users, and without clear plans laid out to prevent confusion and uncertainty, it actually risks reducing value for many users.

This RFC discusses (1) technical and scientific standards to determine when pipelines contributed by community members are ready for inclusion in the DCP, (2) a potential contribution process, (3) the prioritization of such pipelines, and (4) off-boarding pipelines when they lose value or cease to fulfill standard requirements.

## User Stories
- As a user of the DCP (both user archetypes, but researcher with a pipette especially), I am looking for all raw data in the HCA DCP to be analyzed. I trust the scientific community to write high quality pipelines.
Contributor:

This user story can be accomplished by taking contributed pipelines only for data types that don't have an existing pipeline available, which I am interpreting to be a much more limited scope than the current RFC.

Contributor (author):

The current RFC doesn't explicitly state that we won't accept pipelines for data types that already have other pipelines, but it does state that the choice to prioritize adding a contributed pipeline would be based on many factors including "Whether there is an existing pipeline in the HCA that serves the same data as the contributed pipeline."

If we had multiple pipelines contributed for the same data type we also stated that we could deprecate one of them because "Pipeline is one of several competing pipelines for an assay and consensus is reached that one of the other pipelines is preferred."

I agree that having many pipelines for one assay could be confusing to users if not done carefully. I personally would be comfortable with trying to limit it, at least at first, but I know that adds the risk of choosing favorites, or of declaring the pipeline that got there first the winner, an idea that some folks who helped with this draft didn't love.

Contributor:

I agree that 'winner based on who got there first' should be avoided. I like Ambrose's idea below of "The DCP team sources contributions by opening issues for specific assays". This may mitigate that concern by 'opening a call', setting deadlines, and allowing valuable scientific discussions weighing different pipelines.
So I do find this an incredibly compelling User Story (and far more compelling than the other two), but I envisioned a less risky proposal with tighter bounds that still fulfills this User Story.

- Provide pipeline outputs using standard formats (if such standards exist)
- Utilize public, open source tools.

Data produced by Contributed Pipelines will be clearly marked as non-release data, to distinguish it from data eligible for release that was generated by AWG-vetted pipelines. Not all DCP services will be available for non-release data.
Contributor:

Can you clarify this? The way I read it is that the DCP will take pipelines but never release any results that were generated by them.

Contributor (author):

You've interpreted this correctly, that the analysis results would be available in the DCP but not in any HCA releases.

Do you have a suggestion on how to write it more clearly? Thanks!

Contributor:

I think, based on the definitions in the DataOps charter (not approved yet, but we're trying to set some terms), this is intended to be read something like...

Data produced by Contributed Pipelines will be accessible to the community, but will be clearly distinguished from data produced by AWG-vetted pipelines and will not be eligible for inclusion in HCA Snapshots or other data distributions. There may be other DCP services that also include only data from AWG-vetted pipelines.

Contributor:

I think I'm still confused about the "clearly marked" portion of this sentence. Where is the marking ideally located?

Contributor:

Also, we should probably be clear on which DCP services will surface this data from non-AWG-vetted pipelines to the user... how would we best designate that?
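
One hypothetical shape such a designation could take — the field names are invented for illustration, since the RFC leaves the exact location of the marking open:

```python
# Hypothetical metadata record for one analysis result. This only sketches the
# idea that a provenance flag travels with the data and that downstream DCP
# services filter on it; none of these field names come from the RFC.
RESULT_METADATA = {
    "analysis_id": "example-analysis",                  # illustrative identifier
    "pipeline_name": "contributed/example-pipeline",    # hypothetical name
    "pipeline_provenance": "contributed",               # vs. "awg_vetted"
    "release_eligible": False,                          # excluded from HCA releases
}

def include_in_release(record: dict) -> bool:
    """A release-building service would keep only AWG-vetted, eligible results."""
    return (record.get("pipeline_provenance") == "awg_vetted"
            and record.get("release_eligible", False))
```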

Although the characteristics of contributed pipelines are not as well understood as those of standard HCA pipelines, we should take steps to avoid installing a pipeline into production that could produce misleading or bad scientific data. To accomplish this, contributed pipelines should demonstrate that they have been vetted by members of the scientific community, in addition to meeting the technical requirements for consideration. Contributed pipelines must be shown to produce meaningful scientific results, a requirement that is met by any of the following:
- Has produced data that is used extensively in analysis found in a published, peer-reviewed manuscript.
- Produces data that is shown to replicate analysis found in a published, peer-reviewed paper.
- Is a known pipeline to an AWG member who is willing to vouch for the pipeline.
Contributor:

This puts the statement above about avoiding misleading/bad scientific data at great risk. If that is the goal, then requirements should be purely scientific to reflect that.

- Has produced data that is used extensively in analysis found in a published, peer-reviewed manuscript.
- Produces data that is shown to replicate analysis found in a published, peer-reviewed paper.
- Is a known pipeline to an AWG member who is willing to vouch for the pipeline.
- Is a pipeline that is used by 3 or more experts in the field, all of whom confirm they have used it to successfully analyze data and can point to that analysis.
Contributor:

Again, this is drifting from scientific validation.

Contributor (author):

In this proposal, we don't suggest a requirement for scientific validation per se, or at least not in the same way we do validations on the pipelines team. We tried to identify ways that science is validated in the scientific community as reasonable markers of scientific validation for this process.

In your experience, do you think this and the bullet above wouldn't be sufficient to trust the scientific validity of a pipeline? Do you have ideas on other requirements we could add that could help us trust the scientific validity of the pipelines? Thanks!

Contributor:

I want to trust this bullet point, I really do. But I also drift towards worst-case scenarios. I do recognize that the idea is to lower the bar for data to get processed and it's a tough balance between 'users want more data' and 'users want reliable data'. Maybe instead of just 'used', some specification of 'used in X high profile peer-reviewed studies'?

Contributor:

...and a lot of this ultimately comes down to how it is messaged to the data consumer

Contributing pipelines via Terra has a few benefits:
1. Ensures the pipeline can be run in the DCP pipeline infrastructure
2. Provides support for contributors developing and testing the pipelines
3. Makes the pipelines immediately available to the community, where users can run the pipelines on HCA data, in cases where they may be useful but cannot immediately be accepted to the DCP, for example if they haven't met an acceptance criterion, or aren't prioritized into production for a reason in section 3.3
Contributor:

Makes the pipelines immediately available to the community...of Terra users. Right?

Contributor (author):

If the pipeline is public in Terra it's accessible to anyone, and anyone could make an account if they wanted to clone and run it. If the person making the pipeline is savvy enough, they could submit it through Dockstore and then show it running in Terra. If you'd like, we could add that information to this process?

Member:

There was some contention among the authors on this point. An alternative proposal for contribution took the following form:

  1. The DCP team sources contributions by opening issues for specific assays. Community members can also open issues to discuss potential contributions that are not sought by the team or on the roadmap.
  2. Contribution occurs via PR to humancellatlas/skylab
  3. Contributions require users to write tests that match and pass skylab CI, which requires (i) construction of docker containers, (ii) specification of test inputs, and (iii) specification of expected results (a sketch of such a test follows below)

Some considerations:

  1. This requires contributors to understand the github issue>fork>pr open source contribution process, which may have both positive and negative impacts
  2. Users need not have access to google cloud resources, because they are provided by skylab CI.
  3. Contributors are rewarded for their efforts with commits recorded directly against the DCP pipelines code, motivating direct scientific collaboration on our pipelines product.
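
A hedged sketch of what requirement 3 could look like in practice, assuming a Cromwell-based test harness; the repository layout, file names, and checksum convention are invented for illustration and are not skylab's actual CI contract:

```python
# Hedged sketch: run a contributed WDL workflow on small checked-in test inputs
# and compare its outputs against pinned expectations. Paths and file names are
# hypothetical; only the Cromwell "run" invocation reflects a real CLI.
import hashlib
import json
import subprocess
from pathlib import Path

TEST_DIR = Path("pipelines/contributed_example/test")  # hypothetical layout

def sha256(path: Path) -> str:
    """Checksum an output file so expected results can be pinned in the repo."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def test_outputs_match_expected():
    # (ii) test inputs and (iii) expected results live next to the workflow
    expected = json.loads((TEST_DIR / "expected_checksums.json").read_text())

    # (i) the WDL tasks pin docker images, so the run is reproducible in CI
    subprocess.run(
        ["java", "-jar", "cromwell.jar", "run",
         str(TEST_DIR.parent / "pipeline.wdl"),
         "--inputs", str(TEST_DIR / "test_inputs.json"),
         "--metadata-output", "metadata.json"],
        check=True,
    )

    # Cromwell's run metadata maps workflow output names to produced file paths
    outputs = json.loads(Path("metadata.json").read_text())["outputs"]
    for name, path in outputs.items():
        assert sha256(Path(path)) == expected[name], f"{name} drifted"
```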

Contributor:

I really like the more targeted approach of "DCP...opening issues for specific assays". I think this also will invite more community input & discussion, which will result in fruitful scientific discussions around the pipelines, especially when 2+ are proposed, and ultimately community buy-in.

- No new data produced in 12 months
- Pipeline is one of several competing pipelines for an assay and consensus is reached that one of the other pipelines is preferred.
- A requirement for inclusion ceases to be met (see “Standards for consideration” section)
- Operational failure rate in a quarter surpasses 2%.
Contributor:

For reference, what's the current failure rate for DCP pipelines & what's the quarterly history since we started using them?

Contributor (author):

I can look into those numbers, and add them here.

One thing to keep in mind is that even if the failure rate is higher than 2%, with the standard pipelines we have made in the DCP we have a thorough understanding of the code, analysis, and data, so they are much easier to debug and update. Contributed pipelines, in comparison, we may only minimally understand, and we would have to do more time-consuming communication with the contributor to get them debugged.

Contributor:

A 2% failure rate is very high for any high volume pipeline, but acceptable for low volume pipelines. I would suggest tiers of acceptable failure rates based on the number of pipeline invocations and the length of time the pipelines have been running. With respect to the length of time, I would suggest a burn-in period during which a higher failure rate is acceptable. After that, the failure rate should be below 1% for all pipelines, and much lower for high volume pipelines.
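
To make the tiering concrete, a minimal sketch under assumed thresholds — the numbers and the 90-day burn-in are illustrative, not values agreed in this thread:

```python
# A minimal sketch of tiered failure-rate SLOs. Thresholds, the volume cutoff,
# and the burn-in length are illustrative assumptions only.

def acceptable_failure_rate(invocations_per_quarter: int, days_in_production: int) -> float:
    """Maximum acceptable quarterly failure rate for a pipeline."""
    if days_in_production < 90:           # burn-in: tolerate a higher rate
        return 0.05
    if invocations_per_quarter < 1_000:   # low volume: failures are rare, cheap to triage
        return 0.01                       # "below 1%" after burn-in
    return 0.001                          # high volume: failures add up quickly

def meets_slo(failures: int, invocations: int, days_in_production: int) -> bool:
    rate = failures / invocations if invocations else 0.0
    return rate <= acceptable_failure_rate(invocations, days_in_production)

# e.g. a year-old, high-volume pipeline failing 8 times in 10,000 runs (0.08%) passes...
assert meets_slo(failures=8, invocations=10_000, days_in_production=365)
# ...while 30 failures in 10,000 runs (0.3%) would trigger an off-boarding review
assert not meets_slo(failures=30, invocations=10_000, days_in_production=365)
```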

## Prerequisite work to enable pipeline contribution
The DCP needs to fulfill the following capabilities before we can begin to accept contributed pipelines. These systems could be very lightweight to start (e.g. we could manually email users failure logs).

- The DCP must distinguish between contributed pipeline data and releasable data
Contributor:

Again, I need some clarity here. I think this is touching on another significant point where confusion is easily introduced for our users, with no clear plan for how to avoid it: results from many pipelines available for one dataset.
If one set of data is run 6 different ways, how are those results presented to the user? Some users will just want 1 set of results and won't care about what software was used, so how straightforward will it be for them to grab the "preferred" results? Who determines what the "preferred" or "default" results are?

Contributor (author):

Great point! This same problem comes up with just having new versions of analyses in the DCP even with standard pipelines. We'll have to determine how we want to present that information to users as well.

I'd love to see us not spend too much time on having many pipelines for one data type. Possibly 2 but once we get to 3 I'd recommend we look into deprecating one of them.

Although in that situation, given we don't delete old analyses, we'll still have many analysis results around in the DCP, even when the pipelines are deprecated.

We would definitely ensure that the user experience of having results from contributed pipelines and results from multiple pipelines is thought through and addressed if this RFC is approved.

Releases of HCA data would not have more than one set of analysis results for each set of data.

Contributor:

If we have more than one analysis available for a dataset, we will need an interface that allows users to browse the alternative analyses. Once we have that, there is no reason not to keep multiple versions of the processing results. So we should either stick with 1 or allow many; I don't think keeping the numbers low helps.

Contributor:

@jahilton I appreciate your perspective here and in other comments reviewing this RFC, specifically highlighting the confusion that would be caused by having more than 1 pipeline for an assay. I am very much aligned with you and think there would need to be very dramatic value before such a scenario would be considered. The one scenario that was convincing to me was sunsetting a pipeline. Given a current pipeline in production that is going to be changed over to a drastically new pipeline, it may be helpful to run both pipelines so those using the data can confirm they are comfortable with the new pipeline. (Of course, we will also have reports to point to.) This may also allow the completion of processing of data for a snapshot or a distribution with the soon-to-be-legacy pipeline as the new pipeline is brought in.
This RFC keeps the possibility open, but if it would make you more comfortable, we can communicate that this is expected to be a rare event.

Contributor:

@TimothyTickle Addressing the specific case of sunsetting a pipeline in favor of a drastically different pipeline, I think the majority of users are going to expect the DCP to do the confirmation for them (the reports that you mentioned). If we expect some portion of the community to assist in the validation of a new pipeline, then that process should be explicitly laid out. But if the DCP is switching to a new pipeline, then I expect it will be for good reason, and we won't cling to the old pipeline.
I would expect the DCP to handle sunsetting a pipeline in favor of a new one much as it will handle sunsetting a pipeline version in favor of a new version of the same pipeline (and thus, there will be multiple sets of analysis products, but only one produced by a current pipeline). Results from all pipeline versions would be accessible, but only results from a single in-use pipeline would be 'preferred'. I recognize that this scenario is inevitable, and that the specifics of how to communicate results to the user have not been decided. I do not think this RFC is the venue to make those decisions.
So I am specifically concerned about data that has more than one in-use pipeline at a given time. I'd like the RFC either to specify that such a case will not occur, or to give more thought to how the DCP would handle that case through the users' eyes.

Contributor:

@jahilton We are very aligned here and I have no strong concerns in removing the possibility of multiple pipelines for an assay in rare events. Operating 1 pipeline for 1 assay is the way I envision the DCP working. Unless there are other strong reactions to this, happy to have that language removed. Pinging a couple of people historically involved in this conversation to see if they still have strong feelings here. @jonahcool @ambrosejcarr @mckinsel

## User Stories
- As a user of the DCP (both user archetypes, but researcher with a pipette especially), I am looking for all raw data in the HCA DCP to be analyzed. I trust the scientific community to write high quality pipelines.
- As a computational biologist or methodologist, I have a pipeline that I would like to see leveraged in the HCA DCP.
- As the HCA DCP, I want to develop pipelines iteratively based on user feedback. To accomplish this, I need new pipelines to be leveraged in the DCP as quickly as possible.
Contributor:

I took this a step further and connected it to the data user, and I think frequently-updating analysis results do not add value for most of them. This is one significant point where confusion can easily be introduced to the user, and there is no clear plan for how to handle results from many versions of a pipeline for 1 dataset.

Contributor (author):

I agree that even with multiple versions of the same HCA standard pipelines, we don't currently have great ways to show users the many versions of analysis results available in the DCP. I would imagine that the latest would always be shown in the browser, that a dropdown would let users filter to older versions of analysis results, that users could pin to specific analysis results by a collection manifest, and that releases could also help users pin to specific versions. But those are just ideas and it's not up to me :)
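
A sketch of how those ideas could look in data terms; the record shape and field names are hypothetical, not the DCP's actual model:

```python
# Sketch of the ideas above: show the latest result by default, and let users
# pin exact versions via a manifest. All names here are invented for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class AnalysisResult:
    dataset_id: str
    pipeline: str
    version: int
    uri: str

def latest_per_pipeline(results: list[AnalysisResult]) -> dict[tuple[str, str], AnalysisResult]:
    """Default browser view: the newest result for each (dataset, pipeline) pair;
    a dropdown could expose the older versions filtered out here."""
    best: dict[tuple[str, str], AnalysisResult] = {}
    for r in results:
        key = (r.dataset_id, r.pipeline)
        if key not in best or r.version > best[key].version:
            best[key] = r
    return best

def collection_manifest(view: dict[tuple[str, str], AnalysisResult]) -> dict[str, int]:
    """Pin the exact versions a user worked with, so a release or a later reader
    can reproduce the same view."""
    return {f"{d}/{p}": r.version for (d, p), r in view.items()}
```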

Contributor:

Part of the modeling is proposed in this RFC, but I think there's still an access solution missing. But your solution sounds beautifully reasonable to me!


@barkasn (Contributor) commented Sep 19, 2019

Please add a section on code quality standards. There should be a requirement for code to adhere to best practices for the language it is written in, including WDL, for which the required level of granularity for individual tasks (if any) should be specified.

This should also include code documentation standards.

@TimothyTickle self-assigned this Sep 26, 2019
`[Name](mailto:[email protected])`

## Motivation
As the DCP accepts data from assays, it takes on responsibility for eventually processing that data and returning analysis products to our contributors. The core mission of the HCA DCP is to process and make available the diverse data types comprising the reference atlas.
Contributor:

"returning analysis products to our contributors" -> mostly a nit but we have two sets of users, the contributors and the consumers. In this particular statement, you mention the consumers as being the original contributors but I think the value add is primarily for new data consumers, right?

This RFC discusses (1) technical and scientific standards to determine when pipelines contributed by community members are ready for inclusion in the DCP, (2) a potential contribution process, (3) the prioritization of such pipelines, and (4) off-boarding pipelines when they lose value or cease to fulfill standard requirements.

## User Stories
- As a user of the DCP (both user archetypes, but researcher with a pipette especially), I am looking for all raw data in the HCA DCP to be analyzed. I trust the scientific community to write high quality pipelines.
Contributor:

"both user archetype" -> Could you be more specific please?

"I trust the scientific community to write high quality pipelines" is, in my mind, a whole new user story around the expectations of quality by the user so I'd recommend to break it out to be more explicit.


## User Stories
- As a user of the DCP (both user archetypes, but researcher with a pipette especially), I am looking for all raw data in the HCA DCP to be analyzed. I trust the scientific community to write high quality pipelines.
- As a computational biologist or methodologist, I have a pipeline that I would like to see leveraged in the HCA DCP.
Contributor:

Sorry, what does it mean to "be leveraged"? Just looking for specificity here (i.e. they'd like more data than their own to be analyzed? they'd like their data to be used by other data consumers? etc.)

## User Stories
- As a user of the DCP (both user archetypes, but researcher with a pipette especially), I am looking for all raw data in the HCA DCP to be analyzed. I trust the scientific community to write high quality pipelines.
- As a computational biologist or methodologist, I have a pipeline that I would like to see leveraged in the HCA DCP.
- As the HCA DCP, I want to develop pipelines iteratively based on user feedback. To accomplish this, I need new pipelines to be leveraged in the DCP as quickly as possible.
Contributor:

Sorry to nitpick, but I'd like some clarity on what "leverage" means in this sentence as well. Is it just simply making it accessible? Does it involve other work?

- As the HCA DCP, I want to develop pipelines iteratively based on user feedback. To accomplish this, I need new pipelines to be leveraged in the DCP as quickly as possible.

## Definitions
**Assay:** A biological and technological process that transforms a biological specimen into interpretable raw data, ideally of a known and standardized format. Generates output which must adhere to a specific raw data format.
Contributor:

Are "standardized format" and "a specific raw data format" referring to the same thing here? I wasn't sure if there's a difference.

- Not have any restrictions on the use of pipeline output(s).
- Have multiple test datasets available for use by the Data Processing Service to evaluate the validity of the results and serve as a benchmark for future improvements
- Provide acceptable ranges and a method for validation of any required run-specific parameters (example: starfish)
- Provide pipeline outputs using standard formats (if such standards exist)
Contributor:

If "such standards" don't exist, do we still accept the pipeline?



Data produced by Contributed Pipelines will be clearly marked as non-release data, to distinguish it from data eligible for release that was generated by AWG-vetted pipelines. Not all DCP services will be available for non-release data.

These requirements, and in particular the maintenance requirements, should be clearly communicated to contributors, and the DCP should make an effort to verify that contributors understand them.
Contributor:

What's the method of communication?


## Operating pipelines
### Responding to Failures
Operating any DCP pipeline on relevant data may result in occasional pipeline failures. The pipelines team will use the following procedure to resolve these failures:
Contributor:

I'm a little nervous that the wording around this section implies that pipeline failures should be dealt with "as soon as possible." The reality is that as you grow the number of pipelines, you are going to get more failures, and each failure is going to take longer to debug because there is an increasing number of code paths that could have caused the issue.

We need some notion of SLOs that sets expectations on when a user might expect data processed by a broken pipeline (hopefully pretty generous) and that prioritizes AWG-vetted pipeline fixes above anything else. I'm not sure if there's a particular team that will be delegated to pipeline maintenance, but the eng team will likely have to juggle maintenance with other priorities in the DCP, which is something to be mindful of.
