AUDR analysis updates RFC #91

Open · wants to merge 14 commits into master
Binary file added rfcs/images/0000-new-analysis.png
Binary file added rfcs/images/0000-new-bundle-relationships.png
Binary file added rfcs/images/0000-no-update.png
Binary file added rfcs/images/0000-update-analysis.png
Binary file added rfcs/images/0000-update-bundle-relationships.png
Binary file added rfcs/images/0000-update-file-relationships.png
125 changes: 125 additions & 0 deletions rfcs/text/0000-audr-analysis-updates.md
@@ -0,0 +1,125 @@
### DCP PR:

***Leave this blank until the RFC is approved** then the **Author(s)** must create a link between the assigned RFC
number and this pull request in the format:*
`[dcp-community/rfc#](https://github.com/HumanCellAtlas/dcp-community/pull/<PR#>)`

**2019-09-06: This RFC is paused. It requires a reconsideration of whether it is correct to specify analysis versions through
bundle versioning, which places high demands on Analysis and raises concurrency issues in Ingest.**

# AUDR Analysis Updates

## Summary
Enable Analysis to update previous analyses and to add new analyses alongside existing ones.



## Author(s)
[Justin Clark-Casey](mailto:[email protected])

## Shepherd
[Parth Shah](mailto:[email protected])

## Motivation
This RFC covers the following [use cases](https://docs.google.com/document/d/1rI8PCASomdAHznyWQceRJGv4-wtg-P5Rm-rI_uJLjtE/edit#):

* **Fix an incorrect metadata field (e.g. cell count)** where the change can trigger re-analysis.
* **Fix an incorrect ontology term** where the change can trigger re-analysis.
* Fix incorrect analysis using a corrected pipeline.
* **Add better analysis using an improved pipeline.**

Wranglers consider the **bolded** use cases more important than the others.

## Detailed Design
A primary metadata update triggers a data store notification. For each pipeline, Analysis determines which of the following should happen:

* **Update analysis.** The [meta]data change is in a file or parameter that updates an existing analysis (a “consequential” change).
* **New analysis.** The [meta]data change should create a new analysis. For example, the species changes from mouse to human and there is a pipeline for human on this data but not one for mouse. It may be that an analysis from a different pipeline already exists and this should not be superseded by the new analysis.
* **No update.** There is an existing analysis for this pipeline but the [meta]data change does not affect it (an “inconsequential” change). For example, a change in the project description.

The **update analysis** and **new analysis** scenarios can occur without a primary update. For example, a pipeline bug may be fixed, which requires updating the analyses it performed. Or a new pipeline may be created that can run on existing data, where the existing analyses from other pipelines remain valid and so should not be superseded.
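
The per-pipeline decision might be sketched as follows. This is a rough illustration only; `AnalysisAction`, `consequential_fields`, and the pipeline object are invented names for this sketch, not an existing DCP interface:

```python
from enum import Enum, auto


class AnalysisAction(Enum):
    UPDATE = auto()      # consequential change: re-run and supersede the existing analysis
    NEW = auto()         # run a new analysis that does not supersede existing ones
    NO_UPDATE = auto()   # inconsequential change: existing analysis stays valid, copy-forward only


def classify_change(changed_fields: set, pipeline) -> AnalysisAction:
    """Decide what one pipeline should do for a primary [meta]data change.

    `changed_fields` is the set of metadata field paths that changed;
    `pipeline` is a hypothetical object describing one analysis pipeline.
    """
    if pipeline.has_existing_analysis():
        if changed_fields & pipeline.consequential_fields:
            return AnalysisAction.UPDATE
        return AnalysisAction.NO_UPDATE
    # No existing analysis from this pipeline: the change may have made the
    # data runnable (e.g. species corrected from mouse to human).
    if pipeline.can_run_on_updated_data():
        return AnalysisAction.NEW
    return AnalysisAction.NO_UPDATE
```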

### Update analysis

![](../images/0000-update-analysis.png)

A bundle for an updated analysis will have the same UUID as the existing analysis. This will tell consuming components that the bundle with the later timestamp in its version supersedes those with earlier timestamps. The bundle relationship looks like this:

![](../images/0000-update-bundle-relationships.png)
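
For illustration (the identifiers and timestamps below are invented), the original and updated analysis bundles share a UUID and differ only in version, so consumers treat the later version as superseding the earlier one:

```python
# Invented identifiers, purely to illustrate the UUID/version relationship.
original_analysis_bundle = {
    "uuid": "0b1a2c3d-0000-0000-0000-000000000000",
    "version": "2019-07-01T120000.000000Z",
}

updated_analysis_bundle = {
    "uuid": original_analysis_bundle["uuid"],    # same UUID: same logical bundle
    "version": "2019-08-15T093000.000000Z",      # later timestamp: supersedes the original
}
```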

To submit an updated analysis, Analysis will:

1. Create a submission envelope with the parameter `isUpdate = true`.
2. Add `analysis_process_0.json` with `provenance.document_id` set to the same UUID as the existing `analysis_process_0.json`.
3. For other files that need updating, also add them to the envelope with the same `provenance.document_id` as their previous versions.
4. For files that only exist in the new analysis, add them to the envelope with a new `provenance.document_id`.
5. For files that exist only in the older analysis, delete them by issuing an HTTP DELETE to their endpoint in this envelope.
6. For files that are in both the old and new analysis but don't need updating, do nothing. These files will have the same UUID and version in both the old and new analyses.

Note that a submission envelope is a transaction that can add, delete, or update any number of files in a project; a sketch of such a submission follows.
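
The sketch below walks through steps 1 to 6, assuming a hypothetical Ingest client; `create_submission_envelope`, `add_file`, `delete_file`, and `confirm` are invented method names, and the real Ingest API may differ:

```python
def submit_updated_analysis(ingest, updated, new_files, removed, unchanged):
    """Submit an updated analysis as a single submission-envelope transaction.

    `updated`   maps file names to the provenance.document_id of their previous version;
    `new_files` exist only in the new analysis;
    `removed`   exist only in the older analysis;
    `unchanged` are identical in both analyses and are left untouched (step 6).
    """
    # Step 1: open the envelope as an update.
    envelope = ingest.create_submission_envelope(is_update=True)

    # Steps 2 and 3: analysis_process_0.json and other changed files are added
    # with the same provenance.document_id as their previous versions.
    for name, document_id in updated.items():
        envelope.add_file(name, provenance_document_id=document_id)

    # Step 4: files that only exist in the new analysis get a new document id.
    for name in new_files:
        envelope.add_file(name)

    # Step 5: files only present in the old analysis are removed, i.e. an
    # HTTP DELETE against their endpoint within this envelope.
    for name in removed:
        envelope.delete_file(name)

    # Step 6: nothing to do for `unchanged`; those files keep their UUID and version.
    return envelope.confirm()
```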

![](../images/0000-update-file-relationships.png)

**Contributor:** It would be good to describe that updated files are not really updated but are deleted and an independent new file is created in their place, and why. Also that this is a short-term compromise to simplify implementation at the expense of a more complex data model.

**Contributor:** It appears this is no longer the case and the figure just needs updating.

**Author:** For an updated analysis the figure tries to show the possible scenarios of a) not changing a file, b) updating a file, c) adding a file, and d) deleting a file. But it sounds like this isn't easy to see and I should improve the diagram?

On the content, yes, on further consideration deleting and re-adding all the other files seems a worse idea, since it means some special-case work and I'm not sure it ultimately helps analysis much. This is a change from what was discussed, so I need feedback from @rexwangcc and @samanehsan.

**Comment:** @justincc can you clarify why the old files need to be deleted in order to add analysis files that have new UUIDs?

**Author:** @samanehsan Because everything in the previous version of that bundle will automatically carry forward.

**@samanehsan (Aug 12, 2019):**

> Could you confirm that 2 analyses triggering in quick succession is a scenario that's not sufficiently theoretical for us to ignore temporarily?

@justincc I don't think the scenario is unrealistic, because it could happen if a primary submitter makes/updates a submission even an hour or two apart. Looking at this production workflow as an example, after analysis creates a submission envelope in ingest it takes 2 hours to upload the files and confirm the submission. So if an update came in less than 2 hours later, we would have this "simultaneous" submission problem.

Also in this case, the confirm-submission step only took 4 minutes, but the workflow can wait up to 2 hours for the submission to validate. And then there is additional time for the submission to go through ingest and appear as a results bundle in the data store. So in the worst-case scenario, "quick succession" actually means 2+ hours.

**Comment:** @justincc, this is the concern that @rexwangcc and I mentioned in tech-arch today ^

**Author (@justincc, Aug 30, 2019):** Yes, I think the answer here is that ingest may need to do complex engineering to ensure non-interference between submissions. Does that address the issue?

**Comment:** Can you clarify what that will look like in ingest? Is the idea that analysis would still specify which files we think are new/updated/deleted and then ingest confirms? That seems like we would be duplicating effort.

**Author (@justincc, Aug 30, 2019):** At the moment there's no plan for ingest to confirm anything; it just does what you say, or some known interpretation thereof. If it's an update, then the new version (which is given as the whole file, not a delta) will replace whatever the old version was. If it's the addition of a file that already exists, then it could just replace what is already there, if this is the easiest behaviour. Or if it's a deletion of something that's already deleted, then we could ignore that, if this is the easiest behaviour.

To make it easier for Analysis to supply the existing UUIDs, it will also be able to set the `provenance.document_id` when a file is first created.
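
As a hedged sketch of what that could look like: Analysis mints the document id itself and records it in the file's `provenance` block at creation time, then reuses the same id when it later submits an update. The `describedBy` URL and the field layout below are placeholders, not the exact HCA schema:

```python
import json
import uuid

# Analysis mints and stores the document id when the file is first created...
analysis_process_document_id = str(uuid.uuid4())

analysis_process = {
    "describedBy": "https://schema.example.org/type/process/analysis/analysis_process",  # placeholder URL
    "schema_type": "process",
    "provenance": {
        "document_id": analysis_process_document_id,
    },
}

with open("analysis_process_0.json", "w") as f:
    json.dump(analysis_process, f, indent=2)

# ...and supplies exactly the same provenance.document_id when it later
# submits an updated version of analysis_process_0.json.
```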

### New analysis
![](../images/0000-new-analysis.png)

This is the same as the new analysis implementation today, which is triggered by a new primary data bundle rather than by an update. If the new analysis should not supersede an existing analysis (for example, because that existing analysis comes from a different, still-valid pipeline), then the relationship between such bundles will look like this:

![](../images/0000-new-bundle-relationships.png)
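
In contrast to the update case, an analysis that should coexist rather than supersede is identified by its own bundle UUID (again, the values below are invented for illustration):

```python
# Invented identifiers: the analysis from a different, still-valid pipeline keeps
# its own bundle UUID, so the new analysis coexists with it rather than superseding it.
pipeline_a_analysis_bundle = {
    "uuid": "aaaa1111-0000-0000-0000-000000000000",
    "version": "2019-07-01T120000.000000Z",
}

pipeline_b_analysis_bundle = {
    "uuid": "bbbb2222-0000-0000-0000-000000000000",   # different UUID: does not supersede pipeline A
    "version": "2019-08-15T093000.000000Z",
}
```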

### No update
![](../images/0000-no-update.png)

To perform a copy-forward, Analysis tells Ingest that the input primary bundle for its analyses has been updated.

The update mechanism will be the same as for actually updating the analysis, except that Analysis will change only the `input_bundles` parameter when posting the `analysis_process_0.json`.
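
A rough before/after sketch of the copy-forward case (identifiers are invented, and the exact form of the `input_bundles` reference is an assumption): everything in the re-posted `analysis_process_0.json` is carried forward unchanged except the input bundle reference, which moves to the updated version of the primary bundle:

```python
# Invented identifiers; only input_bundles changes between the two documents.
previous_analysis_process = {
    "provenance": {"document_id": "9f8e7d6c-0000-0000-0000-000000000000"},
    "input_bundles": ["primary-bundle-uuid.2019-07-01T120000.000000Z"],
}

copied_forward_analysis_process = {
    **previous_analysis_process,
    # Point at the updated version of the primary input bundle; all other
    # fields, including provenance.document_id, are carried forward as-is.
    "input_bundles": ["primary-bundle-uuid.2019-08-15T093000.000000Z"],
}
```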

## Acceptance criteria
* Analysis can update existing analyses.
* Analysis can submit new analyses which do not supersede existing analyses.
* All primary data and metadata in analysis bundles are updated in response to a primary data or metadata change.

## Alternatives

### Only analysis_process_0.json requires same UUID on updates
Instead of requiring all file actions (update, delete, or create) to be specified when submitting an analysis update, Analysis would only need to specify the same UUID for `analysis_process_0.json`. All other files could be specified with new UUIDs. At the bundle level, all the original files except `analysis_process_0.json` would be deleted, and new files and file references created.

![](../images/0000-analysis-process-only-file-relationships.png)

The advantage here for analysis is that only one relationship needs to be specified. If the UUID could be generated from other parameters, it may mean Analysis doesn’t need to store UUIDs or any other information about files generated by a previous analysis.

The problem here is that actions are still required to delete all those old files before creating them anew. If Analysis does this, then it needs to know the file UUIDs anyway. If Ingest provides a convenience method, such as deleting all the files **except** `analysis_process_0.json`, this opens the way for bugs and arguably starts to put some intelligence about bundle changes into Ingest.

There seems to be a better separation of responsibility if Analysis is solely in charge of maintaining file relationships between bundles. This also means that this information will be present from the outset; it's possible that file relationship information is valuable to preserve.

### Ingest copy-forwards immediately
Instead of Analysis telling Ingest to perform a copy-forward when there is no analysis update, Ingest performs copy-forward automatically on all bundles as soon as it receives a primary update. If an analysis is also updated, Ingest needs to manage this so that a) it doesn't clash with any analysis bundle updates already in progress from the initial primary update, and b) both the analysis update and the copy-forward are correctly applied to the updated analysis bundle.

**PROS**
1. It's less work for all non-primary submitters. They don't need memory and mechanisms to signal the no-update case.

**CONS**
1. Ensuring serialization of bundle updates becomes critical. We do not want to risk interleaving a copy-forward with a possible analysis.
2. Ingest will need to do much more work, meaning greater resource consumption (or crowding out) and higher fragility. Any primary data update will start bundle updates that need to be repeated if an analysis comes in.

### Analysis declares fields of interest

Analysis pre-declares to Ingest the metadata fields on which it will update analyses. If a primary update comes in that changes none of those fields, Ingest goes ahead with copy-forward on the analysis bundles automatically.

**PROS**
1. Less work for non-primary submitters. They don't need mechanisms to hold and convey the information that no analysis update is needed.

**CONS**
1. Ingest needs to hold and preserve this information. It needs to provide an authenticated CRUD API for declarations.
2. Analysis needs to declare its fields of interest to Ingest. This is fragile because Analysis may accidentally not declare some fields, or over-declare and never trigger copy-forward.

## References
* [Add new data and metadata to existing projects (AUDR) #414](https://app.zenhub.com/workspaces/dcp-5ac7bcf9465cb172b77760d9/issues/humancellatlas/dcp/414)
* [Results from “The AUDacity of Hackathons” (June 2019 HCA DCP F2F session)](https://docs.google.com/document/d/1ifzy4lQtdIm0NvDpEuuQ2tD8ku9IPvn7sPjOg9BhliA)
* [Data model pre-RFC](https://docs.google.com/document/d/1CdIJ_pBdiiQUFh8kYARRhMXvhKXGERG3bdv763tpAKI/)
* [Tech Arch: Handle notifications for insignificant bundle updates](https://docs.google.com/presentation/d/1_snBHeCVPjLmgZPKH2SbHk_I3-s6jLlEkPtz0eUY_Xc/edit?ts=5cfa7ddc)
* [AUDR usecases](https://docs.google.com/document/d/1rI8PCASomdAHznyWQceRJGv4-wtg-P5Rm-rI_uJLjtE/edit)
* [Analysis updates and Q2 testing (AUDR) 2019-07-17](https://docs.google.com/presentation/d/1iZOoVYy4t3cG0VUMEj0swe9BfGJJ4_G1L4XOgvxSRvY)