Define reanalysis bundling strategy #236

Closed · 2 tasks
mweiden opened this issue Dec 7, 2018 · 22 comments
Labels: Spike Agile Spike

@mweiden (Contributor) commented Dec 7, 2018

Need

Ensure that new analysis results produced by reanalysis of existing data are propagated to the datastore and take precedence over the previous analysis results.

Scoping Task

  • Finish and refine the spec for Updating metadata. Optional: publish it as an RFC in the dcp-community repo.
  • Help teams that have work resulting from the RFC define tickets.
@justincc

As discussed (though not yet with you, @morrisonnorman! Let's talk after the hols), I am bouncing this to @morrisonnorman to really nail down the requirements before taking up the technical aspects.

@justincc

Linked to #234

@theathorn

Linked to #224.

@theathorn added the Spike (Agile Spike) and Triaged labels and removed the Triaged label on Feb 25, 2019
@justincc

The current strategy is that we simply re-ingest the whole project, which loses any provenance. This may be tweakable by at least keeping the same project UUID even if all of the bundles are completely different. That would require tombstoning the old bundles or updating them to show they are now obsolete (as per @diekhans).

@theathorn

I am concerned that re-ingesting a whole project and tombstoning the old data would make data citations (see #226) unviable: scientists who cite DCP data in publications expect the reference to that data, and the cited data itself, to remain unchanged. I would like to hear Tech Architecture's view on this problem.

@xbrianh (Member) commented Jun 26, 2019

Our update process should probably not have the side effect of making data disappear. This is problematic for citations, as @theathorn noted, and I don't think it's technically necessary.

@diekhans

I was under the impression that reingestion was a one-time thing.

Once we go public next month, it would be really bad to reingest. It is far better to have inconsistent metadata and projects that do not update than to lose provenance.

However, note that this issue is about reanalysis, which may or may not get a new UUID depending on policy. Either way, the previous analysis must remain.

@claymfischer

In a perfect world this recent reingestion would be the last time, but realistically a full implementation of AUDR won't be available immediately. There may be circumstances requiring some amount of reingestion that we are unable to predict at this time.

From this point forward, if a situation occurs requiring reingestion, it is imperative that anything released 'to the wild' never disappears. It must be citable and accessible at a URL, ideally in perpetuity.

Mike Cherry (@jmcherry) has also emphasized this.

@diekhans

@claymfischer what are the circumstances that would require reingestion?

@sampierson (Member)

I think the big 3 are:

  • Metadata correction
  • Data correction
  • Pipeline correction/changes

Am I close?

@claymfischer

"There may be circumstances requiring some amount of reingestion we are unable to predict at this time." - maybe something an early AUDR implementation is unable to resolve.

@diekhans

Am I close?

Very, although I would argue that even in these cases we should avoid reingestion unless the problem is scientifically really bad. If desperate, it would be "additional ingestion": create a newer project and leave the old one there, perhaps without displaying it in the browser.

Redaction of data: you purge the data and leave the metadata. Done manually.

Redaction of metadata (yes, this can happen): manually edit the DSS.

@xbrianh (Member) commented Jun 26, 2019

Redaction of data: you purge the data and leave the metadata. Done manually.

Redaction of metadata (yes, this can happen): manually edit the DSS.

Would an example of redaction be data where consent has been revoked? My understanding is that the DSS should support permanent deletion for these cases.

DSS objects may be updated with a new version, tombstoned, or deleted. Other than that, I'm not sure what "editing" would mean.
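
To make those operations concrete, here is a rough sketch, assuming the hca package's DSSClient exposes put_bundle and delete_bundle mirroring the data-store Swagger operations; the method names, parameters, and version format below are illustrative and should be checked against the real API:

```python
# Rough sketch only; method names, parameters, and the version format are
# assumptions based on the data-store Swagger API and may not be exact.
import datetime
import uuid

from hca.dss import DSSClient

dss = DSSClient()

bundle_uuid = "11111111-2222-3333-4444-555555555555"  # placeholder analysis bundle
new_version = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H%M%S.%fZ")

# Update: write a new version of the existing bundle UUID. Older versions
# stay retrievable, so citations to them keep resolving.
dss.put_bundle(
    uuid=bundle_uuid,
    version=new_version,
    replica="aws",
    creator_uid=0,
    files=[
        {
            "uuid": str(uuid.uuid4()),   # placeholder file UUID
            "version": new_version,
            "name": "analysis_results.json",
            "indexed": True,
        }
    ],
)

# Tombstone: mark the bundle as gone for a stated reason, leaving a record
# that it existed rather than a bare "not found".
dss.delete_bundle(
    uuid=bundle_uuid,
    replica="aws",
    reason="superseded by reanalysis",
)
```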

@jahilton

I think the big 3 are:

  • Metadata correction
  • Data correction
  • Pipeline correction/changes

Am I close?

Schema updates also require re-ingestion?

@diekhans

Schema updates also require re-ingestion?

Currently, there is no fully working mechanism for updating metadata or data in the system.

To try to summarize my view, since this is drifting:

  1. Re-ingesting was a short-term hack to move various development activities forward. It is neither part of a design nor an SOP.
  2. Once the site goes officially live next month, we should not reingest unless it is an emergency. Even then, we need to keep the old data around (unless redacted), despite any pain and confusion.
  3. We all need to work towards getting a complete update mechanism working to support scientific requirements.

@diekhans

Would an example of redaction be data where consent has been revoked? My understanding is that the DSS should support permanent deletion for these cases.

Yes, redaction is for when we are not legally allowed to have the data: data that was never properly consented in the first place, or personally identifiable information accidentally added to the metadata, which does sometimes happen.

In all cases, we need to remove the data completely while still being able to inform users that it was redacted, rather than just 'not found'.

@theathorn

@diekhans "reingest ... we need to keep the old data around". Not sure how that would work. We don't want 2 copies of a project (i.e. the same project with different project UUIDs) in the "live/latest" view as they will be counted twice in the Data Browser. Proper versioning would allow us to show the "latest" and "previous releases" views (Azul currently only indexes the latest version of files but we have started discussing the needed changes to support versioning / data releases).

@diekhans

@theathorn It would work rather badly. The browser would have to put in some kind of ugly hack to not display the old project, like hard-coding the UUID.

But this would be an emergency anyway, and it would go away.

@justincc

@diekhans I mistakenly thought this was about re-ingestion. It is indeed about re-analysis, which would either be a new analysis, where >=2 analysis bundles have the same parent primary bundle, or a reanalysis (as is the case here), which is an update where an existing bundle in the datastore gets a new version. In both cases it is up to the analysis team to tell us what they want, and there will be a mechanism for them to do so.
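
A minimal sketch of the two cases, with purely illustrative field names (not the actual bundle or metadata schema):

```python
# Hypothetical sketch of the two cases above; names are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class AnalysisBundle:
    uuid: str            # bundle UUID in the datastore
    version: str         # bundle version (timestamp string)
    primary_parent: str  # UUID of the primary (input) bundle

primary = "primary-uuid-1"

# Case 1: a *new* analysis. A second analysis bundle, with its own UUID,
# points at the same parent primary bundle; both analyses coexist.
first = AnalysisBundle("analysis-uuid-1", "2019-01-01T000000Z", primary)
second = AnalysisBundle("analysis-uuid-2", "2019-06-26T000000Z", primary)

# Case 2: a *reanalysis* (this issue). The existing analysis bundle keeps
# its UUID and gets a new version; the old version remains retrievable.
updated = AnalysisBundle(first.uuid, "2019-06-27T000000Z", primary)

assert second.uuid != first.uuid    # new analysis: new UUID, same parent
assert updated.uuid == first.uuid   # reanalysis: same UUID, newer version
assert updated.version > first.version
```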

@justincc commented Jun 27, 2019

If there is an experiment structure change, then the only currently available action, as @lauraclarke has previously said, is to re-ingest. We could probably make the new structure a new submission to the project and tombstone the previous submission, which would keep the project UUID. This will still break provenance unless the reason for the change is kept in a file that remains in both submissions, such as the file that stores the project metadata. If we want to explore this solution then it should be added to 2019Q3 AUDR asap, unless it's acceptable to push it off to Q4, since we will need to do some work on preserving the project UUID.

@justincc

@jahilton We are coming up to the point where simple metadata updates (those that change existing fields) will start becoming reality. In Q3 we are scheduled to also allow addition of data (and implicitly, I guess, data updates, which is not something that was in Q2!). Adding/removing fields (a schema update) is probably a simple addition from ingest's point of view, albeit sometimes one with a large delta. The big issue is structural change, where it's extremely difficult/impossible to establish relationships between changing bundles.

This was explored in the Feb 2019 F2F.

@justincc

Included within #416
