Define reanalysis bundling strategy #236

Closed · 2 tasks
mweiden opened this issue Dec 7, 2018 · 22 comments
Labels: Spike Agile Spike

@mweiden (Contributor) commented Dec 7, 2018

Need

Ensure that new analysis results produced by reanalysis of existing data are propagated to the datastore and take precedence over the previous analysis results.

Scoping Task

  • Finish and refine the spec for Updating metadata. Optional: publish it as an RFC in the dcp-community repo.
  • Help teams that have work resulting from the RFC define tickets.
@justincc

As discussed (though not yet with you, @morrisonnorman! Let's talk after the hols), I am bouncing this to @morrisonnorman to really nail down the requirements before taking up the technical aspects.

@justincc

Linked to #234

@theathorn

Linked to #224.

@theathorn added the Spike (Agile Spike) and Triaged labels and removed the Triaged label on Feb 25, 2019
@justincc

The current strategy is that we simply re-ingest the whole project, which loses any provenance. This may be tweakable by at least keeping the same project UUID even if all of the bundles are completely different. That would require tombstoning the old bundles or updating them to show they are now obsolete (as per @diekhans).

@theathorn

I am concerned that re-ingesting a whole project and tombstoning the old data would make data citations (see #226) unviable: scientists who cite DCP data in publications expect the reference to that data, and the cited data itself, to remain unchanged. I would like to hear Tech Architecture's view on this problem.

@xbrianh (Member) commented Jun 26, 2019

Our update process should probably not have the side effect of making data disappear. This is problematic for citations, as @theathorn noted, and I don't think it's technically necessary.

@diekhans

I was under the impression that reingestion was a one-time thing.

Once we go public next month, it would be really bad to reingest. It is far better to have inconsistent metadata and projects that do not update than to lose provenance.

However, note that this issue is about reanalysis, which may or may not get a new UUID depending on policy. Either way, the previous analysis must remain.

@claymfischer

In a perfect world this recent reingestion would be the last time, but realistically a full implementation of AUDR won't be available immediately. There may be circumstances requiring some amount of reingestion that we are unable to predict at this time.

From this point forward, if a situation occurs requiring reingestion, it is imperative that anything released 'to the wild' never disappears. It must be citable and accessible at a URL, ideally in perpetuity.

Mike Cherry (@jmcherry) has also emphasized this.

@diekhans

@claymfischer what are the circumstances that would require reingestion?

@sampierson (Member)

I think the big 3 are:

  • Metadata correction
  • Data correction
  • Pipeline correction/changes

Am I close?

@claymfischer

"There may be circumstances requiring some amount of reingestion we are unable to predict at this time." - maybe something an early AUDR implementation is unable to resolve.

@diekhans

Am I close?

Very, although I would argue that even in these cases we should avoid reingestion unless the problem is scientifically really bad. If desperate, it would be "additional ingestion": create a newer project and leave the old one there, perhaps without displaying it in the browser.

Redaction of data: you purge the data and leave the metadata. Done manually.

Redaction of metadata (yes, this can happen): manually edit the DSS.

@xbrianh (Member) commented Jun 26, 2019

Redaction of data: you purge the data and leave the metadata. Done manually.

Redaction of metadata (yes, this can happen): manually edit the DSS.

Would an example of redaction be data where consent has been revoked? My understanding is that the DSS should support permanent deletion for these cases.

DSS objects may be updated with a new version, tombstoned, or deleted. Other than that, I'm not sure what "editing" would mean.
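
To make those operations concrete, here is a rough sketch, assuming the hca package's DSSClient exposes put_bundle and delete_bundle mirroring the data-store Swagger operations; the method names, parameters, and version format below are illustrative and should be checked against the real API:

```python
# Rough sketch only; method names, parameters, and the version format are
# assumptions based on the data-store Swagger API and may not be exact.
import datetime
import uuid

from hca.dss import DSSClient

dss = DSSClient()

bundle_uuid = "11111111-2222-3333-4444-555555555555"  # placeholder analysis bundle
new_version = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H%M%S.%fZ")

# Update: write a new version of the existing bundle UUID. Older versions
# stay retrievable, so citations to them keep resolving.
dss.put_bundle(
    uuid=bundle_uuid,
    version=new_version,
    replica="aws",
    creator_uid=0,
    files=[
        {
            "uuid": str(uuid.uuid4()),   # placeholder file UUID
            "version": new_version,
            "name": "analysis_results.json",
            "indexed": True,
        }
    ],
)

# Tombstone: mark the bundle as gone for a stated reason, leaving a record
# that it existed rather than a bare "not found".
dss.delete_bundle(
    uuid=bundle_uuid,
    replica="aws",
    reason="superseded by reanalysis",
)
```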

@jahilton

I think the big 3 are:

  • Metadata correction
  • Data correction
  • Pipeline correction/changes

Am I close?

Schema updates also require re-ingestion?

@diekhans

Schema updates also require re-ingestion?

Currently, there is no fully working mechanism for updating metadata or data in the system.

To try to summarize my view, since this is drifting:

  1. Re-ingesting was a short-term hack to move various development activities forward. It is neither part of a design nor an SOP.
  2. Once the site goes officially live next month, we should not reingest unless it is an emergency. Even then, we need to keep the old data around (unless redacted), despite any pain and confusion.
  3. We all need to work towards getting a complete update mechanism working to support scientific requirements.

@diekhans

Would an example of redaction be data where consent has been revoked? My understanding is that the DSS should support permanent deletion for these cases.

Yes, redaction is for when we are not legally allowed to have the data: data that was never properly consented in the first place, or personally identifiable information accidentally added to the metadata, which does sometimes happen.

In all cases, we need to remove the data completely while still being able to inform users that it was redacted, rather than just 'not found'.

@theathorn

@diekhans "reingest ... we need to keep the old data around". Not sure how that would work. We don't want 2 copies of a project (i.e. the same project with different project UUIDs) in the "live/latest" view as they will be counted twice in the Data Browser. Proper versioning would allow us to show the "latest" and "previous releases" views (Azul currently only indexes the latest version of files but we have started discussing the needed changes to support versioning / data releases).

@diekhans

@theathorn It would work rather badly. The browser would have to put in some kind of ugly hack to not display the old project, like hard-coding the UUID.

But this would be an emergency anyway, and it would go away.

@justincc

@diekhans I mistakenly thought this was about re-ingestion. It is indeed about re-analysis, which would either be a new analysis, where >=2 analysis bundles have the same parent primary bundle, or a reanalysis (as is the case here), which is an update where an existing bundle in the datastore gets a new version. In both cases it is up to the analysis team to tell us what they want, and there will be a mechanism for them to do so.
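
A minimal sketch of the two cases, with purely illustrative field names (not the actual bundle or metadata schema):

```python
# Hypothetical sketch of the two cases above; names are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class AnalysisBundle:
    uuid: str            # bundle UUID in the datastore
    version: str         # bundle version (timestamp string)
    primary_parent: str  # UUID of the primary (input) bundle

primary = "primary-uuid-1"

# Case 1: a *new* analysis. A second analysis bundle, with its own UUID,
# points at the same parent primary bundle; both analyses coexist.
first = AnalysisBundle("analysis-uuid-1", "2019-01-01T000000Z", primary)
second = AnalysisBundle("analysis-uuid-2", "2019-06-26T000000Z", primary)

# Case 2: a *reanalysis* (this issue). The existing analysis bundle keeps
# its UUID and gets a new version; the old version remains retrievable.
updated = AnalysisBundle(first.uuid, "2019-06-27T000000Z", primary)

assert second.uuid != first.uuid    # new analysis: new UUID, same parent
assert updated.uuid == first.uuid   # reanalysis: same UUID, newer version
assert updated.version > first.version
```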

@justincc commented Jun 27, 2019

If there is an experiment structure change, then the only currently available action, as @lauraclarke has previously said, is to re-ingest. We could probably make the new structure a new submission to the project and tombstone the previous submission, which would keep the project UUID. This will still break provenance unless the reason for the change is kept in a file that remains in both submissions, such as the file that stores the project metadata. If we want to explore this solution then it should be added to 2019Q3 AUDR asap, unless it's acceptable to push it off to Q4, since we will need to do some work on preserving the project UUID.

@justincc

@jahilton We are coming up to the point where simple metadata updates (those that change existing fields) will start becoming reality. In Q3 we are scheduled to also allow addition of data (and implicitly, I guess, data updates, which is not something that was in Q2!). Adding/removing fields (a schema update) is probably a simple addition from ingest's point of view, albeit sometimes one with a large delta. The big issue is structural change, where it's extremely difficult/impossible to establish relationships between changing bundles.

This was explored in the Feb 2019 F2F.

@justincc

Included within #416
