Define reanalysis bundling strategy #236
As discussed (though not yet with you, @morrisonnorman! Let's talk after the hols), I am bouncing this to @morrisonnorman to really nail down the requirements before taking up the technical aspects.
Linked to #234
Linked to #224.
The current strategy is that we simply re-ingest the whole project. This will lose any provenance. This may be tweakable by at least keeping the same project UUID even if all the bundles are completely different. That would require tombstoning the old bundles or updating them to show they are now obsolete (as per @diekhans).
I am concerned that re-ingesting a whole project and tombstoning the old data would make data citations (see #226) not viable: scientists who cite DCP data in publications expect the reference to that data, and the cited data itself, to remain unchanged. I would like to hear Tech Architecture's view on this problem.
Our update process should probably not have the side effect of making data disappear. This is problematic for citations, as @theathorn noted, and I don't think it's technically necessary. |
I was under the impression that reingestion was a one-time thing. Once we go public next month, it would be really bad to reingest; it is far better to have inconsistent metadata and projects that do not update than to lose provenance. However, note that this issue is about reanalysis, which may or may not get a new UUID depending on policy. Either way, the previous analysis must remain.
In a perfect world this recent reingestion would be the last time, but realistically full implementation of AUDR won't be available immediately. There may be circumstances requiring some amount of reingestion we are unable to predict at this time. From this point forward, if a situation occurs requiring reingestion, it is imperative that anything released 'to the wild' never disappears. It must be citable and accessible at a URL - ideally in perpetuity. Mike Cherry (@jmcherry) has also emphasized this. |
@claymfischer what are the circumstances that would require reingestion? |
I think the big 3 are:
"There may be circumstances requiring some amount of reingestion we are unable to predict at this time." - maybe something an early AUDR implementation is unable to resolve. |
Very, although I would argue that even in these cases we should avoid reingestion unless it is scientifically really bad. If desperate, it would be "additional ingestion": create a newer project and leave the old one there, perhaps avoiding displaying it in the browser. For redaction of data, you purge the data itself and leave the metadata, done manually. Redaction of metadata (yes, this can happen) would mean manually editing the DSS.
Would an example of redaction be data where consent has been revoked? My understanding is that the DSS should support permanent deletion for these cases. DSS objects may be updated with a new version, tombstoned, or deleted. Other than that, I'm not sure what "editing" would mean. |
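To make the three mutation modes mentioned above concrete, here is a minimal in-memory sketch. All names here (`VersionedStore`, `ObjectRecord`, etc.) are hypothetical, not the real DSS API; the point is only the behavioral difference between versioned update, tombstoning, and permanent deletion.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectRecord:
    versions: dict = field(default_factory=dict)  # version string -> content
    tombstoned: bool = False

class VersionedStore:
    """Toy model of a versioned object store (hypothetical, not the DSS)."""

    def __init__(self):
        self.objects = {}  # uuid -> ObjectRecord

    def put(self, uuid, version, content):
        # Update = write a new version; earlier versions remain readable.
        rec = self.objects.setdefault(uuid, ObjectRecord())
        rec.versions[version] = content

    def tombstone(self, uuid):
        # Mark obsolete but keep it resolvable: readers learn it was
        # tombstoned/redacted rather than getting a bare "not found".
        self.objects[uuid].tombstoned = True

    def delete(self, uuid):
        # Permanent deletion, e.g. revoked consent: the object truly disappears.
        del self.objects[uuid]

    def get(self, uuid, version=None):
        rec = self.objects.get(uuid)
        if rec is None:
            return ("not found", None)
        if rec.tombstoned:
            return ("tombstoned", None)
        if version is None:
            version = max(rec.versions)  # latest by version ordering
        return ("ok", rec.versions[version])
```

The citation concern maps onto this directly: `put` preserves old versions (citable), `tombstone` at least explains the absence, and only `delete` silently breaks a citation.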
Schema updates also require re-ingestion? |
Currently, there is no fully working mechanism for updating metadata or data in the system. To try to summarize my view, since this is drifting:
Yes, redaction is when we are not legally allowed to have the data: data never being properly consented in the first place, or personally identifiable information accidentally being added to metadata, which sometimes happens. In all cases, we need to remove the data completely, while still being able to inform users it was redacted, rather than just "not found".
@diekhans "reingest ... we need to keep the old data around". Not sure how that would work. We don't want 2 copies of a project (i.e. the same project with different project UUIDs) in the "live/latest" view as they will be counted twice in the Data Browser. Proper versioning would allow us to show the "latest" and "previous releases" views (Azul currently only indexes the latest version of files but we have started discussing the needed changes to support versioning / data releases). |
@theathorn It would work rather badly. The Browser would have to put in some kind of ugly hack to not display the old project, like hard-coding the UUID? But this would be an emergency anyway, and it would go away.
@diekhans I accidentally thought this was about re-ingestion. It is indeed about re-analysis, which would either be a new analysis, where >=2 analysis bundles would have the same parent primary bundle, or a reanalysis (as is the case here), which would be an update where an existing bundle in the datastore gets a new version. In both cases it is up to Analysis to tell us what they want, and there will be a mechanism for them to do so.
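The two shapes described in the comment above can be sketched as data (hypothetical structures, not the real bundle schema): a *new analysis* adds a sibling bundle sharing the same primary parent, while a *reanalysis* appends a new version to the existing bundle, preserving its history.

```python
from dataclasses import dataclass, field

@dataclass
class AnalysisBundle:
    uuid: str
    parent_primary: str                      # UUID of the primary (input) bundle
    versions: list = field(default_factory=list)

def new_analysis(bundles, uuid, parent, version):
    # Case 1: a distinct analysis bundle; >=2 bundles may share one parent.
    bundles.append(AnalysisBundle(uuid, parent, [version]))

def reanalysis(bundles, uuid, version):
    # Case 2: the existing bundle gets a new version; prior versions remain.
    bundle = next(b for b in bundles if b.uuid == uuid)
    bundle.versions.append(version)
```

In either case the previous analysis is still reachable, which is the property the citation discussion above requires.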
If there is an experiment structure change, then the only currently available action, as @lauraclarke has previously said, is to re-ingest. We could probably make the new structure a new submission to the project and tombstone the previous submission, which would keep the project UUID. This will still break provenance unless the reason for the change is kept in a file that remains in both submissions, such as the file that stores the project metadata. If we want to explore this solution then it should be added to 2019Q3 AUDR asap, unless it's acceptable to push it off to Q4, since we will need to do some work on preserving the project UUID.
@jahilton We are coming up to the point where simple metadata updates (those that change existing fields) will start becoming reality. In Q3 we are scheduled to also allow addition of data (and implicitly, I guess, data update - this is not something that was in Q2!). Adding or removing fields (a schema update) is probably a simple addition from ingest's point of view, albeit sometimes one with a large delta. The big issue is structural change, where it's extremely difficult or impossible to establish relationships between changing bundles. This was explored at the Feb 2019 F2F.
Included within #416 |
Need
Ensure that new analysis results resulting from reanalysis of existing data are propagated to the datastore and take precedence over the previous analysis results.
Scoping Task
dcp-community repo.