Problem: transfer names are not checked for duplicates before ingest #851
Comments
Here's where duplicate transfer name check functionality was added to artefactual-labs/enduro: The user manual description of the
Note that the "internal database" is the Enduro database, so the check only compares the name against other transfers successfully processed by Enduro.
@aseles13 we still need to implement a solution for this issue in SDPS Enduro; artefactual-labs/enduro#548 only applies to "Legacy" Enduro.
I looked at this, and it seems that to fully solve this issue the check would have to work for arbitrary transfer source locations. That would seem to mean looking at external databases in other preservation systems, for example an Archivematica Storage Service instance that holds transfers. I am curious whether that is still the intended solution, or whether we have decided it would be scope creep for the Enduro project.
Per offline discussion, we're going to keep the first iteration simple: an internal database will record new transfers that are completed and check against that. The feature will not try to look back in time at packages ingested before the feature was implemented. In the future I could see someone perhaps wanting to connect another database source, or maybe wanting to populate the internal database with historical transfers, but we won't worry about that right now.
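To make the first-iteration idea concrete, here is a rough sketch of an internal registry that records completed transfer names and answers "has this name been seen before?". It is only an illustration, not Enduro's actual implementation: the table name `completed_transfer`, the helper names, and the use of SQLite are all assumptions, and names are compared exactly (case-sensitive).

```python
import sqlite3


def open_registry(path=":memory:"):
    """Open (or create) a hypothetical registry of completed transfer names."""
    conn = sqlite3.connect(path)
    # The PRIMARY KEY on name means the database itself rejects a second
    # record with the same transfer name.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS completed_transfer ("
        " name TEXT PRIMARY KEY,"
        " completed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def is_duplicate(conn, name):
    """Return True if a transfer with this name was already recorded."""
    row = conn.execute(
        "SELECT 1 FROM completed_transfer WHERE name = ?", (name,)
    ).fetchone()
    return row is not None


def record_completed(conn, name):
    """Record a transfer name once its processing has completed."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO completed_transfer (name) VALUES (?)", (name,)
        )
```

Because only transfers completed after the registry exists are recorded, a sketch like this naturally does not look back at packages ingested earlier; backfilling historical names would be a separate step.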
Is your feature request related to a problem? Please describe.
One of Artefactual's clients creates transfer packages with a consistent naming structure. The client creates many packages at a time and uploads them to Enduro en masse. Because their package creation and management method is relatively manual, there is a chance that human error will result in the same package being uploaded more than once, either within the same upload or over the course of any number of uploads, resulting in two AIPs with the same name. There are two possible reasons for a package to have the same name as a previous package: either it contains the same material and therefore has the same identifiers (which are used for the package name), or it is an error by the person creating the package. In either case, the user should be notified that an AIP with that name already exists in storage.
The need to have unique package names can also be extrapolated out to best practices - having two packages with the same name hinders searchability and could be considered a preservation risk, regardless of whether or not the contents are identical. Even though Archivematica/a3m's use of UUIDs prevents file naming collisions, users should be able to ensure that the human-readable or contextually significant part of the package name is also unique.
Describe the solution you'd like
Implement a check that compares the name of the new package to all packages that have been previously processed by the Enduro instance. If a duplicate name is detected, the user should be notified and the package should not be sent for ingest.
The check should be able to work across multiple transfer source locations.
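As a rough illustration of the behaviour requested above, the following sketch screens a batch of incoming transfers against previously processed names, regardless of which source location each transfer came from. The `Transfer` type, the `source` label, and `screen_batch` are hypothetical names invented for this example. It also assumes that a name counts as taken as soon as a transfer is accepted, so a duplicate within the same upload is caught too.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    name: str    # human-readable package name being checked
    source: str  # hypothetical label for the transfer source location


def screen_batch(batch, known_names):
    """Split a batch into (accepted, rejected) by transfer name.

    known_names holds the names of previously processed packages. The
    source location is deliberately ignored, so the same name arriving
    from two different locations still counts as a duplicate.
    """
    seen = set(known_names)
    accepted, rejected = [], []
    for transfer in batch:
        if transfer.name in seen:
            # Duplicate: notify the user and do not send for ingest.
            rejected.append(transfer)
        else:
            seen.add(transfer.name)
            accepted.append(transfer)
    return accepted, rejected
```

For example, given a batch containing the same name twice from two different sources, the first occurrence would be accepted and the second rejected, and the rejected list could then drive the user notification.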
Describe alternatives you've considered
None
Additional context
Legacy Enduro has implemented such a check, but I think there's a chance that it ONLY looks at the contents of a given batch, rather than across the full history of transfers. This implementation, I believe, doesn't look at the AIP store for transfer names. How this feature will work for an Enduro instance that is already in use is an open question - would love to hear opinions about how far back, if at all, the check should be looking.
Note that the desired solution doesn't suggest that we look for duplicate materials; that is, it doesn't need to see if the same image or video has already been preserved. In my opinion, that's a separate (and potentially more complicated) feature. This feature is just for transfer names.