Problem: transfer names are not checked for duplicates before ingest #851

Open
sallain opened this issue Feb 1, 2024 · 5 comments

sallain commented Feb 1, 2024

Is your feature request related to a problem? Please describe.

One of Artefactual's clients creates transfer packages with a consistent naming structure. The client creates many packages at a time and uploads them to Enduro en masse. Because their package creation and management method is relatively manual, there is a chance that human error will result in the same package being uploaded more than once, either within the same upload or over the course of any number of uploads, resulting in two AIPs with the same name. There are two possible reasons for a package to have the same name as a previous package: either it contains the same material and therefore has the same identifiers (which are used for the package name), or it is an error by the person creating the package. In either case, the user should be notified that an AIP with that name already exists in storage.

The need for unique package names is also a matter of best practice: having two packages with the same name hinders searchability and could be considered a preservation risk, regardless of whether the contents are identical. Even though Archivematica/a3m's use of UUIDs prevents file naming collisions, users should be able to ensure that the human-readable or contextually significant part of the package name is also unique.

Describe the solution you'd like

Implement a check that compares the name of the new package to the names of all packages that have been previously processed by the Enduro instance. If a duplicate name is detected, the user should be notified and the package should not be sent for ingest.

The check should be able to work across multiple transfer source locations.
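
To make the requested behaviour concrete, here is a minimal sketch of a name check that ignores the transfer source location. It is not Enduro code: the `Transfer` type, the `split_duplicates` function and the source identifiers are hypothetical, and exact, case-sensitive name matching is an assumption. It rejects a name that was processed earlier as well as a name repeated within the same upload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    source: str  # hypothetical transfer source location identifier
    name: str    # human-readable transfer name


def split_duplicates(incoming, known_names):
    """Return (accepted, rejected) transfers from `incoming`.

    A transfer is rejected when its name was already processed
    (`known_names`) or appears earlier in the same upload. The source
    location is deliberately ignored, so the check spans all locations.
    Names are compared exactly (an assumption of this sketch).
    """
    seen = set(known_names)
    accepted, rejected = [], []
    for transfer in incoming:
        if transfer.name in seen:
            rejected.append(transfer)
        else:
            seen.add(transfer.name)
            accepted.append(transfer)
    return accepted, rejected


if __name__ == "__main__":
    batch = [
        Transfer("source-a", "2024-001"),
        Transfer("source-b", "2024-001"),  # same name, different location
        Transfer("source-a", "2023-117"),  # name already processed earlier
        Transfer("source-b", "2024-002"),
    ]
    accepted, rejected = split_duplicates(batch, known_names={"2023-117"})
    for t in rejected:
        print(f"duplicate transfer name {t.name!r} from {t.source}; not sent for ingest")
```

In this sketch the first occurrence of a name within an upload is accepted and later occurrences are rejected; whether the whole set should be held back instead is a design choice the issue leaves open.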

Describe alternatives you've considered

None

Additional context

Legacy Enduro has implemented such a check, but I think there's a chance that it ONLY looks at the contents of a given batch, rather than across the full history of transfers. This implementation, I believe, doesn't look at the AIP store for transfer names. How this feature will work for an Enduro instance that is already in use is an open question - would love to hear opinions about how far back, if at all, the check should be looking.

Note that the desired solution doesn't suggest that we look for duplicate materials; that is, it doesn't need to see if the same image or video has already been preserved. In my opinion, that's a separate (and potentially more complicated) feature. This feature is just for transfer names.

sallain changed the title from "Feature: check for duplicates based on transfer name" to "Problem: transfer names are not checked for duplicates before ingest" on Feb 2, 2024
djjuhasz commented Mar 7, 2024

Here's where duplicate transfer name check functionality was added to artefactual-labs/enduro:
https://github.com/artefactual-labs/enduro/pull/548/files

The user manual description of the rejectDuplicates option provides a good summary of how the check works:

rejectDuplicates (Boolean)

When enabled, the workflow will execute a check on the internal database for
successfully completed transfers with the same transfer name as the currently
processing package. If it finds a duplicate the transfer will fail.

Note that the "internal database" is the Enduro database - so it's only checking the name against other transfers successfully processed by Enduro.
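
As a rough illustration of the kind of lookup described above, the sketch below checks a name against successfully completed transfers only. It is not taken from the Enduro code base: the `transfers` table, its columns and the `done`/`error` status values are hypothetical, and SQLite is used only to keep the example self-contained.

```python
import sqlite3

# Hypothetical schema: one row per transfer, with a free-text status.
SCHEMA = """
CREATE TABLE transfers (
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    status TEXT NOT NULL
)
"""


def open_db():
    """Create an in-memory database with the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def record_transfer(conn, name, status):
    """Store the outcome of a processed transfer."""
    conn.execute(
        "INSERT INTO transfers (name, status) VALUES (?, ?)", (name, status)
    )
    conn.commit()


def is_duplicate(conn, name):
    """True if a successfully completed transfer already uses `name`."""
    row = conn.execute(
        "SELECT 1 FROM transfers WHERE name = ? AND status = 'done' LIMIT 1",
        (name,),
    ).fetchone()
    return row is not None


if __name__ == "__main__":
    conn = open_db()
    record_transfer(conn, "2024-001", "done")
    record_transfer(conn, "2024-002", "error")
    print(is_duplicate(conn, "2024-001"))  # a completed transfer has this name
    print(is_duplicate(conn, "2024-002"))  # only a failed attempt exists
    print(is_duplicate(conn, "2024-003"))  # name never recorded
```

Because the lookup only sees rows in this one database, a package with the same name that was ingested through another system would not be detected.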

aseles13 commented Apr 3, 2024

Is there anything else we need to do with this issue @djjuhasz and @sallain? Or does !548 address this?

djjuhasz commented Apr 3, 2024

@aseles13 we still need to implement a solution for this issue in SDPS Enduro - artefactual-labs/enduro#548 only applies to "Legacy" Enduro.

@Diogenesoftoronto
Contributor

I looked at this, and it seems that fully solving this issue would require the check to work for arbitrary transfer source locations. That would seem to mean looking at external databases in other preservation systems, for example an Archivematica Storage Service instance that holds transfers. I am curious whether that is still the intended solution or whether we have decided that it would be scope creep for the Enduro project.

sallain commented Apr 3, 2024

Per offline discussion, we're going to keep the first iteration simple - an internal database will record new transfers as they are completed, and incoming transfers will be checked against it. The feature will not try to look back in time at packages ingested before the feature was implemented.

In the future I could see someone perhaps wanting to connect another database source, or maybe wanting to populate the internal database with historical transfers, but we won't worry about that right now.
