Collection Criteria

The following criteria were used to appraise the research dataset files that were included in the RDSS-Archivematica Test Data Corpus collection:

License and rights: The corpus files must be in the public domain or be published under a valid re-use license so that they can be freely shared amongst the many collaborators within the RDSS project as well as for the benefit of other interested parties.
Project relevance:

Preference for datasets with a connection to a HEI that is piloting RDSS-Archivematica.
Preference for datasets that come from UK HEI research institutions and/or UK funded research.
Preference for datasets that have been exported from RDSS-HEI repository applications and/or with metadata from other integrated systems. NOTE: As of 18-06-2017 these systems are not yet known to the RDSS-Archivematica team, but likely candidates are: Figshare, Pure, EPrints, DSpace (or localized forks/branches, e.g. Vivo), Fedora/Hydra (or localized forks/branches, e.g. Willow), as well as RDM services such as DataCite and ORCID.

RDSS MVP/Alpha performance tests: bitstreams <= 5 GB, files < 1000, file <= 1 MB. NOTE: GitHub has a 1 GB limit for free public repos, so larger files (possibly 20 MB and above, threshold to be confirmed) will need to be linked out to another storage domain. See https://git-lfs.github.com/
RDSS Beta performance tests: bitstreams <= 5 TB, files < 1000000, file <= 1 TB
Dataset complexity: Preference is for dataset collections with moderately complex content, context, resource types, media types and file formats. Ideally at least three of the following are present:

Dataset Packaging: At a minimum, one dataset that contains a 'simple' package (e.g. a zip or tar file). The test data should also include a package with some standard semantics (e.g. a Bag with a manifest file).
Metadata Quality:

Sample datasets and related articles must have a DOI.
DataCite has a mandatory core set of properties that must be provided in order for a dataset to receive a DOI. This is used as the minimum metadata requirement for the RDSS-Archivematica MVP release.
Preference is for higher quality metadata that includes more detailed technical, administrative, and descriptive information about the dataset creators and its context of creation and use.
Preference is for metadata that is serialized (e.g. XML, JSON, CSV) and standardized (e.g. Dublin Core, DATS, DCAT, PROV-O) in formats that are equivalent to those used by RDSS HEI pilot institutions in their RDM repositories.
Preference is for files that include published checksums.
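
Two of the criteria above, the performance-test size thresholds and the preference for published checksums, lend themselves to an automated pre-check. The following Python sketch shows one way a candidate dataset directory might be screened. It is not part of the RDSS-Archivematica tooling: the names `Limits`, `check_size` and `verify_checksums` are hypothetical, the thresholds are read as total size, file count and per-file size, decimal units (1 MB = 10**6 bytes) are assumed, and the manifest is assumed to contain one `<sha256 hex digest> <relative path>` pair per line.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Limits:
    """Size thresholds. Field names are hypothetical; values come from the criteria."""
    max_total_bytes: int
    max_file_count: int  # exclusive bound: the criteria say "files < N"
    max_file_bytes: int


# Assumption: decimal units, since the criteria do not say which are meant.
MVP_LIMITS = Limits(5 * 10**9, 1000, 10**6)
BETA_LIMITS = Limits(5 * 10**12, 1_000_000, 10**12)


def check_size(root: Path, limits: Limits) -> list[str]:
    """Return human-readable threshold violations (an empty list if none)."""
    sizes = {p: p.stat().st_size for p in root.rglob("*") if p.is_file()}
    problems = []
    if len(sizes) >= limits.max_file_count:
        problems.append(f"file count {len(sizes)} is not below {limits.max_file_count}")
    total = sum(sizes.values())
    if total > limits.max_total_bytes:
        problems.append(f"total size {total} bytes exceeds {limits.max_total_bytes}")
    for path, size in sorted(sizes.items()):
        if size > limits.max_file_bytes:
            problems.append(f"{path.relative_to(root)} is {size} bytes")
    return problems


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(root: Path, manifest: Path) -> list[str]:
    """Return the relative paths whose SHA-256 digest differs from the manifest."""
    mismatches = []
    for line in manifest.read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        expected, rel_path = line.split(maxsplit=1)
        if sha256_of(root / rel_path) != expected.lower():
            mismatches.append(rel_path)
    return mismatches
```

A call such as `check_size(Path("candidate-dataset"), MVP_LIMITS)` would list every file over the per-file limit, so a dataset that fails the MVP thresholds could still be checked against `BETA_LIMITS`. The directory name here is only a placeholder.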
