Support flexible DataCite resourceType metadata in a Dataset #7077

Open
poikilotherm opened this issue Jul 14, 2020 · 13 comments
Labels
Feature: DOI & Handle; Feature: Metadata; HERMES (related to @hermes-hmc work on Dataverse code); Size: Queued (PM has called this issue out specifically for sizing); Type: Feature (a feature request); UX & UI: Design (this issue needs input on the design of the UI and from the product owner)

Comments

@poikilotherm
Contributor

poikilotherm commented Jul 14, 2020

tl;dr: Dataverse should offer a UI component to select the general dataset type with a CV based on DataCite/Dublin Core. The selected type should be used for metadata registration at DataCite.

Related issues:

Jülich DATA has an open request by @sciapp to publish software in our repo and get a DOI for releases (in line with the FORCE11 and DataCite recommendations).

Recently, the RDA WG published a paper, open for community comments, about the different PID options for software publications. The interesting part for Dataverse is the diagram for registered software datasets at DataCite.

Currently, software published via Dataverse gets resourceTypeGeneral="DataSet" attached to it, because the metadata template does not allow for customization. (This is also true for #5086, where we might think about using type "Collection" for the dataset automatically and specific types for the files.)

Having software counted as "DataSet" makes things less discoverable and does not push research software engineering forward. See DataCite Schema Docs for a complete list of types.

A full example of software metadata can be found at DataCite.

Things to do for an implementation:

  1. Obviously, this will need a new metadata field in citation.tsv, using a controlled vocabulary based on resourceTypeGeneral. Should be mandatory, but happy to discuss. This will definitely have to be present onCreate, but could use "DataSet" as a default!
  2. Extend the metadata template and code to fill in the value selected in UI.
  3. Extend the metadata exporters where feasible.

As this is a request to our ZB services and will become more important for Software Citations, we offer implementing it at the dataset level. Maybe @philippconzett can collaborate to provide it for file level in the same go?
Comments please! 🚀

Pinging @IngoHeimbach @doigl @bronger @mfenner @TaniaSchlatter

@poikilotherm poikilotherm added Type: Feature a feature request UX & UI: Design This issue needs input on the design of the UI and from the product owner Feature: Metadata Feature: DOI & Handle Medium labels Jul 14, 2020
@philippconzett
Contributor

Are you suggesting to introduce a general resource type "Software" (or similar)? In the SCID WG report you refer to, they are also talking about a more granular classification of "software artifacts", e.g. code fragment, file, directory, commit, release, ... And at the end of the report, the authors conclude that "[t]he next step would be to produce a set of recommendations based on these findings". Maybe the implementation in Dataverse should be based on the forthcoming recommendations?

As for the resource type of files within a dataset, you say:

This is also true for #5086, where we might think about using type "Collection" for the dataset automatically and specific types for the files.

I'd rather not replace "Dataset" with "Collection" as resource type to classify datasets. Although datasets may be seen as collections of files, I think we should reserve the term "Collection" for collections of datasets and other research outputs (e.g. software). That's at least how I interpreted "Collection" when I applied it to a sub-dataverse within DataverseNO upon request from a research group who wanted to have a DOI for their whole sub-dataverse / collection; cf. https://doi.org/10.18710/AJ4S-X394.

@TaniaSchlatter
Member

@poikilotherm this is something we have discussed and there are many related topics/issues. I am pinging @jggautier and @djbrooke, as they are involved in considering this and related questions.

@poikilotherm
Contributor Author

@TaniaSchlatter thank you!

@philippconzett
I am with you about reserving "Collection" for future use of providing DOI for Dataverses. Great idea. ❤️ 👍

About my suggestions: I would like to see a new metadata field in citation.tsv for that XML of <resourceType resourceTypeGeneral="$insert">$insert</resourceType>.

$insert would be a value from a controlled vocabulary based on DataCite/Dublin Core vocabs (which are the very same vocabulary).

As far as I understood the DataCite schema, they are open for more detailed/descriptive types in the XML value of <resourceType>, defaulting to the same value of resourceTypeGeneral.
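To make the idea concrete, here is a minimal sketch of how the element could be filled in. It is only an illustration in Python, not Dataverse code: the function name `resource_type_element` and its `specific` parameter are hypothetical, and the fallback to the general type mirrors my reading of the schema described above.

```python
import xml.etree.ElementTree as ET


def resource_type_element(general, specific=None):
    """Build a DataCite-style <resourceType> element (illustrative helper)."""
    elem = ET.Element("resourceType", {"resourceTypeGeneral": general})
    # Assumption: when no more descriptive type is given, reuse the general one.
    elem.text = specific if specific else general
    return elem


xml_snippet = ET.tostring(resource_type_element("Software"), encoding="unicode")
print(xml_snippet)
```

With a single value selected from the controlled vocabulary, both `$insert` slots receive the same string; a more descriptive free-text type could be passed as the second argument later.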

@jggautier
Contributor

jggautier commented Jul 27, 2020

While working on a GitHub integration, it was a given that we wanted Dataverse to support software publication, so we spoke about how the current metadata exports label what's being published as "datasets" and how software would need to be represented in the metadata. But the DataCite and Dublin Core vocabs for resource types mention more than software.

@poikilotherm, is this GitHub issue, and in particular your comment above this one, recommending that Dataverse support the publication of all of those types of objects or just software/software artifacts?

@poikilotherm
Contributor Author

poikilotherm commented Jul 27, 2020

The software publication is just my particular use case. The issue and implementation would, if you think that scope is fine, be about complete flexibility, as IMHO it doesn't make much sense to limit this to "DataSet" and "Software" artificially.

Instead, we should go for allowing the complete controlled vocabulary of terms as mentioned above. In terms of metadata blocks it's easy to do because it is a CV, can have a sensible default ("DataSet") and can stay hidden from the user if not supposed to be important for a Dataverse. In terms of creating the XML for DataCite it's easy to do, because it's about inserting two values in a simple String. Haven't looked into exports yet, but shouldn't be rocket science.

The metadata field in citation.tsv would be like:

name: resourceTypeGeneral
title: General Resource Type
description: What general type of resource fits best for this dataset?
watermark:
fieldType: text
displayOrder: 0
displayFormat: 
advancedSearchField: TRUE
allowControlledVocabulary: TRUE
allowmultiples: FALSE
facetable: TRUE
displayoncreate: FALSE
required: FALSE
parent: 
metadatablock_id: citation
termURI: http://purl.org/dc/terms/DCMIType
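As a rough sketch of how this definition could be turned into a tab-separated row, the following Python snippet serializes the values listed above. The key order is simply the order used in this comment; the exact column layout of the real citation.tsv is an assumption here and should be checked against the file itself. `FIELD` and `to_tsv` are illustrative names.

```python
import csv
import io

# Values copied from the field definition above. The key order is the order
# used in this comment, which is an assumption about the column layout.
FIELD = {
    "name": "resourceTypeGeneral",
    "title": "General Resource Type",
    "description": "What general type of resource fits best for this dataset?",
    "watermark": "",
    "fieldType": "text",
    "displayOrder": "0",
    "displayFormat": "",
    "advancedSearchField": "TRUE",
    "allowControlledVocabulary": "TRUE",
    "allowmultiples": "FALSE",
    "facetable": "TRUE",
    "displayoncreate": "FALSE",
    "required": "FALSE",
    "parent": "",
    "metadatablock_id": "citation",
    "termURI": "http://purl.org/dc/terms/DCMIType",
}


def to_tsv(rows):
    """Serialize a list of dicts (same keys) as tab-separated text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


tsv_text = to_tsv([FIELD])
print(tsv_text)
```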

The controlled vocabulary would be as follows:

resourceTypeGeneral DataSet dataset 0
resourceTypeGeneral Event event 1
resourceTypeGeneral Image image 2
resourceTypeGeneral InteractiveResource interactiveresource 3
resourceTypeGeneral MovingImage movingimage 4
resourceTypeGeneral Software software 5
resourceTypeGeneral Sound sound 6
resourceTypeGeneral StillImage stillimage 7
resourceTypeGeneral Text text 8

Note: I left out Collection, PhysicalObject and Service on purpose.
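A small sketch of how such a vocabulary could be looked up with a default, assuming the four columns above are field name, display value, identifier and display order. The rows are split on whitespace because that is how they appear in this comment; a real TSV would be tab-separated. The casing "DataSet" follows the list above and should be checked against DataCite's own value list before sending. `parse_cv` and `resolve` are hypothetical helper names.

```python
# Vocabulary rows exactly as listed in the comment above.
CV_ROWS = """\
resourceTypeGeneral DataSet dataset 0
resourceTypeGeneral Event event 1
resourceTypeGeneral Image image 2
resourceTypeGeneral InteractiveResource interactiveresource 3
resourceTypeGeneral MovingImage movingimage 4
resourceTypeGeneral Software software 5
resourceTypeGeneral Sound sound 6
resourceTypeGeneral StillImage stillimage 7
resourceTypeGeneral Text text 8
"""


def parse_cv(text):
    """Map each identifier to its (display value, display order) pair."""
    vocab = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        _field, value, identifier, order = line.split()
        vocab[identifier] = (value, int(order))
    return vocab


def resolve(vocab, user_input, default="dataset"):
    """Return the vocabulary value for user_input, or the default entry's value."""
    key = (user_input or "").strip().lower()
    value, _order = vocab.get(key, vocab[default])
    return value


VOCAB = parse_cv(CV_ROWS)
print(resolve(VOCAB, "Software"))    # a listed term is returned as-is
print(resolve(VOCAB, "Collection"))  # an omitted term falls back to the default
```

Falling back to the default for anything outside the list keeps existing datasets valid without forcing depositors to pick a type.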

Again:
This issue is not about doing it for software only.
This issue is not about going for 100% flexibility of the DataCite schema, which allows free text as value of <resourceType> (while resourceTypeGeneral attribute has a controlled vocabulary).

@poikilotherm
Contributor Author

Thanks @pdurbin for initiating a call appointment for this.
People currently on the list to join: @jggautier and @djbrooke

@qqmyers as this is metadata related, would you like to be included in the poll for date and time?

@qqmyers
Member

qqmyers commented Jul 28, 2020

Thanks. I'd suggest @adam3smith - he's very interested in/knowledgeable about best practices in reporting to DataCite. I'd be happy to join in, but it looks like this would mostly involve using metadata blocks as designed versus requiring design changes (where I think my metadata focus is).

@poikilotherm
Contributor Author

poikilotherm commented Aug 10, 2020

Thank you @jggautier and @djbrooke for our video call earlier today. Let me summarize our findings for future reference and being SLOPI.

We discussed the topic and concluded that you would rather see this as part of a bigger solution towards finally solving #2739. @jggautier kindly provided a list of work items to properly support software deposition in Dataverse at https://docs.google.com/document/d/1cDzVyc70SXYnbdRolYfY9tSwu9NMzHaD-FcNGW_sNyU.

A short summary:

  1. DataCite Metadata enhancing (this issue)
  2. Make the "version" attribute capable for software versions instead of fixed to Dataverse internal versioning
  3. Enhance UI/UX text so that it covers not only data but also software.
  4. Enhance software license representation (long-standing issue "License: Multiple Options for Licensing", #1753) and support it in metadata sent to DataCite et al. (SPDX, ...).

For now, other key features are in focus, so we agreed this will happen in our fork for Jülich DATA right now. If we are to work on more items from that list, we keep each other posted via issues, screenshots etc. Our work would serve as a starting point for upstream support of this feature, once this gets more traction on your side again.

As this is still of interest to you, I will leave this issue open. @djbrooke, if you feel we should shorten the list of open issues, feel free to close it.

@poikilotherm
Contributor Author

poikilotherm commented Mar 29, 2022

For a recent discussion happening in #8536 I looked at the DataCite Schema 4.4 for resourceType again and now would consider these types:

  • Dataset
  • Software
  • Workflow

@pdurbin
Member

pdurbin commented Oct 1, 2022

Recent discussion here:

@mreekie mreekie added the bk2211 label Nov 1, 2022
@mreekie mreekie moved this to Community Dev in IQSS Dataverse Project Nov 1, 2022
@poikilotherm poikilotherm moved this from Community Backlog (Phil) to HERMES (Oliver) in IQSS Dataverse Project Dec 7, 2022
@mreekie mreekie removed the bk2211 label Jan 11, 2023
@mreekie mreekie removed the sz.Medium label Jan 11, 2023
@mreekie mreekie added the Size: Queued PM has called this issue out specifically for sizing label Jan 23, 2023
@mreekie

mreekie commented Jan 23, 2023

sizing:

  • PM added to ordered sizing queue

@poikilotherm
Contributor Author

Leaving a note here that we might want to use https://vocabularies.coar-repositories.org/resource_types/

@pdurbin
Member

pdurbin commented Jul 30, 2024

In this pull request...

... at 8593d32 I'm sending "Dataset", "Software" or "Workflow" for resourceTypeGeneral to DataCite. (Previously this was hard-coded to "Dataset".)

Here's an example of "Software" (next to the name, pyDataverse) in the DataCite test environment:

[Screenshot, 2024-07-29: DataCite test environment record for pyDataverse labelled "Software"]

Projects
Status: Important

7 participants