
Query Dataverse for mandatory metadata fields via API #6978

Closed
richarda23 opened this issue Jun 11, 2020 · 19 comments · Fixed by #10109
Labels
Feature: API Hackathon: More APIs Add new or missing API endpoints HERMES related to @hermes-hmc work on Dataverse code Type: Feature a feature request User Role: API User Makes use of APIs

Comments

@richarda23

RSpace ELN uses the Dataverse API to submit research data to Dataverse. It has a minimal UI for metadata fields such as title, subject, description, authors, and contacts, and this has worked on various Dataverse installations so far.

One of our RSpace customers runs their own Dataverse installation (version 4.19). They have configured it to require additional metadata when submitting a dataset. RSpace doesn't know these fields are mandatory, so submission fails:

```
Deposit failed: ERROR 2020-06-09T10:58:26Z Processing failed
Couldn't update dataset edu.harvard.iq.dataverse.engine.command.exception.IllegalCommandException: Validation Failed:
  Producer Name is required. (Invalid value:edu.harvard.iq.dataverse.DatasetField[ id=null ]),
  Distributor Name is required. (Invalid value:edu.harvard.iq.dataverse.DatasetField[ id=null ]),
  Description Date is required. (Invalid value:edu.harvard.iq.dataverse.DatasetField[ id=null ]),
  Keyword Term is required. (Invalid value:edu.harvard.iq.dataverse.DatasetField[ id=null ]),
  Deposit Date is required. (Invalid value:edu.harvard.iq.dataverse.DatasetField[ id=null ]).
```
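
In the meantime, a client could at least recover the missing field names from such an error message. A minimal sketch in Python, assuming the message format shown above (the regex is my own assumption, not an official Dataverse contract):

```python
import re

def missing_required_fields(error_message: str) -> list[str]:
    """Extract field names from '<Field> is required.' fragments in a
    Dataverse validation error message (format as seen above)."""
    return re.findall(r"([A-Z][\w ]*?) is required\.", error_message)

msg = ("Validation Failed: Producer Name is required. (Invalid value: ...), "
       "Distributor Name is required. (Invalid value: ...)")
print(missing_required_fields(msg))  # ['Producer Name', 'Distributor Name']
```

This is brittle (it depends on the exact error wording), which is exactly why a proper API for mandatory fields would be better.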

This corresponds exactly to the list of required properties, as sent to us by the Dataverse admin, that are not set by RSpace:

  • Author - Name
  • Contact - Name
  • Contact - Email
  • Description - Text
  • Description - Date
  • Keyword - Term
  • Producer - Name
  • Distributor - Name (In our default templates, this is always the name of the (sub-)dataverse. I'm not sure how this should be handled when a dataset is created from RSpace.)
  • Deposit Date (In Dataverse, this is generated by the system.)

If this list never changes, RSpace could read a list of mandatory fields from a configuration file. But if it does change from time to time, it would be great if Dataverse had an API method to get the list of mandatory metadata fields. A client could then programmatically generate input fields for these properties so that the end user could make a valid submission.
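
The configuration-file fallback could look roughly like this sketch. Note that the field identifiers here are illustrative placeholders, not verified Dataverse internal names:

```python
# Sketch of the configuration-file fallback: the client ships a
# per-installation list of mandatory fields and checks a submission
# against it before calling the Dataverse API.
# NOTE: these identifiers are illustrative, not real Dataverse field names.

MANDATORY_FIELDS = {
    "authorName", "contactName", "contactEmail",
    "descriptionText", "descriptionDate", "keywordTerm",
    "producerName", "distributorName",
}

def missing_fields(metadata: dict) -> set[str]:
    """Return the mandatory fields that are absent or empty."""
    return {f for f in MANDATORY_FIELDS if not metadata.get(f)}

draft = {"authorName": "Jane Doe", "contactEmail": "jane@example.org"}
print(sorted(missing_fields(draft)))
```

An API-backed version would simply replace the hard-coded set with the result of a query, keeping the list always up to date.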

@richarda23
Author

Just to add, the customer has said that once they have defined their mandatory metadata for a dataverse or subdataverse, it seldom or never changes. So we (RSpace) could just use a static-list-lookup mechanism to handle this particular use case. But being able to retrieve the properties via an API call would be superior, as it would always be up to date.

@djbrooke
Contributor

Thanks @richarda23, this is a good idea and makes sense.

@djbrooke
Contributor

djbrooke commented Nov 9, 2021

A question mostly for @pdurbin - would the solution implemented in #7942 allow for this? Or no?

@pdurbin
Member

pdurbin commented Nov 9, 2021

Hmm, from pull request #7942 I believe metadata_fields=citation:* will only show you the citation fields that have been filled in. There are many more citation fields that are not shown. Also, it doesn't indicate whether fields are required or not. Good thought, though. People might come up with creative uses for that new functionality. 😄

Something else to consider is that templates can require additional fields, but I don't know whether the API respects this or not. It should if it doesn't. From the original report above, it sounds like templates may be in use and enforced via the API, because Producer Name is not one of the five fields that is usually required.

This issue is related in the sense that it would be nice if the API could return more information about what it needs or allows:

There is an "admin" API that can give some detail on metadata fields, but as noted at https://guides.dataverse.org/en/5.8/admin/metadatacustomization.html#exploring-metadata-blocks the output is ugly and could stand to be cleaned up before it's ready for public consumption:

[Screenshot, 2021-11-09: raw output of the admin metadata-fields API]

Here you can see that "title" is required:

```shell
$ curl -s http://localhost:8080/api/admin/datasetfield/title | jq .
{
  "status": "OK",
  "data": {
    "name": "title",
    "id": 1,
    "title": "Title",
    "metadataBlock": "citation",
    "fieldType": "TEXT",
    "allowsMultiples": false,
    "hasParent": false,
    "controlledVocabularyValues": [],
    "parentAllowsMultiples": "N/A (no parent)",
    "solrFieldSearchable": "title",
    "solrFieldFacetable": "title_s",
    "isRequired": true,
    "uri": "http://purl.org/dc/terms/title"
  }
}
```
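
A client could consume that response directly to learn whether a field is required. A small sketch assuming the JSON shape shown above:

```python
import json

# Response shape copied (abbreviated) from the
# /api/admin/datasetfield/title example above.
response = """
{
  "status": "OK",
  "data": {
    "name": "title",
    "fieldType": "TEXT",
    "allowsMultiples": false,
    "controlledVocabularyValues": [],
    "isRequired": true
  }
}
"""

def is_required(api_response: str) -> bool:
    """Read the isRequired flag from a datasetfield API response."""
    payload = json.loads(api_response)
    return bool(payload["data"].get("isRequired", False))

print(is_required(response))  # True
```

Of course, this admin endpoint is usually blocked from outside access, which is part of why a public, cleaned-up equivalent would be needed.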

@djbrooke
Contributor

djbrooke commented Nov 9, 2021

Got it - thanks @pdurbin. I was hopeful :)

@poikilotherm
Contributor

poikilotherm commented Nov 9, 2021

This issue is also relevant to @hermes-hmc, as we might want to validate metadata before depositing instead of relying on trial and error.

I'm going to add the HERMES label to make it easier to track what might be in scope for our project.

@poikilotherm poikilotherm added the HERMES related to @hermes-hmc work on Dataverse code label Nov 9, 2021
@Kris-LIBIS
Contributor

At KU Leuven we are interested in this as well, for future integrations with our other systems. One additional piece of information that would be required to generate valid submissions is the set of allowed values for fields with a controlled vocabulary. External vocabularies may make that much more complex, but I hope those would be supported too.
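
If the API exposed controlled-vocabulary values per field (as the `controlledVocabularyValues` array above suggests), a client could validate values, or offer a picker, before submitting. A minimal sketch; the abbreviated subject list here is illustrative:

```python
# Hypothetical client-side check against a controlled vocabulary.
# The subject list below is an abbreviated, illustrative stand-in for
# the fixed citation-block subject vocabulary.

SUBJECT_VOCAB = {
    "Arts and Humanities",
    "Computer and Information Science",
    "Medicine, Health and Life Sciences",
    "Social Sciences",
    "Other",
}

def validate_vocab(value: str, vocabulary: set[str]) -> bool:
    """True only if the value is an exact member of the vocabulary."""
    return value in vocabulary

print(validate_vocab("Social Sciences", SUBJECT_VOCAB))  # True
print(validate_vocab("Sociology", SUBJECT_VOCAB))        # False
```

External vocabularies would need a lookup service rather than a static set, which is where the extra complexity comes in.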

@philippconzett
Contributor

In a future version of Dataverse where issue 6885 has hopefully been solved, recommended metadata fields should also show up in the integrated system. I guess in some or many cases the list of metadata fields can become quite long, as we want our depositors to provide as much metadata as possible. I wonder whether users in most (all?) cases would have to navigate to the actual dataset draft in Dataverse anyway and add additional metadata there. So the question is how integration involving metadata registration can be designed in a way that makes the researcher's work as easy as possible.

@philippconzett
Contributor

I have just had a discussion with @shlake about Dataverse integrations with tools like OSF and RSpace.
The more I think about these kinds of integration, the more I think the integration needs to go the other way round, that is, from Dataverse to OSF, RSpace, etc. This would mean that a user creates a dataset in Dataverse, where they can use a dataset/metadata template. When uploading files, they would be able to access tools like OSF and RSpace to select the files they want to upload to the dataset (in the same way they can, if the integration is activated, select and upload files from Dropbox etc.). I know that Harvard/Dataverse want integrations to work the other way round (from OSF/RSpace to Dataverse), but I think for file upload integrations it's most convenient for the user to start by creating a dataset WITHIN Dataverse.

@Kris-LIBIS
Contributor

The university and their researchers have made it very clear that they expect to be able to create datasets from the institutional repository (Symplectic Elements). We hope to go live in January next year, and that will be without this feature, but we will have to implement it somehow in the next year or so. There is also a request to migrate datasets from iRODS to Dataverse.

Agreed, we may be able to work around it to some extent, but solving this GitHub issue would surely make it easier to get our integration scenarios working.

@richarda23
Author

richarda23 commented Nov 22, 2021

From the RSpace perspective, the idea of researchers being able to make deposits from an ELN is solely to lower the barrier to getting data, files, and associated metadata (like ORCID iDs and author information) into a repository, and to be able to do so from familiar software.
We certainly don't intend to replicate the full rich editing experience of every repository we integrate with. If an institution requires a large number of compulsory metadata fields, could that be implemented as a requirement for publishing the dataset, rather than for merely adding content to it?
E.g.:

  • Initial submission/creation from RSpace requires minimal data (enough to satisfy Dataverse's database schema or API validation).
  • Further metadata/data is added in the Dataverse UI.
  • Publishing requires compulsory fields to have valid values.
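
The two-step model above could be sketched like this (all field names and thresholds are hypothetical):

```python
# Two-step gate: a minimal field set allows saving a draft; the full
# mandatory set is only enforced at publish time.
# All field names here are hypothetical examples.

DRAFT_REQUIRED = {"title"}
PUBLISH_REQUIRED = DRAFT_REQUIRED | {"authorName", "subject", "producerName"}

def filled(metadata: dict) -> set:
    """Keys of the metadata dict that have a non-empty value."""
    return {k for k, v in metadata.items() if v}

def can_create_draft(metadata: dict) -> bool:
    return DRAFT_REQUIRED <= filled(metadata)

def can_publish(metadata: dict) -> bool:
    return PUBLISH_REQUIRED <= filled(metadata)

draft = {"title": "Pilot study"}
print(can_create_draft(draft), can_publish(draft))  # True False
```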

I don't think the counter-proposal of pulling from an ELN into Dataverse is an either/or scenario; both would work, depending on what the user prefers. But 'pull from ELN' requires Dataverse to develop UI to browse and configure exports for each and every ELN or data source it wants to support.

A 2-step procedure would make it easy for researchers to get started making a deposit in draft form, yet still require full verification in order to publish.

@pdurbin
Member

pdurbin commented Nov 22, 2021

@richarda23 interesting thought. Please see also this issue:

We already have the concept of an N/A value that has to be replaced with a real value before the dataset is published via the GUI. To see this in action:

  • Create a dataset via SWORD without a subject (the subject will be "N/A" in the database; see "N/A" at https://guides.dataverse.org/en/5.8/api/sword.html ).
  • Edit the dataset metadata in the GUI and try to save. You will be prompted to pick a subject from the controlled vocabulary before you can save.

@TaniaSchlatter
Member

TaniaSchlatter commented Nov 23, 2021

This is a great conversation, and provides specific examples that relate to several issues. Discovery work on #7376 points to possible benefits of a two-step process like what @richarda23 outlines.

#7376 originated from a few questions: "how might we help users add metadata without making it too laborious to publish a dataset?," "how might we make adding and editing metadata more clear?" and "how might we more clearly define what is considered metadata?" We reviewed features and considered deposit/edit workflows. The next step is to mock up UI changes that present metadata required to create a draft more clearly as step 1, and additional, configurable metadata (could be required, recommended, optional as suggested in #6885) as step 2, prior to publishing.

While querying for mandatory fields may still be necessary, I wonder if it is possible to instead agree on the metadata needed to create a draft, and build out/improve how datasets are "enriched".

@Kris-LIBIS
Contributor

I'm in favour of the two-step approach and the idea of being able to save a draft dataset with incomplete metadata. Like @richarda23, our aim is to be able to create a dataset from another application and to transfer as much as possible of the data and metadata already known in the external application. If that metadata does not have to be complete, that would make the integration process much easier. We agree that finalizing the dataset and publishing it should be done within Dataverse.

Still, being able to query Dataverse for the details of the metadata is a plus. It would be helpful in mapping metadata between applications. I assume that the API call should operate on a given Dataverse collection.

@richarda23
Author

Yes, I agree totally. Knowing what metadata is required would also help the external app know what metadata to send; there might be some metadata fields requiring computation or straightforward user input that it would be good to know about at the time of deposit. It would give the external app the best chance of making a valid submission, which would be good for the user. After submission, Dataverse could respond with a boolean success indicator or a list of missing required fields, which could be shown to the user. It would be nice for the user to have immediate feedback that their submission is accepted, valid, and ready to be published.
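
The richer deposit response suggested here might look like the following sketch (a purely hypothetical shape, not an existing Dataverse API):

```python
import json

# Hypothetical deposit response: instead of a bare failure, the server
# reports which required fields are missing so the client can prompt
# the user. This shape is an assumption, not an existing Dataverse API.

def deposit_response(missing_fields: list[str]) -> str:
    return json.dumps({
        "valid": not missing_fields,
        "missingRequiredFields": missing_fields,
    })

print(deposit_response(["producerName"]))
print(deposit_response([]))
```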

@philippconzett
Contributor

I guess for "standard" dataverses/collections the current approach may work fine. For more customized dataverses, where e.g. metadata templates with pre-filled fields are used to make depositing as easy as possible, an external tool >> Dataverse integration might be more cumbersome for the depositor. In those cases I'd prefer to create the dataset within Dataverse, and then, if there is no Dataverse >> external tool integration that allows uploading the data from the external tool, I'd go back to the external tool and push the data into the created dataset. If I remember correctly, this is how the OSF >> Dataverse integration works: you have to choose a specific dataset when you want to push your files into Dataverse.

@pdurbin
Member

pdurbin commented Oct 9, 2022

for "standard" dataverses/collections, the current approach may work fine... this is how the OSF >> Dataverse integration works, thus you have to choose a specific dataset when you want to push your files into Dataverse.

Right, integrations like OSF, RSpace, OJS, and Renku all assume "standard" collections and only send the five required fields (title, author, subject, description, contact). They don't have any way to query the Dataverse installation to ask if any other fields are required for this or that collection.

Subject is a fixed controlled vocabulary, and this old issue is about how you can't query that either. Even though the list is fixed, it should also be queryable so apps like RSpace don't have to hard-code it:

Finally, this older issue is very similar:

2023-01-20 update... related:

@pdurbin
Member

pdurbin commented Jul 13, 2023

From @Kris-LIBIS:

"we have a dependency on issue #6978 to know which metadata fields are available in Dataverse, which are mandatory, and what the valid controlled-vocabulary field values are.

In the absence of a solution for the issue above, we submitted PR #8940, which is merged now and ready for 5.14. The PR will allow the RDM integration tool to create datasets with no metadata at all."

-- https://groups.google.com/g/dataverse-community/c/aGt1ILi1Hf4/m/fnGO-Io_AQAJ

@pdurbin
Member

pdurbin commented Nov 30, 2023

Hello, all!

@richarda23 and everyone, does the following PR resolve this issue? Should we mark it as closing this issue (on merge)?

Update: I went ahead and marked the PR to close this issue on merge.
