Merge pull request #9089 from QualitativeDataRepository/IQSS/7349-4_creator_updates

IQSS/7349-4 creator updates in schema.org
kcondon authored Feb 1, 2023
2 parents 839065a + 863be65 commit 8ec1572
Showing 8 changed files with 362 additions and 22 deletions.
5 changes: 5 additions & 0 deletions doc/release-notes/7349-4-schema.org-updates.md
@@ -0,0 +1,5 @@
The Schema.org metadata export and the Schema.org metadata embedded in dataset pages have been updated to improve compliance with Schema.org's schema and Google's recommendations.

New jvm-option: dataverse.personOrOrg.assumeCommaInPersonName, default is false

Backward compatibility - each author/creator now has an @type of Person or Organization, and any affiliation (affiliation for a Person, parentOrganization for an Organization) is now an object of @type Organization rather than a plain string.
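
To make the backward-compatibility note concrete, here is a minimal sketch of the new creator shape built from plain Java maps. The class and method names (`CreatorShapeSketch`, `personCreator`, `organizationCreator`) are hypothetical and are not part of Dataverse; the actual export builds these objects with `javax.json` inside `DatasetVersion.getJsonLd()`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the new creator shape; not Dataverse code.
final class CreatorShapeSketch {

    // Nested object used for both "affiliation" and "parentOrganization".
    static Map<String, Object> organization(String name) {
        Map<String, Object> org = new LinkedHashMap<>();
        org.put("@type", "Organization");
        org.put("name", name);
        return org;
    }

    // Person creators carry their affiliation under "affiliation".
    static Map<String, Object> personCreator(String fullName, String affiliation) {
        Map<String, Object> creator = new LinkedHashMap<>();
        creator.put("@type", "Person");
        if (affiliation != null && !affiliation.isEmpty()) {
            creator.put("affiliation", organization(affiliation));
        }
        creator.put("name", fullName);
        return creator;
    }

    // Organization creators carry it under "parentOrganization".
    static Map<String, Object> organizationCreator(String fullName, String affiliation) {
        Map<String, Object> creator = new LinkedHashMap<>();
        creator.put("@type", "Organization");
        if (affiliation != null && !affiliation.isEmpty()) {
            creator.put("parentOrganization", organization(affiliation));
        }
        creator.put("name", fullName);
        return creator;
    }

    public static void main(String[] args) {
        System.out.println(personCreator("Finch, Fiona", "Birds Inc."));
        System.out.println(organizationCreator("Gallup Organization", null));
    }
}
```

Consumers that previously read `affiliation` as a string would need to read the `name` of the nested object instead.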
10 changes: 10 additions & 0 deletions doc/sphinx-guides/source/admin/metadataexport.rst
@@ -57,3 +57,13 @@ Downloading Metadata via API
----------------------------

The :doc:`/api/native-api` section of the API Guide explains how end users can download the metadata formats above via API.

Exporter Configuration
----------------------

Two exporters, Schema.org JSON-LD and OpenAIRE, use an algorithm to determine whether an author or contact name belongs to a person or an organization. The algorithm works well in most cases, but it does make mistakes, usually by inferring that an organization is a person.

The Dataverse software provides two JVM options that can be used to tune the algorithm:

- :ref:`dataverse.personOrOrg.assumeCommaInPersonName` - boolean, default false. If true, Dataverse will assume any name without a comma must be an organization. This may be most useful for curated Dataverse instances that enforce the "family name, given name" convention.
- :ref:`dataverse.personOrOrg.orgPhraseArray` - a JsonArray of strings. Any name that contains one of the strings is assumed to be an organization. For example, "Project" is a word that the algorithm does not otherwise associate with organizations.
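
The effect of the two options can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual implementation: the class `ExporterOptionSketch`, the method `looksLikeOrganization`, and the `knownPerson` parameter name are hypothetical, and the name and organization dictionaries that the real algorithm in `PersonOrOrgUtil` also consults are left out.

```java
import java.util.List;

// Simplified, hypothetical sketch. The real logic lives in PersonOrOrgUtil and
// also consults first-name and organization dictionaries that are omitted here.
final class ExporterOptionSketch {

    static boolean looksLikeOrganization(String name,
                                         boolean assumeCommaInPersonName,
                                         List<String> orgPhrases,
                                         boolean knownPerson) {
        // orgPhraseArray: any configured phrase contained in the name
        // marks the name as an organization (case-sensitive match).
        for (String phrase : orgPhrases) {
            if (name.contains(phrase)) {
                return true;
            }
        }
        // assumeCommaInPersonName: without a comma, and without other evidence
        // that this is a person (e.g. an identifier), assume an organization.
        if (assumeCommaInPersonName && !knownPerson && !name.contains(",")) {
            return true;
        }
        // This sketch defaults to "person"; the real code does more work here.
        return false;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeOrganization("Finch, Fiona", true, List.of("Project"), false));
        System.out.println(looksLikeOrganization("The Finch Project", false, List.of("Project"), false));
    }
}
```

Because the phrase check is a plain substring match, a phrase such as "Project" only matches names that contain it with the same capitalization.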
23 changes: 21 additions & 2 deletions doc/sphinx-guides/source/installation/config.rst
@@ -2016,8 +2016,6 @@ By default, download URLs to files will be included in Schema.org JSON-LD output

``./asadmin create-jvm-options '-Ddataverse.files.hide-schema-dot-org-download-urls=true'``

Please note that there are other reasons why download URLs may not be included for certain files such as if a guestbook entry is required or if the file is restricted.

For more on Schema.org JSON-LD, see the :doc:`/admin/metadataexport` section of the Admin Guide.

.. _useripaddresssourceheader:
@@ -2047,6 +2045,27 @@ This setting is useful in cases such as running your Dataverse installation behi
"HTTP_FORWARDED",
"HTTP_VIA",
"REMOTE_ADDR"
.. _dataverse.personOrOrg.assumeCommaInPersonName:

dataverse.personOrOrg.assumeCommaInPersonName
+++++++++++++++++++++++++++++++++++++++++++++

Please note that this setting is experimental.

The Schema.org and OpenAIRE metadata exports, and the Schema.org metadata embedded in dataset pages, try to infer whether each entry in the relevant fields (e.g. Author, Contributor) is a Person or an Organization. If you are sure that
users follow the guidance to enter people in the recommended "family name, given name" order, with a comma, you can set this to true so that entries without a comma are always assumed to be Organizations. The default is false.

.. _dataverse.personOrOrg.orgPhraseArray:

dataverse.personOrOrg.orgPhraseArray
++++++++++++++++++++++++++++++++++++

Please note that this setting is experimental.

The Schema.org and OpenAIRE metadata exports, and the Schema.org metadata embedded in dataset pages, try to infer whether each entry in the relevant fields (e.g. Author, Contributor) is a Person or an Organization.
If you have examples where an organization name is being inferred to belong to a person, you can use this setting to force it to be recognized as an organization.
The value is expected to be a JsonArray of strings. Any name that contains one of the strings is assumed to be an organization. For example, "Project" is a word that the algorithm does not otherwise associate with organizations.
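
Both settings are JVM system properties (passed with `-D`, like the other JVM options in this guide). The sketch below shows one way such values could be read; it is an illustration only. The class `PersonOrOrgSettingsSketch` is hypothetical, the sample phrase "Initiative" is invented for the example, and the string extraction is deliberately naive (it ignores escape sequences), whereas the commit itself parses the value with a JSON parser and reads both properties once, in a static initializer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of reading the two settings; not Dataverse code.
final class PersonOrOrgSettingsSketch {

    static final String ASSUME_COMMA = "dataverse.personOrOrg.assumeCommaInPersonName";
    static final String ORG_PHRASES = "dataverse.personOrOrg.orgPhraseArray";

    // Defaults to false when the property is not set.
    static boolean assumeCommaInPersonName() {
        return Boolean.parseBoolean(System.getProperty(ASSUME_COMMA, "false"));
    }

    // Naive extraction of the quoted strings in a value such as ["Project","Initiative"].
    // Assumption: phrases contain no escaped characters; a real JSON parser is safer.
    static List<String> orgPhrases() {
        List<String> phrases = new ArrayList<>();
        String raw = System.getProperty(ORG_PHRASES);
        if (raw == null || raw.isEmpty()) {
            return phrases;
        }
        Matcher m = Pattern.compile("\"([^\"\\\\]*)\"").matcher(raw);
        while (m.find()) {
            phrases.add(m.group(1));
        }
        return phrases;
    }

    public static void main(String[] args) {
        System.setProperty(ASSUME_COMMA, "true");
        System.setProperty(ORG_PHRASES, "[\"Project\", \"Initiative\"]");
        System.out.println(assumeCommaInPersonName());
        System.out.println(orgPhrases());
    }
}
```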


.. _dataverse.api.signature-secret:
52 changes: 36 additions & 16 deletions src/main/java/edu/harvard/iq/dataverse/DatasetVersion.java
@@ -1,6 +1,7 @@
package edu.harvard.iq.dataverse;

import edu.harvard.iq.dataverse.util.MarkupChecker;
import edu.harvard.iq.dataverse.util.PersonOrOrgUtil;
import edu.harvard.iq.dataverse.util.BundleUtil;
import edu.harvard.iq.dataverse.DatasetFieldType.FieldType;
import edu.harvard.iq.dataverse.branding.BrandingUtil;
@@ -1816,27 +1817,46 @@ public String getJsonLd() {
for (DatasetAuthor datasetAuthor : this.getDatasetAuthors()) {
JsonObjectBuilder author = Json.createObjectBuilder();
String name = datasetAuthor.getName().getDisplayValue();
String identifierAsUrl = datasetAuthor.getIdentifierAsUrl();
DatasetField authorAffiliation = datasetAuthor.getAffiliation();
String affiliation = null;
if (authorAffiliation != null) {
affiliation = datasetAuthor.getAffiliation().getDisplayValue();
}
// We are aware of "givenName" and "familyName" but instead of a person it might be an organization such as "Gallup Organization".
//author.add("@type", "Person");
author.add("name", name);
// We are aware that the following error is thrown by https://search.google.com/structured-data/testing-tool
// "The property affiliation is not recognized by Google for an object of type Thing."
// Someone at Google has said this is ok.
// This logic could be moved into the `if (authorAffiliation != null)` block above.
if (!StringUtil.isEmpty(affiliation)) {
author.add("affiliation", affiliation);
affiliation = datasetAuthor.getAffiliation().getValue();
}
String identifierAsUrl = datasetAuthor.getIdentifierAsUrl();
if (identifierAsUrl != null) {
// It would be valid to provide an array of identifiers for authors but we have decided to only provide one.
author.add("@id", identifierAsUrl);
author.add("identifier", identifierAsUrl);
JsonObject entity = PersonOrOrgUtil.getPersonOrOrganization(name, false, (identifierAsUrl!=null));
String givenName= entity.containsKey("givenName") ? entity.getString("givenName"):null;
String familyName= entity.containsKey("familyName") ? entity.getString("familyName"):null;

if (entity.getBoolean("isPerson")) {
// Person
author.add("@type", "Person");
if (givenName != null) {
author.add("givenName", givenName);
}
if (familyName != null) {
author.add("familyName", familyName);
}
if (!StringUtil.isEmpty(affiliation)) {
author.add("affiliation", Json.createObjectBuilder().add("@type", "Organization").add("name", affiliation));
}
//Currently all possible identifier URLs are for people not Organizations
if(identifierAsUrl != null) {
author.add("sameAs", identifierAsUrl);
//Legacy - not sure if these are still useful
author.add("@id", identifierAsUrl);
author.add("identifier", identifierAsUrl);

}
} else {
// Organization
author.add("@type", "Organization");
if (!StringUtil.isEmpty(affiliation)) {
author.add("parentOrganization", Json.createObjectBuilder().add("@type", "Organization").add("name", affiliation));
}
}
// Both cases
author.add("name", entity.getString("fullName"));
//And add to the array
authors.add(author);
}
JsonArray authorsArray = authors.build();
@@ -256,7 +256,10 @@ public static void writeCreatorsElement(XMLStreamWriter xmlw, DatasetVersionDTO
creator_map.put("nameType", "Personal");
nameType_check = true;
}

// ToDo - the algorithm to determine if this is a Person or Organization here
// has been abstracted into a separate
// edu.harvard.iq.dataverse.util.PersonOrOrgUtil class that could be used here
// to avoid duplication/variants of the algorithm
creatorName = Cleanup.normalize(creatorName);
// Datacite algorithm, https://github.com/IQSS/dataverse/issues/2243#issuecomment-358615313
if (creatorName.contains(",")) {
@@ -706,6 +709,11 @@ public static void writeContributorElement(XMLStreamWriter xmlw, String contribu
boolean nameType_check = false;
Map<String, String> contributor_map = new HashMap<String, String>();

// ToDo - the algorithm to determine if this is a Person or Organization here
// has been abstracted into a separate
// edu.harvard.iq.dataverse.util.PersonOrOrgUtil class that could be used here
// to avoid duplication/variants of the algorithm

contributorName = Cleanup.normalize(contributorName);
// Datacite algorithm, https://github.com/IQSS/dataverse/issues/2243#issuecomment-358615313
if (contributorName.contains(",")) {
@@ -717,6 +725,9 @@ public static void writeContributorElement(XMLStreamWriter xmlw, String contribu
// givenName ok
contributor_map.put("nameType", "Personal");
nameType_check = true;
// re: the above toDo - the ("ContactPerson".equals(contributorType) &&
// !isValidEmailAddress(contributorName)) clause in the next line could/should
// be sent as the OrgIfTied boolean parameter
} else if (isOrganization || ("ContactPerson".equals(contributorType) && !isValidEmailAddress(contributorName))) {
contributor_map.put("nameType", "Organizational");
}
155 changes: 155 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/util/PersonOrOrgUtil.java
@@ -0,0 +1,155 @@
package edu.harvard.iq.dataverse.util;

import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

import javax.json.JsonArray;
import javax.json.JsonObject;
import javax.json.JsonObjectBuilder;
import javax.json.JsonString;

import edu.harvard.iq.dataverse.export.openaire.Cleanup;
import edu.harvard.iq.dataverse.export.openaire.FirstNames;
import edu.harvard.iq.dataverse.export.openaire.Organizations;
import edu.harvard.iq.dataverse.util.json.JsonUtil;
import edu.harvard.iq.dataverse.util.json.NullSafeJsonBuilder;

/**
*
* @author qqmyers
*
* Adapted from earlier code in OpenAireExportUtil
*
* Implements an algorithm derived from code at DataCite to determine
* whether a name is that of a Person or Organization and, if the
* former, to pull out the given and family names.
*
* Adds parameters that can improve accuracy:
*
* * e.g. for curated repositories, allowing the code to assume that all
* Person entries are in <family name>, <given name> order.
*
* * allow local configuration of specific words/phrases that will
* automatically categorize one-off cases that the algorithm would
* otherwise mis-categorize. For example, the code appears to not
* recognize names ending in "Project" as an Organization.
*
*/

public class PersonOrOrgUtil {

    private static final Logger logger = Logger.getLogger(PersonOrOrgUtil.class.getCanonicalName());

    static boolean assumeCommaInPersonName = false;
    static List<String> orgPhrases;

    static {
        setAssumeCommaInPersonName(Boolean.parseBoolean(System.getProperty("dataverse.personOrOrg.assumeCommaInPersonName", "false")));
        setOrgPhraseArray(System.getProperty("dataverse.personOrOrg.orgPhraseArray", null));
    }

    /**
     * This method tries to determine if a name belongs to a person or an
     * organization and, if it is a person, what the given and family names are. The
     * core algorithm is adapted from a Datacite algorithm, see
     * https://github.com/IQSS/dataverse/issues/2243#issuecomment-358615313
     *
     * @param name
     *            - the name to test
     * @param organizationIfTied
     *            - if a given name isn't found, should the name be assumed to be
     *            from an organization. This could be a generic true/false or
     *            information from some non-name aspect of the entity, e.g. which
     *            field is in use, or whether a .edu email exists, etc.
     * @param isPerson
     *            - if this is known to be a person due to other info (e.g. they
     *            have an ORCID). In this case the algorithm is just looking for
     *            given/family names.
     * @return a JsonObject with the keys "fullName" and "isPerson" and, when
     *         they could be determined, "givenName" and "familyName"
     */
    public static JsonObject getPersonOrOrganization(String name, boolean organizationIfTied, boolean isPerson) {
        name = Cleanup.normalize(name);

        String givenName = null;
        String familyName = null;

        boolean isOrganization = !isPerson && Organizations.getInstance().isOrganization(name);
        if (!isOrganization) {
            for (String phrase : orgPhrases) {
                if (name.contains(phrase)) {
                    isOrganization = true;
                    break;
                }
            }
        }
        if (name.contains(",")) {
            givenName = FirstNames.getInstance().getFirstName(name);
            // contributorName=<FamilyName>, <FirstName>
            if (givenName != null && !isOrganization) {
                // givenName ok
                isOrganization = false;
                // contributor_map.put("nameType", "Personal");
                if (!name.replaceFirst(",", "").contains(",")) {
                    // contributorName=<FamilyName>, <FirstName>
                    String[] fullName = name.split(", ");
                    givenName = fullName[1];
                    familyName = fullName[0];
                }
            } else if (isOrganization || organizationIfTied) {
                isOrganization = true;
                givenName = null;
            }

        } else {
            if (assumeCommaInPersonName && !isPerson) {
                isOrganization = true;
            } else {
                givenName = FirstNames.getInstance().getFirstName(name);

                if (givenName != null && !isOrganization) {
                    isOrganization = false;
                    if (givenName.length() + 1 < name.length()) {
                        familyName = name.substring(givenName.length() + 1);
                    }
                } else {
                    // default
                    if (isOrganization || organizationIfTied) {
                        isOrganization = true;
                        givenName = null;
                    }
                }
            }
        }
        JsonObjectBuilder job = new NullSafeJsonBuilder();
        job.add("fullName", name);
        job.add("givenName", givenName);
        job.add("familyName", familyName);
        job.add("isPerson", !isOrganization);
        return job.build();
    }

    // Public for testing
    public static void setOrgPhraseArray(String phraseArray) {
        orgPhrases = new ArrayList<String>();
        if (!StringUtil.isEmpty(phraseArray)) {
            try {
                JsonArray phrases = JsonUtil.getJsonArray(phraseArray);
                phrases.forEach(val -> {
                    JsonString strVal = (JsonString) val;
                    orgPhrases.add(strVal.getString());
                });
            } catch (Exception e) {
                logger.warning("Could not parse Org phrase list");
            }
        }
    }

    // Public for testing
    public static void setAssumeCommaInPersonName(boolean assume) {
        assumeCommaInPersonName = assume;
    }

}
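
The two name-splitting steps in `getPersonOrOrganization` can be looked at in isolation. The following is a self-contained sketch, not the committed code: `NameSplitSketch` and its method names are hypothetical. It also trims around the comma instead of splitting on `", "`, which is a choice made for this sketch so that input without a space after the comma does not fail; whether `Cleanup.normalize` already guarantees that space is not visible in this diff.

```java
// Hypothetical sketch of the name-splitting steps; not Dataverse code.
final class NameSplitSketch {

    // For "<FamilyName>, <GivenName>" input with exactly one comma, returns
    // {givenName, familyName}. Returns {null, null} for zero or several commas.
    static String[] splitFamilyCommaGiven(String name) {
        int first = name.indexOf(',');
        if (first < 0 || name.indexOf(',', first + 1) >= 0) {
            return new String[] { null, null };
        }
        String family = name.substring(0, first).trim();
        String given = name.substring(first + 1).trim();
        return new String[] {
            given.isEmpty() ? null : given,
            family.isEmpty() ? null : family
        };
    }

    // For "<GivenName> <FamilyName>" input where the given name is already known
    // and leads the string: everything after it (and one separator character)
    // is taken as the family name, mirroring the substring step in the commit.
    static String familyAfterGiven(String name, String givenName) {
        if (givenName.length() + 1 < name.length()) {
            return name.substring(givenName.length() + 1);
        }
        return null;
    }

    public static void main(String[] args) {
        String[] parts = splitFamilyCommaGiven("Finch, Fiona");
        System.out.println(parts[0] + " / " + parts[1]);
        System.out.println(familyAfterGiven("Fiona Finch", "Fiona"));
    }
}
```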
@@ -85,13 +85,15 @@ public void testExportDataset() throws JsonParseException, ParseException, IOExc
assertEquals("https://doi.org/10.5072/FK2/IMK5A4", json2.getString("identifier"));
assertEquals("Darwin's Finches", json2.getString("name"));
assertEquals("Finch, Fiona", json2.getJsonArray("creator").getJsonObject(0).getString("name"));
assertEquals("Birds Inc.", json2.getJsonArray("creator").getJsonObject(0).getString("affiliation"));
assertEquals("Birds Inc.", json2.getJsonArray("creator").getJsonObject(0).getJsonObject("affiliation").getString("name"));
assertEquals("https://orcid.org/0000-0002-1825-0097", json2.getJsonArray("creator").getJsonObject(0).getString("@id"));
assertEquals("https://orcid.org/0000-0002-1825-0097", json2.getJsonArray("creator").getJsonObject(0).getString("identifier"));
assertEquals("https://orcid.org/0000-0002-1825-0097", json2.getJsonArray("creator").getJsonObject(0).getString("sameAs"));
assertEquals("Finch, Fiona", json2.getJsonArray("author").getJsonObject(0).getString("name"));
assertEquals("Birds Inc.", json2.getJsonArray("author").getJsonObject(0).getString("affiliation"));
assertEquals("Birds Inc.", json2.getJsonArray("author").getJsonObject(0).getJsonObject("affiliation").getString("name"));
assertEquals("https://orcid.org/0000-0002-1825-0097", json2.getJsonArray("author").getJsonObject(0).getString("@id"));
assertEquals("https://orcid.org/0000-0002-1825-0097", json2.getJsonArray("author").getJsonObject(0).getString("identifier"));
assertEquals("https://orcid.org/0000-0002-1825-0097", json2.getJsonArray("author").getJsonObject(0).getString("sameAs"));
assertEquals("1955-11-05", json2.getString("datePublished"));
assertEquals("1955-11-05", json2.getString("dateModified"));
assertEquals("1", json2.getString("version"));
@@ -115,7 +117,7 @@ public void testExportDataset() throws JsonParseException, ParseException, IOExc
assertEquals("LibraScholar", json2.getJsonObject("includedInDataCatalog").getString("name"));
assertEquals("https://librascholar.org", json2.getJsonObject("includedInDataCatalog").getString("url"));
assertEquals("Organization", json2.getJsonObject("publisher").getString("@type"));
assertEquals("LibraScholar", json2.getJsonObject("provider").getString("name"));
assertEquals("LibraScholar", json2.getJsonObject("publisher").getString("name"));
assertEquals("Organization", json2.getJsonObject("provider").getString("@type"));
assertEquals("LibraScholar", json2.getJsonObject("provider").getString("name"));
assertEquals("Organization", json2.getJsonArray("funder").getJsonObject(0).getString("@type"));
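
The updated assertions read `affiliation` as an object with a `name` rather than as a string. A client that has to handle output from both before and after this change could accept either form, as in the sketch below. It assumes the JSON has already been parsed into plain Java maps; `AffiliationReaderSketch` and `affiliationName` are hypothetical names and not part of Dataverse.

```java
import java.util.Map;

// Hypothetical consumer-side helper; not part of Dataverse.
final class AffiliationReaderSketch {

    // Accepts either the legacy string form or the new object form.
    static String affiliationName(Map<String, Object> creator) {
        Object value = creator.get("affiliation");
        if (value instanceof String) {
            // Legacy form: "affiliation": "Birds Inc."
            return (String) value;
        }
        if (value instanceof Map) {
            // New form: "affiliation": {"@type": "Organization", "name": "Birds Inc."}
            Object name = ((Map<?, ?>) value).get("name");
            return (name instanceof String) ? (String) name : null;
        }
        // No affiliation present.
        return null;
    }

    public static void main(String[] args) {
        System.out.println(affiliationName(Map.of("affiliation", "Birds Inc.")));
        System.out.println(affiliationName(Map.of("affiliation", Map.of("name", "Birds Inc."))));
    }
}
```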