IQSS/7349-4 creator updates in schema.org #9089

Merged
Changes from 25 commits
27 commits
0b375cf
add type for person/org, add sameas, fix affiliation
qqmyers Oct 18, 2022
5bd58d8
typo
qqmyers Oct 18, 2022
63cd77d
capitalization
qqmyers Oct 19, 2022
8084fb8
update tests
qqmyers Oct 19, 2022
489d0e3
legacy test issue
qqmyers Oct 19, 2022
c3260a5
change fullname -> fullName
qqmyers Oct 19, 2022
3ddc796
note todos
qqmyers Oct 19, 2022
05ea63a
add tests
qqmyers Oct 19, 2022
6ca9f70
don't send giveName for orgs
qqmyers Oct 19, 2022
a5fafd0
release note
qqmyers Oct 19, 2022
f222160
bugfix for no givenName/familyName from algorithm
qqmyers Oct 20, 2022
41c30d9
add assumeCommaInPersonName and tests
qqmyers Oct 21, 2022
d5d3655
update docs/release note
qqmyers Oct 21, 2022
ebb1380
added org Phrases for DANS
qqmyers Oct 28, 2022
4dcd8ed
fix affiliation value (no parens)
qqmyers Oct 28, 2022
0184b3d
logic fix
qqmyers Oct 28, 2022
545a295
comma check shouldn't override isPerson
qqmyers Oct 28, 2022
ab2326c
always set givenName null for Org
qqmyers Oct 28, 2022
0d54106
optimize - break out of loop when done
qqmyers Oct 28, 2022
1d935fe
documentation of new options
qqmyers Oct 28, 2022
a5ae4d7
add labels
qqmyers Oct 28, 2022
ab7bb33
Merge remote-tracking branch 'IQSS/develop' into IQSS/7349-4_creator_…
qqmyers Nov 7, 2022
4bec5f6
merge
qqmyers Dec 14, 2022
839376d
Merge remote-tracking branch 'IQSS/develop' into IQSS/7349-4_creator_…
qqmyers Jan 20, 2023
ec10bad
Merge remote-tracking branch 'IQSS/develop' into IQSS/7349-4_creator_…
qqmyers Jan 30, 2023
216fb8a
Merge remote-tracking branch 'IQSS/develop' into
qqmyers Jan 31, 2023
863be65
merge fixes
qqmyers Jan 31, 2023
5 changes: 5 additions & 0 deletions doc/release-notes/7349-4-schema.org-updates.md
@@ -0,0 +1,5 @@
The Schema.org metadata export and the Schema.org metadata embedded in dataset pages have been updated to improve compliance with Schema.org's schema and Google's recommendations.

New jvm-option: dataverse.personOrOrg.assumeCommaInPersonName, default is false

Backward compatibility - authors/creators now have an @type of Person or Organization, and any affiliation (affiliation for Person, parentOrganization for Organization) is now an object of @type Organization rather than a plain string.
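
For consumers of the export, the practical effect of this note is that a creator's affiliation may arrive either as a string (older exports) or as an object (after this change). A minimal sketch of a reader that tolerates both shapes follows. It assumes the JSON-LD has already been parsed into `java.util.Map` values; the class and method names are hypothetical and not part of Dataverse.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AffiliationReaderSketch {

    // Returns the affiliation name from a creator entry, accepting either the
    // older plain-string form or the newer {"@type":"Organization","name":...} form.
    // Organization creators carry the same object under "parentOrganization".
    public static String affiliationName(Map<String, Object> creator) {
        Object value = creator.containsKey("affiliation")
                ? creator.get("affiliation")
                : creator.get("parentOrganization");
        if (value instanceof String) {
            return (String) value; // legacy shape
        }
        if (value instanceof Map) {
            Object name = ((Map<?, ?>) value).get("name");
            return name instanceof String ? (String) name : null;
        }
        return null; // no affiliation present
    }

    public static void main(String[] args) {
        Map<String, Object> org = new LinkedHashMap<>();
        org.put("@type", "Organization");
        org.put("name", "Birds Inc.");

        Map<String, Object> creator = new LinkedHashMap<>();
        creator.put("@type", "Person");
        creator.put("name", "Finch, Fiona");
        creator.put("affiliation", org);

        System.out.println(affiliationName(creator));
    }
}
```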
10 changes: 10 additions & 0 deletions doc/sphinx-guides/source/admin/metadataexport.rst
@@ -57,3 +57,13 @@ Downloading Metadata via API
----------------------------

The :doc:`/api/native-api` section of the API Guide explains how end users can download the metadata formats above via API.

Exporter Configuration
----------------------

Two exporters - Schema.org JSON-LD and OpenAIRE - use an algorithm to determine whether an author or contact name belongs to a person or an organization. While the algorithm works well, there are cases in which it makes mistakes, usually inferring that an organization is a person.

The Dataverse software implements two jvm-options that can be used to tune the algorithm:

- :ref:`dataverse.personOrOrg.assumeCommaInPersonName` - boolean, default false. If true, Dataverse will assume any name without a comma must be an organization. This may be most useful for curated Dataverse instances that enforce the "family name, given name" convention.
- :ref:`dataverse.personOrOrg.orgPhraseArray` - a JsonArray of strings. Any name that contains one of the strings is assumed to be an organization. For example, adding "Project" covers names containing that word, which the algorithm does not otherwise recognize as organizations.
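
As a rough illustration of how the two options narrow the inference, the sketch below applies only the two overrides described in the list above. The built-in name heuristics are left out, and `forcedOrganization` is a hypothetical helper, not a Dataverse method.

```java
import java.util.Arrays;
import java.util.List;

public class PersonOrOrgOptionsSketch {

    // True when one of the two configured overrides would classify the name
    // as an organization, before any other heuristic is consulted.
    public static boolean forcedOrganization(String name, boolean assumeCommaInPersonName,
            List<String> orgPhrases) {
        for (String phrase : orgPhrases) {
            if (name.contains(phrase)) {
                return true; // orgPhraseArray match
            }
        }
        // With assumeCommaInPersonName=true, a name without a comma is not a person.
        return assumeCommaInPersonName && !name.contains(",");
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList("Project");
        System.out.println(forcedOrganization("Human Genome Project", false, phrases));
        System.out.println(forcedOrganization("Finch, Fiona", true, phrases));
        System.out.println(forcedOrganization("Fiona Finch", true, phrases));
    }
}
```

In the code added by this PR, the comma rule is skipped when the entry is already known to be a person, for example because it carries an identifier such as an ORCID.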
23 changes: 21 additions & 2 deletions doc/sphinx-guides/source/installation/config.rst
@@ -2016,8 +2016,6 @@ By default, download URLs to files will be included in Schema.org JSON-LD output

``./asadmin create-jvm-options '-Ddataverse.files.hide-schema-dot-org-download-urls=true'``

Please note that there are other reasons why download URLs may not be included for certain files such as if a guestbook entry is required or if the file is restricted.

For more on Schema.org JSON-LD, see the :doc:`/admin/metadataexport` section of the Admin Guide.

.. _useripaddresssourceheader:
@@ -2047,6 +2045,27 @@ This setting is useful in cases such as running your Dataverse installation behi
"HTTP_FORWARDED",
"HTTP_VIA",
"REMOTE_ADDR"

.. _dataverse.personOrOrg.assumeCommaInPersonName:

dataverse.personOrOrg.assumeCommaInPersonName
+++++++++++++++++++++++++++++++++++++++++++++

Please note that this setting is experimental.

The Schema.org metadata and OpenAIRE exports and the Schema.org metadata included in DatasetPages try to infer whether each entry in the various fields (e.g. Author, Contributor) is a Person or Organization. If you are sure that
users are following the guidance to add people in the recommended "family name, given name" order, with a comma, you can set this to true to always assume that entries without a comma are Organizations. The default is false.

.. _dataverse.personOrOrg.orgPhraseArray:

dataverse.personOrOrg.orgPhraseArray
++++++++++++++++++++++++++++++++++++

Please note that this setting is experimental.

The Schema.org metadata and OpenAIRE exports and the Schema.org metadata included in DatasetPages try to infer whether each entry in the various fields (e.g. Author, Contributor) is a Person or Organization.
If you have examples where an organization name is being inferred to belong to a person, you can use this setting to force it to be recognized as an organization.
The value is expected to be a JsonArray of strings. Any name that contains one of the strings is assumed to be an organization. For example, adding "Project" covers names containing that word, which the algorithm does not otherwise recognize as organizations.
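
To make the expected value format concrete, here is a minimal sketch of reading both settings as JVM system properties, which is how the code in this PR obtains them. The regex is a naive stand-in for a real JSON parser (the PR itself uses `JsonUtil.getJsonArray`), the class and method names are hypothetical, and "Consortium" is only an illustrative phrase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OrgPhraseConfigSketch {

    // Naive stand-in for a JSON parser: collects each double-quoted string.
    // It does not unescape sequences such as \" and ignores non-string content.
    private static final Pattern QUOTED = Pattern.compile("\"((?:[^\"\\\\]|\\\\.)*)\"");

    public static List<String> readOrgPhrases() {
        List<String> phrases = new ArrayList<>();
        String raw = System.getProperty("dataverse.personOrOrg.orgPhraseArray");
        if (raw != null && !raw.isEmpty()) {
            Matcher m = QUOTED.matcher(raw);
            while (m.find()) {
                phrases.add(m.group(1));
            }
        }
        return phrases;
    }

    public static boolean readAssumeComma() {
        // Boolean.parseBoolean is true only for "true" (case-insensitive).
        return Boolean.parseBoolean(
                System.getProperty("dataverse.personOrOrg.assumeCommaInPersonName", "false"));
    }

    public static void main(String[] args) {
        System.setProperty("dataverse.personOrOrg.orgPhraseArray", "[\"Project\", \"Consortium\"]");
        System.out.println(readOrgPhrases());
        System.out.println(readAssumeComma());
    }
}
```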


.. _dataverse.api.signature-secret:
52 changes: 36 additions & 16 deletions src/main/java/edu/harvard/iq/dataverse/DatasetVersion.java
@@ -1,6 +1,7 @@
package edu.harvard.iq.dataverse;

import edu.harvard.iq.dataverse.util.MarkupChecker;
import edu.harvard.iq.dataverse.util.PersonOrOrgUtil;
import edu.harvard.iq.dataverse.util.BundleUtil;
import edu.harvard.iq.dataverse.DatasetFieldType.FieldType;
import edu.harvard.iq.dataverse.branding.BrandingUtil;
@@ -1802,27 +1803,46 @@ public String getJsonLd() {
for (DatasetAuthor datasetAuthor : this.getDatasetAuthors()) {
JsonObjectBuilder author = Json.createObjectBuilder();
String name = datasetAuthor.getName().getDisplayValue();
String identifierAsUrl = datasetAuthor.getIdentifierAsUrl();
DatasetField authorAffiliation = datasetAuthor.getAffiliation();
String affiliation = null;
if (authorAffiliation != null) {
affiliation = datasetAuthor.getAffiliation().getDisplayValue();
}
// We are aware of "givenName" and "familyName" but instead of a person it might be an organization such as "Gallup Organization".
//author.add("@type", "Person");
author.add("name", name);
// We are aware that the following error is thrown by https://search.google.com/structured-data/testing-tool
// "The property affiliation is not recognized by Google for an object of type Thing."
// Someone at Google has said this is ok.
// This logic could be moved into the `if (authorAffiliation != null)` block above.
if (!StringUtil.isEmpty(affiliation)) {
author.add("affiliation", affiliation);
affiliation = datasetAuthor.getAffiliation().getValue();
}
String identifierAsUrl = datasetAuthor.getIdentifierAsUrl();
if (identifierAsUrl != null) {
// It would be valid to provide an array of identifiers for authors but we have decided to only provide one.
author.add("@id", identifierAsUrl);
author.add("identifier", identifierAsUrl);
JsonObject entity = PersonOrOrgUtil.getPersonOrOrganization(name, false, (identifierAsUrl!=null));
String givenName= entity.containsKey("givenName") ? entity.getString("givenName"):null;
String familyName= entity.containsKey("familyName") ? entity.getString("familyName"):null;

if (entity.getBoolean("isPerson")) {
// Person
author.add("@type", "Person");
if (givenName != null) {
author.add("givenName", givenName);
}
if (familyName != null) {
author.add("familyName", familyName);
}
if (!StringUtil.isEmpty(affiliation)) {
author.add("affiliation", Json.createObjectBuilder().add("@type", "Organization").add("name", affiliation));
}
//Currently all possible identifier URLs are for people not Organizations
if(identifierAsUrl != null) {
author.add("sameAs", identifierAsUrl);
//Legacy - not sure if these are still useful
author.add("@id", identifierAsUrl);
author.add("identifier", identifierAsUrl);

}
} else {
// Organization
author.add("@type", "Organization");
if (!StringUtil.isEmpty(affiliation)) {
author.add("parentOrganization", Json.createObjectBuilder().add("@type", "Organization").add("name", affiliation));
}
}
// Both cases
author.add("name", entity.getString("fullName"));
//And add to the array
authors.add(author);
}
JsonArray authorsArray = authors.build();
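
Because the hunk above interleaves removed and added lines, the resulting shape of each creator entry can be hard to read off. The sketch below mirrors the new branching using plain `java.util.Map` values instead of the `javax.json` builders. The class and method names are hypothetical, and the classification result (`isPerson`, name parts) is passed in rather than computed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CreatorShapeSketch {

    static Map<String, Object> organization(String name) {
        Map<String, Object> org = new LinkedHashMap<>();
        org.put("@type", "Organization");
        org.put("name", name);
        return org;
    }

    // Persons get name parts, an "affiliation" object and identifier links;
    // organizations get "parentOrganization". Both get "name".
    public static Map<String, Object> creator(boolean isPerson, String fullName, String givenName,
            String familyName, String affiliation, String identifierAsUrl) {
        Map<String, Object> author = new LinkedHashMap<>();
        boolean hasAffiliation = affiliation != null && !affiliation.isEmpty();
        if (isPerson) {
            author.put("@type", "Person");
            if (givenName != null) {
                author.put("givenName", givenName);
            }
            if (familyName != null) {
                author.put("familyName", familyName);
            }
            if (hasAffiliation) {
                author.put("affiliation", organization(affiliation));
            }
            if (identifierAsUrl != null) {
                author.put("sameAs", identifierAsUrl);
                author.put("@id", identifierAsUrl);        // legacy key
                author.put("identifier", identifierAsUrl); // legacy key
            }
        } else {
            author.put("@type", "Organization");
            if (hasAffiliation) {
                author.put("parentOrganization", organization(affiliation));
            }
        }
        author.put("name", fullName);
        return author;
    }

    public static void main(String[] args) {
        System.out.println(creator(true, "Finch, Fiona", "Fiona", "Finch", "Birds Inc.",
                "https://orcid.org/0000-0002-1825-0097"));
        System.out.println(creator(false, "Gallup Organization", null, null, null, null));
    }
}
```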
Original file line number Diff line number Diff line change
@@ -256,7 +256,10 @@ public static void writeCreatorsElement(XMLStreamWriter xmlw, DatasetVersionDTO
creator_map.put("nameType", "Personal");
nameType_check = true;
}

// ToDo - the algorithm to determine if this is a Person or Organization here
// has been abstracted into a separate
// edu.harvard.iq.dataverse.util.PersonOrOrgUtil class that could be used here
// to avoid duplication/variants of the algorithm
creatorName = Cleanup.normalize(creatorName);
// Datacite algorithm, https://github.com/IQSS/dataverse/issues/2243#issuecomment-358615313
if (creatorName.contains(",")) {
@@ -706,6 +709,11 @@ public static void writeContributorElement(XMLStreamWriter xmlw, String contribu
boolean nameType_check = false;
Map<String, String> contributor_map = new HashMap<String, String>();

// ToDo - the algorithm to determine if this is a Person or Organization here
// has been abstracted into a separate
// edu.harvard.iq.dataverse.util.PersonOrOrgUtil class that could be used here
// to avoid duplication/variants of the algorithm

contributorName = Cleanup.normalize(contributorName);
// Datacite algorithm, https://github.com/IQSS/dataverse/issues/2243#issuecomment-358615313
if (contributorName.contains(",")) {
@@ -717,6 +725,9 @@ public static void writeContributorElement(XMLStreamWriter xmlw, String contribu
// givenName ok
contributor_map.put("nameType", "Personal");
nameType_check = true;
// re: the above toDo - the ("ContactPerson".equals(contributorType) &&
// !isValidEmailAddress(contributorName)) clause in the next line could/should
// be sent as the OrgIfTied boolean parameter
} else if (isOrganization || ("ContactPerson".equals(contributorType) && !isValidEmailAddress(contributorName))) {
contributor_map.put("nameType", "Organizational");
}
155 changes: 155 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/util/PersonOrOrgUtil.java
@@ -0,0 +1,155 @@
package edu.harvard.iq.dataverse.util;

import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

import javax.json.JsonArray;
import javax.json.JsonObject;
import javax.json.JsonObjectBuilder;
import javax.json.JsonString;

import edu.harvard.iq.dataverse.export.openaire.Cleanup;
import edu.harvard.iq.dataverse.export.openaire.FirstNames;
import edu.harvard.iq.dataverse.export.openaire.Organizations;
import edu.harvard.iq.dataverse.util.json.JsonUtil;
import edu.harvard.iq.dataverse.util.json.NullSafeJsonBuilder;

/**
 *
 * @author qqmyers
 *
 * Adapted from earlier code in OpenAireExportUtil
 *
 * Implements an algorithm derived from code at DataCite to determine
 * whether a name is that of a Person or Organization and, if the
 * former, to pull out the given and family names.
 *
 * Adds parameters that can improve accuracy:
 *
 * * e.g. for curated repositories, allowing the code to assume that all
 * Person entries are in <family name>, <given name> order.
 *
 * * allow local configuration of specific words/phrases that will
 * automatically categorize one-off cases that the algorithm would
 * otherwise mis-categorize. For example, the code appears to not
 * recognize names ending in "Project" as an Organization.
 *
 */

public class PersonOrOrgUtil {

    private static final Logger logger = Logger.getLogger(PersonOrOrgUtil.class.getCanonicalName());

    static boolean assumeCommaInPersonName = false;
    static List<String> orgPhrases;

    static {
        setAssumeCommaInPersonName(Boolean.parseBoolean(System.getProperty("dataverse.personOrOrg.assumeCommaInPersonName", "false")));
        setOrgPhraseArray(System.getProperty("dataverse.personOrOrg.orgPhraseArray", null));
    }

    /**
     * This method tries to determine if a name belongs to a person or an
     * organization and, if it is a person, what the given and family names are. The
     * core algorithm is adapted from a Datacite algorithm, see
     * https://github.com/IQSS/dataverse/issues/2243#issuecomment-358615313
     *
     * @param name
     *            - the name to test
     * @param organizationIfTied
     *            - if a given name isn't found, should the name be assumed to be
     *            from an organization. This could be a generic true/false or
     *            information from some non-name aspect of the entity, e.g. which
     *            field is in use, or whether a .edu email exists, etc.
     * @param isPerson
     *            - if this is known to be a person due to other info (i.e. they
     *            have an ORCID). In this case the algorithm is just looking for
     *            given/family names.
     * @return
     */
    public static JsonObject getPersonOrOrganization(String name, boolean organizationIfTied, boolean isPerson) {
        name = Cleanup.normalize(name);

        String givenName = null;
        String familyName = null;

        boolean isOrganization = !isPerson && Organizations.getInstance().isOrganization(name);
        if (!isOrganization) {
            for (String phrase : orgPhrases) {
                if (name.contains(phrase)) {
                    isOrganization = true;
                    break;
                }
            }
        }
        if (name.contains(",")) {
            givenName = FirstNames.getInstance().getFirstName(name);
            // contributorName=<FamilyName>, <FirstName>
            if (givenName != null && !isOrganization) {
                // givenName ok
                isOrganization = false;
                // contributor_map.put("nameType", "Personal");
                if (!name.replaceFirst(",", "").contains(",")) {
                    // contributorName=<FamilyName>, <FirstName>
                    String[] fullName = name.split(", ");
                    givenName = fullName[1];
                    familyName = fullName[0];
                }
            } else if (isOrganization || organizationIfTied) {
                isOrganization = true;
                givenName = null;
            }

        } else {
            if (assumeCommaInPersonName && !isPerson) {
                isOrganization = true;
            } else {
                givenName = FirstNames.getInstance().getFirstName(name);

                if (givenName != null && !isOrganization) {
                    isOrganization = false;
                    if (givenName.length() + 1 < name.length()) {
                        familyName = name.substring(givenName.length() + 1);
                    }
                } else {
                    // default
                    if (isOrganization || organizationIfTied) {
                        isOrganization = true;
                        givenName=null;
                    }
                }
            }
        }
        JsonObjectBuilder job = new NullSafeJsonBuilder();
        job.add("fullName", name);
        job.add("givenName", givenName);
        job.add("familyName", familyName);
        job.add("isPerson", !isOrganization);
        return job.build();

    }

    // Public for testing
    public static void setOrgPhraseArray(String phraseArray) {
        orgPhrases = new ArrayList<String>();
        if (!StringUtil.isEmpty(phraseArray)) {
            try {
                JsonArray phrases = JsonUtil.getJsonArray(phraseArray);
                phrases.forEach(val -> {
                    JsonString strVal = (JsonString) val;
                    orgPhrases.add(strVal.getString());
                });
            } catch (Exception e) {
                logger.warning("Could not parse Org phrase list");
            }
        }

    }

    // Public for testing
    public static void setAssumeCommaInPersonName(boolean assume) {
        assumeCommaInPersonName = assume;
    }

}
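
The decision order in `getPersonOrOrganization` can be easier to follow with the lookups stubbed out. The sketch below reduces the method to its person/organization verdict only. `KNOWN_FIRST_NAMES`, `ORG_WORDS`, `looksLikeOrganization` and `findFirstName` are hypothetical stand-ins for the `FirstNames` and `Organizations` classes, whose real behavior is not shown in this diff, and the options are passed as parameters instead of being read from static fields.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PersonOrOrgFlowSketch {

    // Hypothetical stand-in data; the real lookups use much larger lists.
    static final Set<String> KNOWN_FIRST_NAMES = new HashSet<>(Arrays.asList("Fiona", "Charles"));
    static final List<String> ORG_WORDS = Arrays.asList("Inc.", "University", "Organization");

    static boolean looksLikeOrganization(String name) {
        for (String word : ORG_WORDS) {
            if (name.contains(word)) {
                return true;
            }
        }
        return false;
    }

    // Returns the first token (split on spaces and commas) that is a known first name.
    static String findFirstName(String name) {
        for (String token : name.split("[ ,]+")) {
            if (KNOWN_FIRST_NAMES.contains(token)) {
                return token;
            }
        }
        return null;
    }

    public static boolean isPerson(String name, boolean organizationIfTied, boolean knownPerson,
            boolean assumeCommaInPersonName, List<String> orgPhrases) {
        // Step 1: organization lookup, skipped when the entry is known to be a person.
        boolean isOrganization = !knownPerson && looksLikeOrganization(name);
        // Step 2: configured phrases can still flag the name as an organization.
        if (!isOrganization) {
            for (String phrase : orgPhrases) {
                if (name.contains(phrase)) {
                    isOrganization = true;
                    break;
                }
            }
        }
        // Step 3: the comma rule applies only to names without a comma.
        if (!name.contains(",") && assumeCommaInPersonName && !knownPerson) {
            return false;
        }
        // Step 4: a recognized given name wins unless the name was already flagged.
        if (findFirstName(name) != null && !isOrganization) {
            return true;
        }
        // Step 5: otherwise an organization only if flagged or tie-broken that way.
        return !(isOrganization || organizationIfTied);
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();
        System.out.println(isPerson("Finch, Fiona", false, false, false, none));
        System.out.println(isPerson("Human Genome Project", false, false, false, Arrays.asList("Project")));
    }
}
```

One consequence visible in step 5: with `organizationIfTied` false, a name that matches nothing at all is still treated as a person, which matches the documentation's remark that mistakes usually go in that direction.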
Original file line number Diff line number Diff line change
@@ -139,13 +139,15 @@ public void testExportDataset() throws Exception {
assertEquals("https://doi.org/10.5072/FK2/IMK5A4", json2.getString("identifier"));
assertEquals("Darwin's Finches", json2.getString("name"));
assertEquals("Finch, Fiona", json2.getJsonArray("creator").getJsonObject(0).getString("name"));
assertEquals("Birds Inc.", json2.getJsonArray("creator").getJsonObject(0).getString("affiliation"));
assertEquals("Birds Inc.", json2.getJsonArray("creator").getJsonObject(0).getJsonObject("affiliation").getString("name"));
assertEquals("https://orcid.org/0000-0002-1825-0097", json2.getJsonArray("creator").getJsonObject(0).getString("@id"));
assertEquals("https://orcid.org/0000-0002-1825-0097", json2.getJsonArray("creator").getJsonObject(0).getString("identifier"));
assertEquals("https://orcid.org/0000-0002-1825-0097", json2.getJsonArray("creator").getJsonObject(0).getString("sameAs"));
assertEquals("Finch, Fiona", json2.getJsonArray("author").getJsonObject(0).getString("name"));
assertEquals("Birds Inc.", json2.getJsonArray("author").getJsonObject(0).getString("affiliation"));
assertEquals("Birds Inc.", json2.getJsonArray("author").getJsonObject(0).getJsonObject("affiliation").getString("name"));
assertEquals("https://orcid.org/0000-0002-1825-0097", json2.getJsonArray("author").getJsonObject(0).getString("@id"));
assertEquals("https://orcid.org/0000-0002-1825-0097", json2.getJsonArray("author").getJsonObject(0).getString("identifier"));
assertEquals("https://orcid.org/0000-0002-1825-0097", json2.getJsonArray("author").getJsonObject(0).getString("sameAs"));
assertEquals("1955-11-05", json2.getString("datePublished"));
assertEquals("1955-11-05", json2.getString("dateModified"));
assertEquals("1", json2.getString("version"));
@@ -170,7 +172,7 @@ public void testExportDataset() throws Exception {
assertEquals("LibraScholar", json2.getJsonObject("includedInDataCatalog").getString("name"));
assertEquals("https://librascholar.org", json2.getJsonObject("includedInDataCatalog").getString("url"));
assertEquals("Organization", json2.getJsonObject("publisher").getString("@type"));
assertEquals("LibraScholar", json2.getJsonObject("provider").getString("name"));
assertEquals("LibraScholar", json2.getJsonObject("publisher").getString("name"));
assertEquals("Organization", json2.getJsonObject("provider").getString("@type"));
assertEquals("LibraScholar", json2.getJsonObject("provider").getString("name"));
assertEquals("Organization", json2.getJsonArray("funder").getJsonObject(0).getString("@type"));