Update ESS-DIVE suite #423

Closed
gothub opened this issue Oct 18, 2021 · 10 comments

@gothub
Contributor

gothub commented Oct 18, 2021

Description

Update the ESS-DIVE assessment suite based on the checks described here.

The new version number will be 1.1.0
Note that some of these checks are new and some are currently part of the FAIR suite.

List of checks for version 1.1.0:

| Description | Check Id / Issue | Category | Level | Implemented? | Comments |
| --- | --- | --- | --- | --- | --- |
| Project name (title) controlled | resource.projectTitle.controlled | Findable | Optional | Y | new for suite v1.1.0 |
| Any links included in metadata resolve | resource.URLs.resolvable | Accessible | Required | Y | new for suite v1.1.0 |
| Dataset Landing Page Resolves | resource.landingPage.present | Accessible | Required | Not for ESS-DIVE Suite 1.1.0 | suite same as #188? |
| Existing DOI and Alternate Identifier | Existing DOI and Alternate Identifier, metadata.identifier.resolvable | Accessible | Required | Not for ESS-DIVE Suite 1.1.0 | The DOI portion of this check is already implemented as #60 |
| Project funder is controlled | resource.awardFunderName.controlled | Findable | Optional | Y | new for suite v1.1.0 |
| File formats are not proprietary | entity.type.nonproprietary | Interoperable | Optional | Y | new for suite v1.1.0 |
| File present | File present, check.entity.present | Interoperable | INFO | Y | new for suite v1.1.0 |
| DOI and alternate identifier | metadata.alternateIdentifier.resolvable | Findable | Optional | Not for ESS-DIVE Suite 1.1.0 | |
| Metadata identifier is resolvable | metadata.identifier.resolvable | Accessible | Optional | Y | new for suite v1.1.0 |
| | check.creator.present | Findable | REQUIRED | Y | |
| | check.creator.info | Findable | REQUIRED | Y | |
| | contact.has.ORCID | Findable | REQUIRED | Y | |
| | check.contact.info | Findable | REQUIRED | Y | |
| | check.abstract.100.words | Findable | REQUIRED | Y | |
| | dataset.keywords.minimum | Findable | REQUIRED | Y | |
| | dataset.keywords.overlap | Findable | OPTIONAL | Y | |
| | dataset.title.length2 | Findable | REQUIRED | Y | |
| | check.identifier.is.present | Findable | REQUIRED | Y | |
| | check.usage.is.cc | Reusable | OPTIONAL | Y | |
| | check.temporal.coverage | Findable | OPTIONAL | Y | |
| | check.geographic.description | Findable | REQUIRED | Y | |
| | check.bounding.coordinates | Findable | REQUIRED | Y | |
| | check.pub.date | Findable | REQUIRED | Y | |
| | check.methods.present | Interoperable | REQUIRED | Y | |

gothub added the ESS-DIVE and metadig labels Oct 18, 2021
gothub self-assigned this Oct 18, 2021
gothub added and removed the Epic label Oct 26, 2021
gothub added this to the v0.4.0 milestone Oct 26, 2021
@gothub
Contributor Author

gothub commented Nov 2, 2021

@vchendrix has requested that the new suite be version 1.1.0

@gothub
Contributor Author

gothub commented Nov 30, 2021

Discussion topics for the ESS-DIVE Suite update

ESS-DIVE Suite

  • Can any of these checks that are denoted as new, ESS-DIVE specific checks
    use a modified version of an existing check?

  • The ESS-DIVE suite has the check categories ('types') of

    • Identification
    • Discovery
    • Interpretation
  • Do you want to start using the FAIR categories? They are:

    • Findable
    • Accessible
    • Interoperable
    • Reusable
  • Potential categories for the checks are listed on the spreadsheet at https://docs.google.com/spreadsheets/d/14v3hjPL9jDSgfSF6RCyDgwEKZIK4xwAzJRCb0zjTsfg/edit#gid=1908621675

  • if using FAIR categories, then all other ESS-DIVE checks need to be re-categorized

Checks - the latest update to this section is from the 2022-01-05 meeting with Joan, Emily and Peter

  • resource.projectTitle.controlled (Project name accuracy)
  • resource.URLs.resolvable (any links included in metadata resolve)
    • URLs will be extracted from these EML elements: abstract, location description, methods, and related references
      • the check will look at robots.txt to determine website policy
      • check if many URLs go to the same sites
      • the check will use a "backoff" algorithm:
        • if many unique URLs are found, limit the number of HTTP requests per unit time
      • todo: essdive provides valid domain list
        • if provided, only URLs from this list will be checked
        • this list may be difficult to determine, but will be provided and incorporated into check if feasible
      • the check output will indicate the total number of URLs found and the number that were unresolvable
  • resource.landingPage.present (dataset landing page is specified and resolves)
    • the EML element checked: /eml/dataset/distribution/online/url[@function="information"]
    • essdive: not ready for this check yet, but possibly incorporate into suite at a later date
  • alternate.identifier.resolvable (existing DOI and alternate identifier); see issue #427, "Quality Report Check: Existing DOI and Alternate Identifier"
    • essdive: if series id is a DOI, then see if it resolves
    • essdive: check: if alt id is a DOI, then see if it resolves, check is optional
    • essdive: currently this is a manual check to verify that the alternate id resolves and that the referenced web page is relevant to the dataset described in the metadata.
      • todo: Joan will check if this can be automated into a task that the check can perform
  • resource.funderName.controlled
    • essdive: this check will not be included in the 1.1.0 suite, but may be added at a later date
    • ESS-DIVE is not currently using the EML 2.2 award fields
    • /eml/dataset/project/funding is available
      • for example: "DOE:AC05-00OR22725"
  • check.entity.present (File Present)
  • entity.format.nonproprietary
    • essdive: add this check to the suite, but other formats may be added to the known formats list
    • the current msg is: "These 10 proprietary data entity formats (out of 100 total formats) were found: , , ..."
    • the desired msg is " ESS-DIVE recommends the use of non-proprietary file formats where possible. Review the [name file types included] file types included in your dataset and consider changing them to non-proprietary formats."
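
The request throttling discussed above for resource.URLs.resolvable could be sketched roughly as follows. This is a minimal illustration, not the metadig implementation: the names `plan_requests`, `crawl_delay`, `min_interval`, and `robots_by_host` are hypothetical, and the one-second default interval is an assumption. The sketch spaces requests per host (a simple form of rate limiting rather than a full backoff algorithm) and honors a `Crawl-delay` value from robots.txt when one is supplied.

```python
from collections import defaultdict
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def crawl_delay(robots_txt, user_agent="*"):
    """Return the Crawl-delay declared in a robots.txt body, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


def plan_requests(urls, min_interval=1.0, robots_by_host=None):
    """Assign each unique URL a start offset in seconds, spaced per host.

    min_interval and robots_by_host (host -> robots.txt text) are
    illustrative parameters, not part of any existing check.
    """
    robots_by_host = robots_by_host or {}
    next_slot = defaultdict(float)  # host -> earliest offset for its next request
    plan = []
    for url in dict.fromkeys(urls):  # drop duplicates, keep first-seen order
        host = urlsplit(url).netloc.lower()
        delay = min_interval
        if host in robots_by_host:
            declared = crawl_delay(robots_by_host[host])
            if declared is not None:
                delay = max(delay, float(declared))
        plan.append((url, next_slot[host]))
        next_slot[host] += delay
    return plan
```

URLs on different hosts can start immediately, while repeated hits on one host are pushed apart, which addresses the "many URLs go to the same sites" case.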

@vchendrix
Contributor

gothub added a commit that referenced this issue Jan 4, 2022
@JEDamerow
Contributor

@gothub For the resource.URLs.resolvable check - can we add that the output includes up to 3 specific urls that were not resolvable?

For resource.projectTitle.controlled - when there is no exact match for the entered project name, add a suggested method for users to find the project name (Emily to send text for this)

@gothub
Contributor Author

gothub commented Jan 6, 2022

@JEDamerow yes - I'll update resource.URLs.resolvable and resource.projectTitle.controlled as you described above.
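
As an illustration of the requested output (total URLs found, number unresolvable, and up to 3 specific unresolvable URLs), a message builder might look like the sketch below. The function name `summarize_url_check` and the message wording are assumptions, not the text used by the actual check.

```python
def summarize_url_check(results, max_listed=3):
    """Build a check message from a mapping of URL -> resolved (bool).

    The wording is hypothetical; only the counts and the cap of
    max_listed example URLs come from the discussion in this issue.
    """
    failed = [url for url, ok in results.items() if not ok]
    message = (
        f"{len(results)} URLs were found in the metadata; "
        f"{len(failed)} could not be resolved."
    )
    if failed:
        shown = ", ".join(failed[:max_listed])
        message += f" Unresolvable URLs (up to {max_listed} shown): {shown}"
    return message
```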

gothub added a commit that referenced this issue Jan 6, 2022
Issue #423 Remove 'resource.landingPage.present' check from the ESS-DIVE 1.1.0 suite
@JEDamerow
Contributor

@gothub we decided to include a funder check to make sure that at least one specific funding source is listed: "U.S. DOE > Office of Science > Biological and Environmental Research (BER)". Adding a note in that issue.

gothub added a commit that referenced this issue Jan 21, 2022
Issue #423 Update the ESS-DIVE 1.1.0 suite file, so that check '<level>' settings match the 1.0 suite.
gothub added a commit that referenced this issue Jan 21, 2022
gothub added a commit that referenced this issue Jan 25, 2022
gothub added a commit that referenced this issue Feb 2, 2022
@JEDamerow
Contributor

We did a spot check today of some of the assessment reports, and found a few issues to fix, or discuss.

Notes on spot check issues encountered, with screenshots.
List of datasets that we checked, with notes on any specific issues

Summary of Issues:

Proprietary File check
Showing up correctly, but may want to add a link to documentation describing what to do for non-proprietary formats (FOR ESS-DIVE TEAM TO DO)
EXCEL FILES - Common use case will be for excel files, and we have provided guidance that if you have/need to include an excel file, also include a csv version
PROPOSED SOLUTION: Have a second follow up check for excel files - If there is an excel file, check for matching csv. OR have a message for all excel files “If you have not already, please also upload a csv version of your excel file”

Project Name check
Change to an optional check to show up under warnings, and remove “Warning:” from the beginning of response

Private Datasets - “An Identifier”, and not running checks
Identifier issues - “An identifier” is not present, “metadata identifier” not resolving
Fail message: “An identifier is not present.” What does this mean?
One private dataset (ess-dive-f718f1571e4ac29-20220108T005614833) that did not run checks, and had 21 informational checks with “dialect for the check is not supported”

Private dataset - Metadata identifier
https://data.ess-dive.lbl.gov/view/ess-dive-ecc15a4f31c86ef-20211221T212550778
Failed check response: “The metadata identifier ess-dive-ecc15a4f31c86ef-20211221T212550778’ was found), but is not resolvable”
This is an ESS-DIVE issue right? Move to optional?
Remove the “)” after found

URL in metadata
ISSUE: http://localhost:3000/view/doi:10.15485/1631979 … URL in metadata did not resolve correctly, because there was a comma (from sentence in abstract)
SOLUTION: Need to parse that out. Actually does resolve when you remove the comma from the end

Funding Organization
Failed check message: “The funding organization BER was not found in the metadata”
Change to optional, listed in warnings

@gothub
Contributor Author

gothub commented Feb 10, 2022

@JEDamerow @emilyarobles thanks for the review.

Regarding your comments:

Proprietary File check
The proposed changes can be made, but we will have to create a new check that is unique to the ESS-DIVE suite instead of using the same check that is currently used by the FAIR suite.
The second check can be added (if excel file, check for matching csv), as an additional check.

TODO: Need to determine a name for this new check
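
A rough sketch of the proposed follow-up check (if there is an Excel file, look for a matching CSV) is shown below. The name `excel_without_csv` is a placeholder, since the check name is still to be decided, and the matching rule (same base name, compared case-insensitively) is an assumption; a workbook exported as several CSV files would need a looser rule.

```python
from pathlib import PurePosixPath

EXCEL_SUFFIXES = {".xls", ".xlsx"}


def excel_without_csv(filenames):
    """Return Excel file names that have no CSV with the same base name."""
    # Assumed matching rule: identical base name, ignoring case.
    csv_stems = {
        PurePosixPath(name).stem.lower()
        for name in filenames
        if PurePosixPath(name).suffix.lower() == ".csv"
    }
    return [
        name
        for name in filenames
        if PurePosixPath(name).suffix.lower() in EXCEL_SUFFIXES
        and PurePosixPath(name).stem.lower() not in csv_stems
    ]
```

An empty result would mean every Excel file in the dataset has a CSV counterpart under this naming assumption.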

Project Name check
This check will be changed to "OPTIONAL"

Private Datasets - “An Identifier”, and not running checks

Identifier issues - “An identifier” is not present, “metadata identifier” not resolving
Fail message: “An identifier is not present.” What does this mean?

This check simply looks for '/eml/@packageId' which is the identifier associated with the metadata document. Which dataset had this error?
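
For reference, reading '/eml/@packageId' amounts to fetching one attribute from the document's root element. The sketch below uses Python's standard library and is only an illustration; the function name `package_id` is hypothetical and this is not the check's actual code.

```python
import xml.etree.ElementTree as ET


def package_id(eml_text):
    """Return the packageId attribute of the EML root element, or None."""
    root = ET.fromstring(eml_text)
    value = root.get("packageId")  # attribute sits on the root <eml:eml> element
    return value.strip() if value and value.strip() else None
```

A `None` result corresponds to the "An identifier is not present" failure.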

One private dataset (​​ess-dive-f718f1571e4ac29-20220108T005614833) that did not run checks, and had 21 informational checks with “dialect for the check is not supported”

This was due to a processing error that has been resolved, and shouldn't happen again.

Private dataset - Metadata identifier

https://data.ess-dive.lbl.gov/view/ess-dive-ecc15a4f31c86ef-20211221T212550778
Failed check response: “The metadata identifier ess-dive-ecc15a4f31c86ef-20211221T212550778’ was found), but is not resolvable”
This is an ESS-DIVE issue right? Move to optional?

The metadig engine has privilege to read private datasets, but the checks themselves run unprivileged, so the HTTP HEAD request that the check sends to test the URL will not succeed for private datasets. This is really an NCEAS problem, and I will log an issue for it. It may be helpful to return a more useful message when an HTTP 401 (Unauthorized) response is received. The check would have to be updated to detect this.
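
One possible shape for that update is sketched below: map the status code of the HEAD response to a message, with a separate branch for 401. The function name `describe_head_status` and the message wording are assumptions, as is treating any 2xx or 3xx status as resolvable.

```python
from http import HTTPStatus


def describe_head_status(identifier, status):
    """Turn the status code of an HTTP HEAD request into a check message."""
    # Assumption: 2xx and 3xx responses both count as resolvable.
    if 200 <= status < 400:
        return f"The metadata identifier '{identifier}' was found and is resolvable."
    if status == HTTPStatus.UNAUTHORIZED:
        return (
            f"The metadata identifier '{identifier}' was found, but resolving it "
            f"requires authorization (HTTP 401); the dataset may be private."
        )
    return (
        f"The metadata identifier '{identifier}' was found, "
        f"but is not resolvable (HTTP {int(status)})."
    )
```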

Remove the “)” after found

will do, thx.

URL in metadata

ISSUE: http://localhost:3000/view/doi:10.15485/1631979 … URL in metadata did not resolve correctly, because there was a comma (from sentence in abstract)
SOLUTION: Need to parse that out. Actually does resolve when you remove the comma from the end

yes, that is indeed the problem, the check will be fixed.
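
A minimal sketch of one way to handle this when extracting URLs from free text is to strip sentence punctuation from the end of each match. The regular expression and the set of stripped characters are assumptions, not the check's actual parsing rules.

```python
import re

# Assumed pattern: an http(s) URL runs until whitespace, a quote, or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")
TRAILING_PUNCTUATION = ".,;:!?)]}"


def extract_urls(text):
    """Find http(s) URLs in free text, dropping sentence punctuation at the end."""
    urls = []
    for match in URL_PATTERN.finditer(text):
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)
        if url not in urls:  # keep each URL once, in order of appearance
            urls.append(url)
    return urls
```

Stripping a closing parenthesis is a trade-off: it fixes URLs quoted inside parentheses but would truncate the rare URL that legitimately ends with ")".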

Funding Organization

Failed check message: “The funding organization BER was not found in the metadata”
Change to optional, listed in warnings.

OK, the check will be updated.

Please let me know if we need to discuss these further, or if any of your questions haven't been answered sufficiently.

gothub added a commit that referenced this issue Feb 14, 2022
Issue #423 Changed these checks to optional:
- resource.projectTitle.controlled
- resource.awardFunderName.controlled
- metadata.identifier.resolvable
- entity.type.nonproprietary
gothub added a commit that referenced this issue Feb 15, 2022
@gothub
Contributor Author

gothub commented Feb 16, 2022

@Val @JEDamerow issues reported with the new checks have been resolved and the ESS-DIVE 1.1.0 suite has been run for all current metadata.

However, for at least one metadata document, there is an issue with the mediaType (formatId) associated with the EML entity. For example, for 'https://data.ess-dive.lbl.gov/view/ess-dive-fe751bc7ce7851b-20210930T015111122080':

    <otherEntity id="ess-dive-b4c4a1e743e0add-20190514T142445934">
      <entityName>PLOT_19_DPH_Response_Variables_20150814.csv</entityName>
      <entityType>application/vnd.ms-excel</entityType>
    </otherEntity>

The sysmeta for this entry has the formatId as:

    <formatId>application/vnd.ms-excel</formatId>

I downloaded the file, and it is definitely a CSV, so both the metadata and sysmeta should identify this as text/csv.

There may be other metadata/sysmeta that have this same problem.

The check works as it is supposed to, as it uses the mediaType recorded in the metadata.

I'll continue to look for other pids that have this problem.
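
One way to look for other candidates is to compare each entity's declared type with the type suggested by its file name. The sketch below is a heuristic under stated assumptions: the `EXPECTED_TYPES` table and the function name `mismatched_entities` are hypothetical, and a file extension is only a hint, so flagged files would still need to be inspected.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# Assumed extension -> media type table; extend as needed.
EXPECTED_TYPES = {".csv": "text/csv", ".txt": "text/plain"}


def mismatched_entities(eml_text):
    """List (entityName, declared type, expected type) for suspicious entities."""
    root = ET.fromstring(eml_text)
    mismatches = []
    for entity in root.iter("otherEntity"):
        name = (entity.findtext("entityName") or "").strip()
        declared = (entity.findtext("entityType") or "").strip()
        expected = EXPECTED_TYPES.get(PurePosixPath(name).suffix.lower())
        if expected and declared != expected:
            mismatches.append((name, declared, expected))
    return mismatches
```

Applied to the example above, this would flag PLOT_19_DPH_Response_Variables_20150814.csv as declared application/vnd.ms-excel but expected text/csv.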

@gothub
Contributor Author

gothub commented Feb 24, 2022

The ESS-DIVE 1.1.0 suite is now running on production k8s and the release is available here.

@gothub gothub closed this as completed Feb 24, 2022