Add harvesting of schema.org #16

Open
steingod opened this issue Sep 27, 2022 · 3 comments

@steingod
Owner

Add support for harvesting schema.org records. We need to decide whether this should be a separate function or be integrated into the ordinary harvester. The harvesting is done quite differently in the two cases, so a separate harvester may be the better choice.

@steingod steingod self-assigned this Sep 27, 2022
@steingod
Owner Author

Harvesting from NSF ADC works; I am working on adaptations for GEM. Those are the two working examples for the time being.

@ferrighi
Collaborator

ferrighi commented Feb 18, 2025

Some parsing issues to fix:

  • parse only records of type "Dataset"
  • handling of identifier field (e.g. type url or PropertyValue (prefixed or not))
  • handle missing title
  • description can be text or dict
  • temporalCoverage can be of different formats and with different separators (slash or dash)
  • datePublished can be of different formats, sometimes only a year
  • keywords: can be list of strings, string (comma or colon separated), list of dict (of DefinedTerm), check for empty list
  • variableMeasured similar to keywords
  • spatialCoverage: should use GeoCoordinates or a GeoShape box (S W N E, space separated), but (W,S E,N) should also be supported
  • add module vocab to validate license
  • add conditionsOfAccess where "unrestricted" can be mapped to "Open" (access_constraint)
  • distribution, encodingFormat is not mandatory
  • add some default MMD values and follow MMD sequence
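
A minimal sketch of how a few of the items above (the "Dataset" filter, the keyword shapes and the two box orderings) could be handled. All function names are hypothetical and are not taken from mdharvest; the separator handling simply follows the list above.

```python
import re


def is_dataset(record):
    """Return True if the JSON-LD @type is, or includes, "Dataset"."""
    types = record.get("@type", [])
    if isinstance(types, str):
        types = [types]
    return "Dataset" in types


def normalize_keywords(value):
    """Flatten keywords (or variableMeasured) into a list of strings.

    Accepts a comma- or colon-separated string, a list of strings,
    a list of DefinedTerm-like dicts (using "name"), or an empty value.
    """
    if not value:
        return []
    if isinstance(value, str):
        return [p.strip() for p in re.split(r"[,:]", value) if p.strip()]
    if isinstance(value, dict):
        value = [value]
    keywords = []
    for item in value:
        if isinstance(item, dict):
            item = item.get("name")
        if isinstance(item, str) and item.strip():
            keywords.append(item.strip())
    return keywords


def parse_box(box):
    """Parse a GeoShape box string into (south, west, north, east).

    "S W N E" (space separated) is the expected form; "W,S E,N" is
    accepted as a fallback when the string contains commas.
    """
    if "," in box:
        pairs = [p.split(",") for p in box.split()]
        if len(pairs) != 2 or any(len(p) != 2 for p in pairs):
            raise ValueError(f"unrecognised box: {box!r}")
        (west, south), (east, north) = pairs
    else:
        parts = box.split()
        if len(parts) != 4:
            raise ValueError(f"unrecognised box: {box!r}")
        south, west, north, east = parts
    return tuple(float(v) for v in (south, west, north, east))
```

Note that splitting on colons would also break up keywords that contain a colon themselves, so a real implementation may want to restrict this per data centre.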

Some technical issues:

  • avoid using requests-html for now (it is slow and it installs pyppeteer by default); use requests/json instead. Lazy loading will be handled in a future version
  • add the possibility to harvest only new records
  • add possibility to harvest sitemapindex and sitemap (PANGAEA)
  • add possibility to harvest html documents directly (GEM)
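
As an illustration of reading schema.org records from static HTML without requests-html, here is a sketch that uses only the standard library to pull `application/ld+json` blocks out of a page. The class and function names are hypothetical; in practice the page text would come from something like `requests.get(url).text`, and pages that inject JSON-LD via JavaScript (lazy loading) are not covered.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buffer))


def extract_jsonld(html_text):
    """Return the JSON-LD objects embedded in an HTML document."""
    parser = JsonLdExtractor()
    parser.feed(html_text)
    parser.close()
    records = []
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of aborting the harvest
        records.extend(data if isinstance(data, list) else [data])
    return records
```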

@ferrighi
Collaborator

I've made a branch https://github.com/steingod/mdharvest/tree/issue16 with the current modifications. It is working for:

To harvest only the records of the last X days, the "-l X" option can be used. This only works if lastmod is provided.
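
For illustration, filtering on lastmod could look roughly like the sketch below. It is not the code in the branch: the function names are hypothetical, and skipping entries without a usable lastmod is an assumption made here.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_lastmod(text):
    """Parse a lastmod value into an aware UTC datetime, or None."""
    try:
        dt = datetime.fromisoformat(text.strip().replace("Z", "+00:00"))
    except ValueError:
        return None  # e.g. year-only values are not handled in this sketch
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt


def recent_urls(sitemap_xml, days, now=None):
    """Return the <loc> values modified within the last `days` days.

    Works for both <urlset> and <sitemapindex> documents. Entries
    without a parsable lastmod are skipped (assumption).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    root = ET.fromstring(sitemap_xml)
    entries = root.findall("sm:url", NS) + root.findall("sm:sitemap", NS)
    urls = []
    for entry in entries:
        loc = entry.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = entry.findtext("sm:lastmod", namespaces=NS)
        modified = parse_lastmod(lastmod) if lastmod else None
        if loc and modified is not None and modified >= cutoff:
            urls.append(loc)
    return urls
```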

I will hold off on opening a PR, as I want to add a single function that checks datetimes more consistently for datePublished, dateModified and temporalCoverage. The branch can be used for testing in the meantime.
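
One possible shape for such a shared datetime helper is sketched below. The names, the list of accepted formats and the `YYYY-MM-DDTHH:MM:SSZ` output form are all assumptions for illustration, not the planned implementation.

```python
import re
from datetime import datetime, timezone

# Accepted input layouts, from most to least specific (assumed list).
_FORMATS = (
    "%Y-%m-%dT%H:%M:%S%z",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d",
    "%Y-%m",
    "%Y",
)


def normalize_datetime(value):
    """Return value as a UTC "YYYY-MM-DDTHH:MM:SSZ" string, or None."""
    text = str(value).strip()
    if text.endswith("Z"):
        text = text[:-1] + "+0000"
    for fmt in _FORMATS:
        try:
            dt = datetime.strptime(text, fmt)
        except ValueError:
            continue
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)  # naive values assumed UTC
        return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return None


def split_temporal_coverage(value):
    """Split temporalCoverage into (start, end); end may be None.

    Handles "start/end", "start - end" and "YYYY-YYYY". An empty or
    ".." end is treated as open-ended.
    """
    text = str(value).strip()
    if "/" in text:
        start, _, end = text.partition("/")
    elif " - " in text:
        start, _, end = text.partition(" - ")
    else:
        match = re.fullmatch(r"(\d{4})-(\d{4})", text)
        start, end = match.groups() if match else (text, "")
    end = end.strip()
    end_value = normalize_datetime(end) if end not in ("", "..") else None
    return normalize_datetime(start), end_value
```

A year-only datePublished such as "2019" comes out as the first instant of that year, which may or may not be the desired convention.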

@ferrighi ferrighi mentioned this issue Feb 21, 2025