add extra info to ingestion
pvgenuchten committed Nov 29, 2024
1 parent bf185aa commit ee2ea05
Showing 1 changed file with 17 additions and 5 deletions.
22 changes: 17 additions & 5 deletions tech/docs/technical_components/ingestion.md
@@ -95,13 +95,23 @@ A second mechanism is available to link from Cordis to OpenAire, the RCN number.

Not all DOIs registered in Cordis are available in OpenAire. OpenAire only lists resources with an open access license. Other DOIs can be fetched from the DOI registry directly or via Crossref.org. This work is still in preparation.


#### OGC-CSW

Many (spatial) catalogues, such as INSPIRE GeoPortal, Bonares and ISRIC, advertise their metadata via the [Catalogue Service for the Web](https://www.ogc.org/standard/cat/){target=_blank} standard. The [OWSLib](https://github.com/geopython/owslib) library is used to query records from CSW endpoints. A filter can be configured to retrieve subsets of the catalogue.
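
A filtered CSW query with OWSLib could look roughly like the sketch below. It is an illustration only, not the harvester's actual code: the function names `record_summary` and `harvest_csw`, the free-text filter on `csw:AnyText` and the default of 10 records are assumptions.

```python
def record_summary(record) -> dict:
    """Reduce a parsed CSW record to the few fields used in this sketch."""
    return {
        "identifier": getattr(record, "identifier", None),
        "title": getattr(record, "title", None),
        "abstract": getattr(record, "abstract", None),
    }


def harvest_csw(url: str, keyword: str, max_records: int = 10) -> list:
    """Fetch up to max_records records whose text matches keyword (hypothetical helper)."""
    # Imported here so record_summary stays usable without OWSLib installed.
    from owslib.csw import CatalogueServiceWeb
    from owslib.fes import PropertyIsLike

    csw = CatalogueServiceWeb(url)
    # The filter restricts the harvest to a subset of the catalogue.
    text_filter = PropertyIsLike("csw:AnyText", f"%{keyword}%")
    csw.getrecords2(constraints=[text_filter], maxrecords=max_records, esn="full")
    # csw.records maps record identifiers to parsed record objects.
    return [record_summary(rec) for rec in csw.records.values()]
```

Calling `harvest_csw` with the URL of a CSW endpoint and a keyword such as `"soil"` would return a list of dictionaries. A real harvest would also need paging and error handling, which are left out here.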


#### INSPIRE

Although [INSPIRE Geoportal](https://inspire-geoportal.ec.europa.eu/){target=_blank} does offer a CSW endpoint, due to a technical reason, we have not been able to harvest from it. Instead we have developed a dedicated harvester via the Elastic Search API endpoint of the Geoportal. If at some point the technical issue has been resolved, use of the CSW harvest endpoint is favourable.
Although [INSPIRE Geoportal](https://inspire-geoportal.ec.europa.eu/){target=_blank} does offer a CSW endpoint, due to technical reasons we have not been able to harvest from it. Instead we have developed a dedicated harvester that uses the Elasticsearch API endpoint of the Geoportal. If the technical issue is resolved at some point, harvesting via the CSW endpoint is preferable.

#### ESDAC

The [ESDAC catalogue](https://esdac.jrc.ec.europa.eu/){target=_blank} is an instance of Drupal CMS. The site does offers some RDFa annotations. We have developed a dedicated harvester to scrape html elements and RDFa to extract records from ESDAC.
The [ESDAC catalogue](https://esdac.jrc.ec.europa.eu/){target=_blank} is an instance of Drupal CMS. We have developed a dedicated harvester that scrapes ESDAC HTML elements to extract Dublin Core metadata. Metadata is extracted for datasets, maps (EUDASM) and documents. Occasionally a DOI is mentioned in the HTML; this DOI is then used as the identifier for the resource, otherwise the resource URL is used as the identifier. If the DOI is not yet known to the system, OpenAire is queried to capture additional metadata on the resource.
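
The identifier rule described above (DOI if present, resource URL otherwise) can be sketched as follows. This is a minimal illustration under stated assumptions: the DOI regular expression is a loose pattern chosen for this example, and the function names are hypothetical rather than taken from the harvester.

```python
import re
from typing import Optional, Set

# Loose DOI pattern, an assumption for this sketch (not an official DOI grammar).
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"'<>]+")


def find_doi(html: str) -> Optional[str]:
    """Return the first DOI-like string found in the HTML, or None."""
    match = DOI_PATTERN.search(html)
    # Strip punctuation that commonly trails a DOI in running text.
    return match.group(0).rstrip(".,;)") if match else None


def choose_identifier(html: str, resource_url: str) -> str:
    """Prefer a DOI mentioned in the page; fall back to the resource URL."""
    return find_doi(html) or resource_url


def needs_openaire_lookup(identifier: str, known_identifiers: Set[str]) -> bool:
    """True when the identifier is a DOI that the system has not seen yet."""
    is_doi = DOI_PATTERN.fullmatch(identifier) is not None
    return is_doi and identifier not in known_identifiers
```

Only identifiers that are DOIs and are not yet stored would trigger a query to OpenAire; resources identified by URL are kept with the scraped metadata alone.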

#### Impact4Soil

Impact4Soil is built on a Strapi.io headless CMS. The CMS provides an API to retrieve datasets and scientific articles. The API provides minimal metadata, but fortunately in most cases a DOI is included. The DOI is used to capture additional metadata from OpenAire.

### Metadata Harmonization

@@ -118,10 +128,14 @@ Table below indicates the various source models supported

Metadata is harmonised to a [DCAT](https://www.w3.org/TR/vocab-dcat-3/){target=_blank} RDF representation.

For metadata harmonization some supporting modules are used, [owslib](https://owslib.readthedocs.io/en/latest/){target=_blank} is a module to parse various source metadata models, including iso19115:2005. [pygeometa](https://github.com/geopython/pygeometa){target=_blank} is a module which can export owslib parsed metadata to various outputs, including DCAT.
For metadata harmonization some supporting modules are used. [owslib](https://owslib.readthedocs.io/en/latest/){target=_blank} is a module to parse various source metadata models, including iso19139:2007. A transformation script from [semic-eu/iso19139-to-dcat-ap.xslt](https://github.com/semic-eu/iso19139-to-dcat-ap/) in combination with lxml and rdflib is used to convert iso19139:2007 metadata to RDF, serialised as turtle.

Harmonised metadata is transformed to either iso19139:2007 or Dublin Core and then ingested by the pycsw software, used to power the [SoilWise Catalogue](catalogue.md), using an automated process running at intervals. At this moment the pycsw catalogue software requires a dedicated database structure. This step converts the harmonised metadata database to that model. In future iterations we aim to remove this step and enable the catalogue to query the harmonised model directly.

#### Metadata Augmentation

The metadata augmentation processes are described [elsewhere](metadata_augmentation.md). What is relevant here is that the output of these processes is integrated in the harmonised metadata database.

### Metadata RDF turtle serialization

The harmonised metadata model is based on the DCAT ontology. In this step the content of the database is written to RDF.
@@ -138,8 +152,6 @@ A resource can be described in multiple Catalogues, identified by a common identifier

In the first development iteration, a visualization of source repositories is available as a dedicated section in the [SoilWise Catalogue](catalogue.md).

![Sources section](../_assets/images/sources-section-catalogue.png)

## Technology

### Git actions/pipelines to run harvest tasks
