config.yaml info in opendataharvest doc
srappel authored Jun 3, 2024
1 parent 54fd885 commit 130ed8a
Showing 1 changed file with 87 additions and 0 deletions.
87 changes: 87 additions & 0 deletions docs/utils/opendataharvest.md
@@ -58,3 +58,90 @@ parent: GeoDiscovery Utilities
Landing Page:
DCAT: landingPage
OGM Aardvark: dct_isPartOf_sm

## The [config.yaml](https://github.com/UWM-Libraries/GeoDiscovery-Utils/blob/main/opendataharvest/config.yaml) file

YAML is a human-readable data serialization language
that is commonly used for configuration files.
When a Python script loads a YAML file, the contents become a dictionary object,
which gives the script easy access to the values.
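
As a rough illustration of that point, the sketch below parses a YAML file with PyYAML's `yaml.safe_load` and reads one value from the resulting dictionary. The function names `load_config` and `get_setting` are hypothetical and are not taken from the opendataharvest scripts; the sample dictionary is a hand-written stand-in that mimics a few keys of the `CONFIG` block.

```python
from pathlib import Path


def load_config(path):
    """Parse a YAML file into nested Python dicts and lists."""
    # PyYAML is a third-party package; it is imported here so that the
    # helper below can be used on an already-parsed dictionary without it.
    import yaml

    return yaml.safe_load(Path(path).read_text(encoding="utf-8"))


def get_setting(config, key, fallback=None):
    """Return one value from the top-level CONFIG mapping, or a fallback."""
    return (config.get("CONFIG") or {}).get(key, fallback)


# A hand-written stand-in for what load_config("config.yaml") might return.
sample = {"CONFIG": {"CATALOG": "DCAT_Sites", "MAXRETRY": 3}}
catalog = get_setting(sample, "CATALOG")
missing = get_setting(sample, "NOT_A_KEY", fallback="n/a")
```

Using `safe_load` rather than `load` keeps the parser from constructing arbitrary Python objects, which is the usual choice for configuration files.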

The opendataharvest tool reads three things from this file: its configuration parameters (e.g. where to store output and logs),
default values for fields,
and manifests of the open data sites to harvest from.

You can see the configuration options at the top:

```yaml
CONFIG:
  CATALOG: "DCAT_Sites" # TestSites, DCAT_Sites, or CKAN_Sites
  OUTPUTDIR: "opendataharvest/output_md"
  LOGDIR: "opendataharvest/log"
  DEFAULTBBOX: "opendataharvest/default_bbox.csv"
  MAXRETRY: 3
  SLEEPTIME: 2
  SCHEMA: "https://raw.githubusercontent.com/UWM-Libraries/GeoDiscovery/main/schema/geoblacklight-schema-aardvark.json"
```
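
The documentation does not show how `MAXRETRY` and `SLEEPTIME` are consumed. One plausible reading, sketched here purely as an assumption, is a retry loop that attempts a request up to `MAXRETRY` times and pauses `SLEEPTIME` seconds after each failure. `fetch_with_retry` and its `fetch` argument are hypothetical names; the actual harvest script may behave differently.

```python
import time


def fetch_with_retry(fetch, url, max_retry=3, sleep_time=2, sleep=time.sleep):
    """Call fetch(url) up to max_retry times, pausing between failed attempts.

    Assumption: this is one way MAXRETRY and SLEEPTIME could be used.
    """
    last_error = None
    for attempt in range(1, max_retry + 1):
        try:
            return fetch(url)
        except OSError as error:
            # urllib.error.URLError and requests.RequestException both
            # derive from OSError, so this covers common network failures.
            last_error = error
            if attempt < max_retry:
                sleep(sleep_time)
    raise RuntimeError(f"gave up on {url} after {max_retry} attempts") from last_error
```

The `sleep` parameter exists only so the pause can be replaced in tests; in normal use the defaults would come from `config["CONFIG"]["MAXRETRY"]` and `config["CONFIG"]["SLEEPTIME"]`.
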
Next, the default values are set in the "Localization" section:
```yaml
DEFAULT:
  MEMBEROF:
    - "AGSLOpenDataHarvest"
  RESOURCECLASS:
    - "Datasets"
  ACCESSRIGHTS: "public"
  MDVERSION: "Aardvark"
  LANG:
    - "English"
  PROVIDER: "American Geographical Society Library – UWM Libraries"
  SUPPRESSED: false
  RIGHTS:
    - Although this data is being distributed by the American Geographical Society Library at the University of Wisconsin-Milwaukee Libraries, no warranty expressed or implied is made by the University as to the accuracy of the data and related materials. The act of distribution shall not constitute any such warranty, and no responsibility is assumed by the University in the use of this data, or related materials.
  RESOURCETYPE:
    - "Digital maps"
  FORMAT: None
  DESCRIPTION: This dataset was automatically cataloged from the creator's Open Data Portal. In some cases, publication year and bounding coordinates shown here may be incorrect. Additional download formats may be available on the author's website. Please check the 'More details at' link for additional information.
```
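
To illustrate how defaults like these might be applied, the sketch below fills in any field that a harvested record lacks, or leaves empty, from a defaults mapping, while keeping harvested values untouched. The function `apply_defaults` is hypothetical, and the assumption that harvested records share the same key names as the `DEFAULT` block is made only to keep the example short.

```python
import copy


def apply_defaults(record, defaults):
    """Return a copy of record with missing or empty fields filled from defaults."""
    merged = copy.deepcopy(record)
    for key, value in defaults.items():
        # Treat absent keys, None, empty strings and empty lists as "missing".
        # A harvested False (e.g. SUPPRESSED) is a real value and is kept.
        if merged.get(key) in (None, "", []):
            merged[key] = copy.deepcopy(value)
    return merged


defaults = {"ACCESSRIGHTS": "public", "RESOURCECLASS": ["Datasets"], "LANG": ["English"]}
harvested = {"RESOURCECLASS": ["Maps"], "LANG": []}
result = apply_defaults(harvested, defaults)
```

Note that YAML has no `None` keyword, so a parser such as PyYAML reads `FORMAT: None` as the string `"None"`; code that consumes this default would need to handle that string explicitly if an empty value is intended.
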
Following a small section of test sites, the rest of the file holds nested records for each of the hubs or portals we harvest from.
Here is an example record for the Wisconsin Department of Health Services data portal, which is DCAT-compliant:
```yaml
DHS_OpenData:
  CreatedBy: "Wisconsin Department of Health Services"
  SiteURL: "https://data.dhsgis.wi.gov/data.json"
  SiteName: "DHS"
  Spatial: ["Wisconsin", "United States"]
  DefaultBbox: "Wisconsin"
  MapList: ""
  AppList:
    - UUID: "e1ca38bf16f54fb8ac879b386dbce422" # Flood Risk Map
    - UUID: "861fc902539e436ebef7a86a10e9337b" # Immunization Map
    - UUID: "43ed2d88cf1348608230572166d76697" # Radon Map
  SkipList:
    - UUID: "ca921d70bdd84ae8bc84cd09abd822d7" # link to census geography website
    - UUID: "00883495714c42a9be53b76b24300c8e" # GIS data disclaimer
    - UUID: "200036084844418bb3119d963cd7d98c" # OSDP Help?
    - UUID: "29c62b7a834944ef8196573c123d7a9d"
```
This record stores the URL where we access the catalog information as `SiteURL`,
some basic metadata fields that we want to remain consistent, such as `SiteName` and `Spatial`,
a default bounding box, `DefaultBbox`
(defined in [default_bbox.csv](https://github.com/UWM-Libraries/GeoDiscovery-Utils/blob/main/opendataharvest/default_bbox.csv) by default),
to use when the script is unable to parse spatial information from the dataset or the information is missing,
and three lists: `MapList`, `AppList`, and `SkipList`.
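
The bounding-box fallback described above can be sketched as follows. This is a minimal illustration, not the script's actual logic: `resolve_bbox` is a hypothetical helper, the layout of default_bbox.csv is not shown in this document, so the defaults are passed in as a plain name-to-bbox dictionary, and the coordinates are approximate placeholders rather than values copied from that file.

```python
def resolve_bbox(parsed_bbox, site, default_bboxes):
    """Prefer the bbox parsed from the dataset; otherwise use the site's default."""
    if parsed_bbox:
        return parsed_bbox
    # Fall back to the named default, or None if the name is not defined.
    return default_bboxes.get(site.get("DefaultBbox"))


# Placeholder data: the bbox string and its format are assumptions.
default_bboxes = {"Wisconsin": "ENVELOPE(-93.0,-86.5,47.3,42.4)"}
site = {"SiteName": "DHS", "DefaultBbox": "Wisconsin"}

fallback = resolve_bbox(None, site, default_bboxes)
kept = resolve_bbox("ENVELOPE(-90.0,-89.0,44.0,43.0)", site, default_bboxes)
```
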
The DCAT harvest script assigns special metadata attributes to datasets named in these lists.
Datasets in the `AppList` are assigned the Resource Class "Websites".
Datasets in the `MapList` are assigned the Resource Class "Maps".
As the name implies, the script skips datasets listed in the `SkipList`.
These are typically links to other open data portals, placeholder records, and copies of data from other repositories,
including ESRI basemaps.
We don't want to ingest these into our portal, so we add them to the `SkipList`.
Some records also have other elements, such as `DatasetPrefix`, that are not being used at this time.
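
The list handling described above might look roughly like the sketch below. The helper names are hypothetical, and how the real script matches a dataset to a `UUID` entry is an assumption here (a simple string comparison). The fallback class "Datasets" mirrors the `RESOURCECLASS` default shown earlier. An empty value such as `MapList: ""` is treated as an empty list.

```python
def listed_uuids(site, list_name):
    """Collect the UUIDs in one of a site's lists, tolerating empty values."""
    entries = site.get(list_name) or []  # covers MapList: "" and missing keys
    return {entry["UUID"] for entry in entries}


def should_skip(dataset_uuid, site):
    """True when the dataset is named in the site's SkipList."""
    return dataset_uuid in listed_uuids(site, "SkipList")


def resource_class(dataset_uuid, site, default="Datasets"):
    """Return "Websites" for AppList entries, "Maps" for MapList entries."""
    if dataset_uuid in listed_uuids(site, "AppList"):
        return "Websites"
    if dataset_uuid in listed_uuids(site, "MapList"):
        return "Maps"
    return default


# A trimmed copy of the DHS_OpenData record, as it would look once parsed.
dhs = {
    "MapList": "",
    "AppList": [{"UUID": "e1ca38bf16f54fb8ac879b386dbce422"}],
    "SkipList": [{"UUID": "ca921d70bdd84ae8bc84cd09abd822d7"}],
}
app_class = resource_class("e1ca38bf16f54fb8ac879b386dbce422", dhs)
skipped = should_skip("ca921d70bdd84ae8bc84cd09abd822d7", dhs)
```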
