FAST Times README

Description

This repository is a place to store or reference all scripts and related documents that are used in the ongoing subject reconciliation process by DOMM.

FAST Times: the original idea

FAST_times.py is a script that takes existing LCSH subjects from the DAMS along with their ID/URI (in this case, an ARK) and writes them to tab-delimited .csv files.

It also formats the subjects in order to effectively query them against the FAST API.
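As a rough illustration of this formatting step, the sketch below splits a complex heading into its component terms and percent-encodes each one for a URL query. It assumes headings use the usual LCSH `--` subdivision separator. The function names `split_subject` and `to_query_term` and the sample heading are hypothetical and are not taken from FAST_times.py.

```python
from urllib.parse import quote_plus


def split_subject(heading):
    """Split a complex heading on '--' (assumed separator) into clean terms."""
    parts = [part.strip() for part in heading.split("--")]
    return [part for part in parts if part]


def to_query_term(term):
    """Percent-encode one term so it is safe inside a URL query string."""
    return quote_plus(term)


# Sample heading for illustration only.
terms = split_subject("Water-supply -- California--San Diego")
encoded = [to_query_term(term) for term in terms]
```

Splitting before encoding keeps each component available for its own lookup, which matters when a single heading mixes topical and geographic terms.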

The FAST API returns suggestions, each containing the authorized heading, its ID, and the MARC tag number. The script writes these to a third .csv file.
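For orientation, a suggestion request might be assembled along these lines. The endpoint and the parameter names (`queryIndex`, `queryReturn`, `suggest`, `rows`) reflect my understanding of the FAST suggest API and should be treated as assumptions to check against OCLC's documentation. `build_suggest_url` is a hypothetical helper, not a function from FAST_times.py.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter names; verify against the FAST API docs.
FAST_SUGGEST_URL = "https://fast.oclc.org/searchfast/fastsuggest"


def build_suggest_url(term, rows=5):
    """Return a suggestion query URL asking for heading, ID, and MARC tag."""
    params = {
        "query": term,
        "queryIndex": "suggestall",
        "queryReturn": "suggestall,idroot,auth,tag",
        "suggest": "autoSubject",
        "rows": rows,
    }
    return FAST_SUGGEST_URL + "?" + urlencode(params)


url = build_suggest_url("San Diego")
```

The sketch only builds the URL. Fetching it, handling errors, and throttling requests are left out.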

concatenate.py is a script that then concatenates the three .csv files (using the Python library pandas) into one master spreadsheet.
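A minimal sketch of that concatenation, assuming the three files each hold one column with one row per subject in the same order, so they can be joined side by side. The column names and sample values are made up, and `concatenate` here is an illustration rather than the contents of concatenate.py.

```python
import io

import pandas as pd


def concatenate(sources):
    """Read tab-delimited sources and join them side by side, row for row."""
    frames = [pd.read_csv(source, sep="\t", dtype=str) for source in sources]
    return pd.concat(frames, axis=1)


# In-memory stand-ins for the three files; all values are made up.
subjects = io.StringIO("subject\nWater-supply\nCalifornia\n")
arks = io.StringIO("ark\nark:/00000/example1\nark:/00000/example2\n")
responses = io.StringIO("fast_response\n[]\n[]\n")

master = concatenate([subjects, arks, responses])
```

File paths can be passed in place of the `StringIO` objects, and `master.to_csv("master.csv", sep="\t", index=False)` would write the combined sheet back out.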

This spreadsheet can then be imported into OpenRefine for data wrangling. In this case, OpenRefine is used to parse the JSON responses from the API, create one row per response, parse the IDs into valid URLs, and display the relevant MARC tag.
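The project does this reshaping inside OpenRefine. Purely to make the steps concrete, here is the same idea expressed in Python. The response shape (`response.docs` with `auth`, `idroot`, and `tag` fields) and the `http://id.worldcat.org/fast/` URI pattern are assumptions, and the sample IDs are invented.

```python
import json

FAST_URI_BASE = "http://id.worldcat.org/fast/"  # assumed URI pattern


def id_to_url(idroot):
    """Turn an ID such as 'fst00000123' into a URL (assumed convention)."""
    digits = idroot[3:] if idroot.startswith("fst") else idroot
    return FAST_URI_BASE + digits.lstrip("0")


def expand_rows(subject, payload):
    """Return one row per suggestion found in a JSON response string."""
    docs = json.loads(payload).get("response", {}).get("docs", [])
    return [
        {
            "subject": subject,
            "heading": doc.get("auth"),
            "url": id_to_url(doc.get("idroot", "")),
            "tag": doc.get("tag"),
        }
        for doc in docs
    ]


# Made-up response for illustration; the IDs are not real FAST identifiers.
sample = json.dumps({"response": {"docs": [
    {"auth": "Example heading", "idroot": "fst00000123", "tag": 150},
    {"auth": "Example place", "idroot": "fst00000456", "tag": 151},
]}})
rows = expand_rows("Example subject", sample)
```

Each suggestion becomes its own row, so a subject with two suggestions yields two rows that share the same `subject` value.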

Not-so-FAST Times: the backup plan

In not-so-fast times, we need a different solution. less_FAST_times.py strips the API calls out of the original script, handles only the subject string formatting, and relies on an existing OpenRefine reconciliation script (credit to Christina Harlow) for the matching.

The rationale for using OpenRefine here is that, with thousands of these subjects, we need automatch functionality, since we do not have the time or resources to make that many manual matches. This process still involves some manual matching, but only on 5%-10% of the terms instead of ~90%.

Reference to other scripts

Once in OpenRefine, a separate script uses OpenRefine's own reconciliation functionality to generate a reconciliation service. My slight tweak to the original script is that, instead of returning a ranked number reflecting the match accuracy of the term, it returns the MARC tag number that the subject/genre falls under. We use the MARC tag because all of our local topics are 'complex', meaning they can be a combination of different types (geographic, topical, etc.). Decomposing these subjects while acquiring the MARC tag tells us what type of subject each match is, which we can then perhaps run against different vocabularies like GeoNames. You can see my tweak/fork here.
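To show the shape of that tweak, the sketch below builds a reconciliation candidate whose `score` field carries the MARC tag rather than a match-accuracy ranking. The candidate keys (`id`, `name`, `score`, `match`, `type`) follow the general OpenRefine reconciliation response layout. The input field names and the helper `to_candidate` are assumptions and are not code from the fork.

```python
# Common FAST facets keyed by MARC tag number.
MARC_TAG_TYPES = {
    100: "personal name",
    110: "corporate name",
    111: "event",
    130: "uniform title",
    148: "chronological period",
    150: "topical",
    151: "geographic",
    155: "form/genre",
}


def to_candidate(doc):
    """Build a reconciliation candidate whose score carries the MARC tag."""
    tag = int(doc["tag"])
    label = MARC_TAG_TYPES.get(tag, "unknown")
    return {
        "id": doc["idroot"],
        "name": doc["auth"],
        "score": tag,  # MARC tag in place of a match-accuracy score
        "match": False,  # left for automatch or a human to decide
        "type": [{"id": str(tag), "name": label}],
    }


# Made-up suggestion; the ID is not a real FAST identifier.
candidate = to_candidate(
    {"idroot": "fst00000456", "auth": "Example place", "tag": "151"}
)
```

With the tag exposed this way, geographic matches (tag 151) can be filtered out in OpenRefine and sent on to a place-name vocabulary.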

Reference to other reconciliation services

FAST is a big focus of the project, but we are reconciling to linked data vocabularies using these services:

TODO: Once the workflow is confirmed, merge the contents of workflow.md into this document.