city-directory-entry-parser parses lines from OCR’d New York City directories into separate fields, such as names, occupations, and addresses.
city-directory-entry-parser is part of NYPL’s NYC Space/Time Directory project.
For more tools that are used to turn digitized city directories into datasets, see Space/Time’s City Directories repository.
This module relies on the sklearn-crfsuite implementation of a conditional random fields algorithm.
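To give a rough idea of what a conditional random field works with, the sketch below turns one directory line into per-token feature dictionaries, the general input shape sklearn-crfsuite accepts. The tokenizer and every feature name here are illustrative assumptions; they are not taken from cdparser's own Features module.

```python
import re


def tokenize(line):
    # Split a line into word tokens and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", line)


def token_features(tokens, i):
    # Build a feature dict for the token at position i.
    # All feature names are hypothetical, chosen for illustration only.
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "has_digit": any(c.isdigit() for c in tok),
        "is_punct": not any(c.isalnum() for c in tok),
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


tokens = tokenize("Calder William W, clerk, 206 W. 24th")
features = [token_features(tokens, i) for i in range(len(tokens))]
print(features[6])
```

A CRF labels each token using its own features together with those of its neighbours, which is why the previous and next tokens are included in each dictionary.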
Input:
"Calder William W, clerk, 206 W. 24th"
Output:
{
  "subjects": [
    "Calder William W"
  ],
  "occupations": [
    "clerk"
  ],
  "addresses": [
    [
      "206 W . 24th"
    ]
  ]
}
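A minimal sketch of how the output above might be consumed downstream, using only the standard library. The helper names `tidy` and `flatten_addresses` are hypothetical. The sketch assumes that the space before the period in "206 W . 24th" is a tokenization artifact that can safely be removed, and that the inner lists under "addresses" can be flattened; neither is confirmed by the parser itself.

```python
import json
import re

# The sample output shown above, as a JSON string.
raw = """
{
  "subjects": ["Calder William W"],
  "occupations": ["clerk"],
  "addresses": [["206 W . 24th"]]
}
"""


def tidy(text):
    # Remove whitespace left before punctuation ("W . 24th" -> "W. 24th").
    return re.sub(r"\s+([.,;:])", r"\1", text)


def flatten_addresses(parsed):
    # "addresses" is a list of lists in the sample output; flatten it
    # into a single list of cleaned-up strings.
    return [tidy(part) for group in parsed.get("addresses", []) for part in group]


parsed = json.loads(raw)
print(parsed["subjects"][0])
print(flatten_addresses(parsed))
```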
If the output contains an address field, nyc-street-normalizer can be used to turn this abbreviated address into a full address (e.g. 668 Sixth av. ⟶ 668 Sixth Avenue).
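To make the idea of expanding an abbreviated address concrete, here is a toy sketch that rewrites a trailing street-type abbreviation. It is not nyc-street-normalizer's API or method; the `SUFFIXES` table and the `expand_suffix` helper are hypothetical and cover only a few cases.

```python
import re

# Hypothetical abbreviation table, for illustration only.
SUFFIXES = {"av": "Avenue", "st": "Street", "pl": "Place"}


def expand_suffix(address):
    # Expand a known abbreviation in the final word of the address,
    # ignoring an optional trailing period. Unknown words are left alone.
    match = re.match(r"^(.*\s)(\w+)\.?$", address.strip())
    if not match:
        return address
    head, last = match.groups()
    full = SUFFIXES.get(last.lower())
    return head + full if full else address


print(expand_suffix("668 Sixth av."))
print(expand_suffix("109 Cedar"))
```

Directory addresses often omit the street type entirely (as in "109 Cedar"), so a lookup table like this is not enough for real normalization; that is the kind of case a dedicated tool is meant to handle.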
city-directory-entry-parser depends on the following Python modules:
numpy
sklearn
nltk
scipy
sklearn_crfsuite
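Before running the parser, it can help to confirm that these modules are importable. The following sketch uses only the standard library; the `missing_modules` helper is a hypothetical name, and the check says nothing about which versions are installed.

```python
import importlib.util

# Module names as listed above.
REQUIRED = ["numpy", "sklearn", "nltk", "scipy", "sklearn_crfsuite"]


def missing_modules(names):
    # Return the names that cannot be found on the current import path.
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Note that the import name `sklearn` is provided by the scikit-learn package, so the install name and the import name differ.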
From Python:
import json
from cdparser import Classifier, Features, LabeledEntry, Utils
## Create a classifier object and load some labeled data from a CSV
classifier = Classifier.Classifier()
classifier.load_training("/full/path/to/training/nypl-labeled-train.csv")
## Optionally, load validation dataset
classifier.load_validation("/full/path/to/validation/nypl-labeled-validate.csv")
## Train your classifier (with default settings)
classifier.train()
## Create an entry object from string
entry = LabeledEntry.LabeledEntry("Cappelmann Otto, grocer, 133 VVashxngton, & liquors, 170 Greenwich, h. 109 Cedar")
## Pass the entry to the classifier
classifier.label(entry)
## Export the labeled entry as JSON
json.dumps(entry.categories)
From bash (using parse.py):
cat /path/to/nypl-1851-1852-entries-sample.txt | python3 parse.py --training /path/to/nypl-labeled-70-training.csv