Character-based Canadian address parser mockup, built with machine learning and natural language processing. It uses a recurrent neural network trained on data randomly generated by a context-free grammar. A short technical report on the implementation is available on my website, which I encourage you to read.
NOTE: this work is experimental, and the provided model is not suitable for production!
The key motivation for this work is to serve as a portfolio project. It is a voluntary exercise, on my part, in assembling data and a machine learning model to tackle a nontrivial natural language task. The project extends and improves upon a past class project (statistical learning), completed when my machine learning background was in its infancy.
The choice of address parsing is inspired by my past work at Statistics Canada, where I was responsible for helping assemble public Canadian infrastructure datasets from open sources. We wanted to tokenize addresses into a house number, street name, street type, etc., but many of the datasets we encountered had unsplit address strings.
All the scripts provided are written in Python and have been run using Python 3.11.3. The key packages used are:
| Package | Version |
|---|---|
| numpy | 1.24.3 |
| pandas | 2.0.2 |
| unidecode | 1.3.6 |
| torch | 2.0.1 |
| matplotlib | 3.7.1 |
| seaborn | 0.12.2 |
| scikit-learn | 1.2.2 |
For convenience, I have included `requirements.txt` to install the above packages via `pip`.
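For example, from the repository root (assuming `pip` is available for your Python 3.11 installation):

```shell
# Install the pinned package versions listed above
pip install -r requirements.txt
```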
It may be possible to run the scripts with older versions of these packages or of Python, but I have not tested this.
NOTE that to run any of the main scripts, you must do so from the repository root! Any results produced by scripts, such as figures or model checkpoints, are self-contained and will appear in folders created in the repository root.
- `address/`: Tools to randomly generate addresses, augment text data with typos, and other text processing utilities.
- `datasets/`: Used to randomly generate and save address data. For more information on the data appearing therein, read here.
- `scripts/`: Helper scripts to extract outside data. These are not part of the workflow and can be ignored.
These are presented in order of execution if starting from scratch.
- `generate_data.py`: Randomly generate addresses.
- `model.py`: Neural network model definition, termed CCAPNet (Canadian Civic Address Parser neural network).
- `train.py`: Train the model. Has command line arguments that can be viewed with `-h` or `--help`.
- `inference.py`: Use the model for inference on input text. Has command line arguments that can be viewed with `-h` or `--help`.
- `plot_*.py`: Create figures for metrics and performance. Accepts a specific input from the model folder produced by `train.py`.
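To illustrate what "character-based" means in practice, here is a minimal sketch of encoding an address string as a sequence of integer indices, one per character, as an RNN would consume it. The vocabulary and encoder below are hypothetical; the model's actual input handling lives in `model.py` and `inference.py`.

```python
import string

# Hypothetical character vocabulary: lowercase letters, digits, and a few
# punctuation marks common in addresses. Index 0 is reserved for unknowns.
CHARS = string.ascii_lowercase + string.digits + " -'"
CHAR_TO_IDX = {c: i + 1 for i, c in enumerate(CHARS)}

def encode(address: str) -> list[int]:
    """Map each character of a lowercased address to an integer index."""
    return [CHAR_TO_IDX.get(c, 0) for c in address.lower()]

encode("12 Main St")  # one index per character, length 10
```

A character-level model then predicts a label (house number, street name, street type, etc.) for each index in this sequence, rather than for whole word tokens.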
The idea of character-level parsing and typo augmentation draws inspiration from Jason Rigby's AddressNet. If you are interested in address parsing or working with address data, I strongly recommend the following resources:
- libpostal: state-of-the-art international address normalizer and parser
- OpenAddresses: open global address data