Character-based Canadian address parser mockup, built with machine learning and natural language processing. It uses a recurrent neural network trained on data randomly generated by a context-free grammar. A short technical report on the implementation is available on my website, which I encourage you to read.
NOTE: this work is experimental, and the provided model is not suitable for production!
The key motivation for this work is to serve as a portfolio project. It is a voluntary exercise, on my part, in assembling data and a machine learning model to tackle a nontrivial natural language task. The project extends and improves upon a past class project (statistical learning), completed when my machine learning background was in its infancy.
The choice of address parsing is inspired by my past work at Statistics Canada, where I was responsible for helping assemble public Canadian infrastructure datasets from open sources. We wanted to tokenize addresses into a house number, street name, street type, etc., but many of the datasets we encountered had unsplit address strings.
All the scripts provided are written in Python and have been run using Python 3.11.3. The key packages used are:
| Package | Version |
|---|---|
| numpy | 1.24.3 |
| pandas | 2.0.2 |
| unidecode | 1.3.6 |
| torch | 2.0.1 |
| matplotlib | 3.7.1 |
| seaborn | 0.12.2 |
| scikit-learn | 1.2.2 |
For convenience, I have included `requirements.txt` to install the above packages via `pip`.
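For example, from the repository root (assuming `pip` is available for your Python 3.11 installation):

```shell
# Install the pinned package versions listed above
pip install -r requirements.txt
```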
It may be possible to run the scripts with older versions of these packages or of Python, but I have not tested this.
NOTE that to run any of the main scripts, you must do so from the repository root! Any results produced by scripts, such as figures or model checkpoints, are self-contained and will appear in folders created in the repository root.
- `address/`: Tools to randomly generate addresses, augment text data with typos, and other text processing utilities.
- `datasets/`: Used to randomly generate and save address data. For more information on the data appearing therein, read here.
- `scripts/`: Helper scripts to extract outside data. These are not part of the workflow and can be ignored.
These are presented in order of execution if starting from scratch.
- `generate_data.py`: Randomly generate addresses.
- `model.py`: Neural network model definition, termed CCAPNet (Canadian Civic Address Parser neural network).
- `train.py`: Train the model. Has command line arguments that can be viewed with `-h` or `--help`.
- `inference.py`: Use the model for inference on input text. Has command line arguments that can be viewed with `-h` or `--help`.
- `plot_*.py`: Create figures for metrics and performance. Accepts a specific input from the model folder produced by `train.py`.
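To illustrate what "character-based" means in practice, here is a minimal sketch of encoding an address string as a sequence of integer indices, one per character, as an RNN would consume it. The vocabulary and encoder below are hypothetical; the model's actual input handling lives in `model.py` and `inference.py`.

```python
import string

# Hypothetical character vocabulary: lowercase letters, digits, and a few
# punctuation marks common in addresses. Index 0 is reserved for unknowns.
CHARS = string.ascii_lowercase + string.digits + " -'"
CHAR_TO_IDX = {c: i + 1 for i, c in enumerate(CHARS)}

def encode(address: str) -> list[int]:
    """Map each character of a lowercased address to an integer index."""
    return [CHAR_TO_IDX.get(c, 0) for c in address.lower()]

encode("12 Main St")  # one index per character, length 10
```

A character-level model then predicts a label (house number, street name, street type, etc.) for each index in this sequence, rather than for whole word tokens.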
The idea of character-level parsing and typo augmentation draws inspiration from Jason Rigby's AddressNet. If you are interested in address parsing or working with address data, I strongly recommend the following resources:
- libpostal: state-of-the-art international address normalizer and parser
- OpenAddresses: open global address data