A neural network based detector for handwritten words.
- Download the trained model and place the unzipped files into the `model` directory
- Go to the `src` directory and execute `python infer.py`
- This opens a window showing the words detected in the test images (located in `data/test`)
- Required libraries: torch, numpy, sklearn, cv2, path, matplotlib
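Whether these libraries are importable can be verified with a short snippet like the one below; it is only an illustrative check (not part of the repository), and it tests import names, not pip package names (e.g. `cv2` is installed as opencv-python, `sklearn` as scikit-learn):

```python
# Check that the required libraries listed above are importable.
import importlib.util

required = ["torch", "numpy", "sklearn", "cv2", "path", "matplotlib"]
missing = [name for name in required
           if importlib.util.find_spec(name) is None]

if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```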
- The model is trained with the IAM dataset
- Download the forms and the XML files
- Create a dataset directory on your disk with two subdirectories: `gt` and `img`
- Put all form images into the `img` directory
- Put all XML files into the `gt` directory
- Go to `src` and execute `python train.py` with the following parameters specified (only the first one is required):
  - `--data_dir`: dataset directory containing a `gt` and an `img` directory
  - `--batch_size`: 27 images per batch are possible on an 8GB GPU
  - `--caching`: cache the dataset to avoid loading and decoding the PNG images; the cache file is stored in the dataset directory
  - `--pretrained`: initialize with saved model weights
  - `--val_freq`: speed up training by only validating every n-th epoch
  - `--early_stopping`: stop training after n validation steps without improvement
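The flags above map naturally onto an `argparse` parser; the sketch below mirrors the listed options, but the default values and help texts are assumptions, not taken from the repository:

```python
import argparse

# Illustrative parser mirroring the train.py flags listed above;
# only --data_dir is required, all default values are assumptions.
parser = argparse.ArgumentParser()
parser.add_argument("--data_dir", required=True,
                    help="dataset directory containing gt/ and img/")
parser.add_argument("--batch_size", type=int, default=27,
                    help="27 images per batch fit on an 8GB GPU")
parser.add_argument("--caching", action="store_true",
                    help="cache decoded images in the dataset directory")
parser.add_argument("--pretrained", action="store_true",
                    help="initialize with saved model weights")
parser.add_argument("--val_freq", type=int, default=1,
                    help="validate only every n-th epoch")
parser.add_argument("--early_stopping", type=int, default=50,
                    help="stop after n validation steps without improvement")

# Example invocation equivalent to: python train.py --data_dir <dir>
args = parser.parse_args(["--data_dir", "/path/to/dataset"])
```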
- The model weights are saved every time the F1 score on the validation set increases
- A log is written into the `log` directory, which can be opened with TensorBoard
- Executing `python eval.py` evaluates the trained model
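The checkpointing and early-stopping behaviour described above can be sketched as follows; function and parameter names are illustrative, not the repository's actual code:

```python
# Sketch: save weights whenever the validation F1 score improves,
# stop after `patience` validation steps without improvement.
def train_loop(validate, save_weights, patience):
    best_f1 = 0.0
    bad_steps = 0
    while True:
        f1 = validate()          # run one train/validation cycle
        if f1 > best_f1:
            best_f1 = f1         # new best F1: checkpoint and reset counter
            bad_steps = 0
            save_weights()
        else:
            bad_steps += 1       # no improvement: count towards early stopping
            if bad_steps >= patience:
                break
    return best_f1

# Demo with simulated validation scores:
scores = iter([0.10, 0.30, 0.20, 0.25, 0.28])
checkpoints = []
best = train_loop(lambda: next(scores), lambda: checkpoints.append("saved"),
                  patience=3)
print(best, len(checkpoints))  # 0.3 2: weights were saved twice, then training stopped
```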
- The model classifies each pixel into one of three classes (see plot below):
  - Inner part of a word (plot: red)
  - Outer part of a word (plot: green)
  - Background (plot: blue)
- An axis-aligned bounding box is predicted for each inner-word pixel
- DBSCAN clusters the predicted bounding boxes
- The backbone of the neural network is based on the ResNet18 model (taken from torchvision, with modifications)
- The model is inspired by the ideas of Zhou and Axler
- See this article for more details
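The clustering step above can be illustrated with sklearn's DBSCAN. The sketch below clusters boxes by their pairwise Jaccard distance (1 − IoU) and merges each cluster with a per-corner median; the distance metric and the `eps` and `min_samples` values are assumptions, not the repository's exact settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def cluster_boxes(boxes, eps=0.5, min_samples=2):
    """Cluster boxes with DBSCAN on a precomputed 1-IoU distance matrix,
    then merge each cluster into one box via the median of its corners."""
    n = len(boxes)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = 1.0 - iou(boxes[i], boxes[j])
    labels = DBSCAN(eps=eps, min_samples=min_samples,
                    metric="precomputed").fit_predict(dist)
    # label -1 marks DBSCAN noise, i.e. isolated spurious boxes
    return [np.median(boxes[labels == k], axis=0)
            for k in set(labels) if k != -1]
```

With a scheme like this, the many per-pixel boxes overlapping the same word collapse into a single merged box, while isolated outlier boxes are discarded as DBSCAN noise.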