This project involves the development and use of a machine learning model to automatically transcribe handwritten 3-digit occupation codes from the Norwegian population census of 1950.
- Model architecture: CNN-RNN with a CTC output layer (a generic structural sketch follows this list).
- Accuracy: 97% on the provided training dataset.
- Supported labels: Single digits (0-9), 't' (for text), and 'b' (for blank cells).
- Training dataset details: 30,000 manually labeled images, 264 classes, highly imbalanced distribution.
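For orientation, the sketch below shows what a CNN-RNN model with a CTC output layer typically looks like in Keras. It is not the exact architecture used in Training/ctc_training.py; the input shape, layer sizes, and character-set handling here are illustrative assumptions only.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Generic sketch of a CNN-RNN model with a CTC output layer.
# The real architecture in Training/ctc_training.py differs in its details;
# this only illustrates the overall structure: a convolutional feature
# extractor, a recurrent sequence model, and a per-time-step softmax
# that is trained with a CTC loss.
NUM_CHARS = 12  # digits 0-9 plus 't' (text) and 'b' (blank cell)

inputs = layers.Input(shape=(128, 32, 1), name="image")  # width x height x channels
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D((2, 2))(x)
# Collapse the height axis so every position along the width becomes a time step.
x = layers.Reshape((32, 8 * 64))(x)
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
# One extra output unit for the CTC blank token.
outputs = layers.Dense(NUM_CHARS + 1, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
model.summary()
```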
To run the model you need a database of images. Update the variables in Testing/inference_runner.py and Testing/inference.py with your database and table information, as well as the path to the model; the places where an update is required are marked in the Python scripts. After that, run the inference_runner.py script.
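As a rough illustration of the kind of variables to fill in (the actual names in Testing/inference.py and Testing/inference_runner.py may differ; `DB_PATH`, `TABLE_NAME`, and `MODEL_PATH` below are placeholders, and SQLite is only an example backend):

```python
import sqlite3
import tensorflow as tf

# Placeholder configuration -- replace with your own database, table, and
# model path, matching the marked sections in the Testing/ scripts.
DB_PATH = "path/to/your/image_database.db"
TABLE_NAME = "census_images"
MODEL_PATH = "path/to/trained_model"

# Load the trained CNN-RNN-CTC model once and reuse it for every image.
model = tf.keras.models.load_model(MODEL_PATH)

# Fetch the images to transcribe from your table.
connection = sqlite3.connect(DB_PATH)
rows = connection.execute(f"SELECT image_id, image_blob FROM {TABLE_NAME}").fetchall()
```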
The script for training your own model can be found in Training/ctc_training.py. By default, you provide training images by adding the path to a folder containing your training set images; this reflects the directory hierarchy of the provided training dataset. Fetching training images can also be switched to a database solution. The important part is that the script ends up with parallel lists of images and 3-digit string labels, as in the sketch below. Once ctc_training.py has been updated with the path to the training set images, run the script.
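The snippet below is a minimal sketch of that contract: it walks a folder of images and produces parallel lists of image paths and 3-digit string labels. It assumes, purely for illustration, that each filename begins with its label (e.g. 123_0001.jpg); adapt the label lookup to however your labels are actually stored, or replace the folder walk with a database query.

```python
from pathlib import Path

def build_training_lists(image_dir):
    """Return parallel lists of image paths and 3-digit string labels.

    Assumes each filename starts with its label, e.g. "123_0001.jpg".
    """
    image_paths, labels = [], []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        label = path.stem.split("_")[0]   # "123_0001" -> "123"
        if len(label) == 3:               # keep only 3-character codes
            image_paths.append(str(path))
            labels.append(label)
    return image_paths, labels

image_paths, labels = build_training_lists("path/to/training_set_images")
```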
Please note that this model was trained with TensorFlow 2.13 and will require the same version if you wish to retrain it.
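A quick sanity check before retraining (not part of the provided scripts, just a convenience):

```python
import tensorflow as tf

# The model was trained with TensorFlow 2.13; use the same minor version
# when retraining so the saved model and training code stay compatible.
if not tf.__version__.startswith("2.13"):
    raise RuntimeError(f"TensorFlow 2.13.x is required, found {tf.__version__}")
```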
The training dataset used in the project was manually labeled by our team at HistLab using a custom GUI (details are available in the accompanying paper).
For questions or feedback, please reach out to the project maintainers at [email protected]
For more details about the project, the custom GUI used for labeling, and general lessons learned from creating an ML transcription pipeline, please refer to the accompanying paper. More information about the GUI and the manual work done to validate the model's outputs can be found in our follow-up paper.