ULB ODEM

A project of the University and State Library Sachsen-Anhalt (ULB Sachsen-Anhalt) for OCR-D Phase III, funded by the DFG from 2021 to 2024, to implement an OCR-D-based workflow that generates full text for existing digitized prints of "Drucke des 18. Jahrhunderts" (VD18).

Digitized prints are accessed as records via OAI-PMH from a record list which, at project start, included about 40,000 prints (monographs and multivolume works) with about 6 million pages in total. The corresponding images are loaded onto a local worker machine, and each page is then processed individually with a complete OCR-D workflow. Afterwards, the results are transformed into ALTO OCR, and an archive file is generated that contains a new, complete PDF of the print with a text layer. The resulting archive file complies with the SAF (Simple Archive Format) file format of DSpace systems such as Share_it.
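As a minimal sketch of the first step, the following builds an OAI-PMH `GetRecord` request URL for a single record. The verb and parameter names come from the OAI-PMH protocol; the endpoint, the identifier and the helper name `build_get_record_url` are hypothetical, and this is not taken from the ODEM code itself.

```python
from urllib.parse import urlencode


def build_get_record_url(base_url: str, identifier: str,
                         metadata_prefix: str = "mets") -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": identifier,
    })
    return f"{base_url}?{query}"


# Hypothetical endpoint and record identifier, for illustration only.
url = build_get_record_url("https://example.org/oai", "oai:example.org:123")
print(url)
```

The response to such a request would carry the METS/MODS metadata from which image URLs and page structure can be read.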

Features

Monitors computing resources (RAM / disk space)
Runs either in a virtual environment using local mount points or on isolated server machines
Processes prints at page level: if an error occurs, only the affected page is lost
Uses print metadata (MODS) to select the matching OCR model configuration
Uses print metadata (METS) to exclude pages from OCR by blacklisting logical or physical structure types
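The page-level processing and blacklisting described above can be sketched roughly as follows. This is an illustration under assumptions, not the ODEM implementation: `process_pages`, `PrintResult`, the page tuples and the structure labels are all hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class PrintResult:
    """Outcome of processing one print, tracked per page."""
    done: list = field(default_factory=list)
    skipped: list = field(default_factory=list)
    failed: dict = field(default_factory=dict)


def process_pages(pages, ocr_page, blacklist_logical=frozenset()):
    """Run ocr_page on every page that is not blacklisted.

    pages: iterable of (page_id, logical_type) tuples (hypothetical layout).
    A failure on one page is recorded and does not stop the other pages.
    """
    result = PrintResult()
    for page_id, logical_type in pages:
        if logical_type in blacklist_logical:
            result.skipped.append(page_id)
            continue
        try:
            ocr_page(page_id)
            result.done.append(page_id)
        except Exception as exc:  # isolate the error to this single page
            result.failed[page_id] = str(exc)
    return result


def fake_ocr(page_id):
    """Stand-in for the real OCR step; fails on one page on purpose."""
    if page_id == "p3":
        raise RuntimeError("simulated OCR failure")


pages = [("p1", "cover_front"), ("p2", "section"),
         ("p3", "section"), ("p4", "section")]
result = process_pages(pages, fake_ocr, blacklist_logical={"cover_front"})
print(result)
```

Collecting failures per page instead of aborting means the remaining pages of a print still yield OCR results.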

Runtime Requirements

Minimum: Ubuntu Linux Server 20.04 LTS with 12 GB RAM / 8 CPUs (Recommended: 24 GB RAM / 12 CPUs)
Docker CE 19.03.13
Python 3.10
git, zip

Installation

# clone
git clone <repo-url> <local-dir>
cd <local-dir>

# set up and activate python venv
python3 -m venv venv
. venv/bin/activate
pip install -U pip
pip install -r requirements.txt

# run tests
python -m pip install pytest-cov
python -m pytest --cov=lib tests/ -v

Configuration

Options can be found in the following sections of the configuration file:

[resource-monitoring] : monitor limits for disk and virtual memory usage
[mets] : blacklists for pages/logical sections
[ocr] : OCR-D-Container image, language model configuration mappings
[derivans] : Derivans container image and configuration

For an example, see resources/odem.ocrd.tesseract.ini.
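A minimal sketch of how such an INI file could be read with Python's standard `configparser` is shown below. The section names are the ones listed above; the option names (`max_disk_percentage`, `blacklist_logical_containers`) and the helper functions are assumptions for illustration, so consult the example file for the actual keys.

```python
import configparser
import shutil

# Sample content; section names follow the README, option names are hypothetical.
SAMPLE_INI = """
[resource-monitoring]
# hypothetical option: upper limit for disk usage in percent
max_disk_percentage = 95

[mets]
# hypothetical option: comma-separated logical structure types to skip
blacklist_logical_containers = cover_front, cover_back
"""


def load_config(text: str) -> configparser.ConfigParser:
    """Parse INI text into a ConfigParser instance."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


def split_list(raw: str) -> list:
    """Turn a comma-separated option value into a list of clean strings."""
    return [item.strip() for item in raw.split(",") if item.strip()]


def disk_limit_exceeded(path: str, max_percent: float) -> bool:
    """Return True if disk usage at path is above max_percent."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100 > max_percent


cfg = load_config(SAMPLE_INI)
limit = cfg.getint("resource-monitoring", "max_disk_percentage")
blacklist = split_list(cfg.get("mets", "blacklist_logical_containers"))
print(limit, blacklist, disk_limit_exceeded(".", limit))
```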

Trigger Workflow via Crontab

Usually there is a record list (a simple CSV file) in the backend, managed by the cli_record_server.py module, which needs to be started first. Please note that no authentication restrictions are included, so make sure the server runs only in closed network environments.
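To illustrate the idea of a CSV record list, the sketch below picks the next record that has not been processed yet. The column names (`IDENTIFIER`, `STATE`), the state values and the function `next_open_record` are assumptions made for this example; the actual list format used by cli_record_server.py may differ.

```python
import csv
import io

# Hypothetical record list; the real column layout may differ.
SAMPLE_LIST = """IDENTIFIER,STATE
oai:example.org:1,done
oai:example.org:2,open
oai:example.org:3,open
"""


def next_open_record(handle, open_state="open"):
    """Return the identifier of the first record still marked as open."""
    reader = csv.DictReader(handle)
    for row in reader:
        if row["STATE"] == open_state:
            return row["IDENTIFIER"]
    return None  # nothing left to process


record_id = next_open_record(io.StringIO(SAMPLE_LIST))
print(record_id)
```

A server process handing out records this way would also need to mark each record as taken, so that two clients never work on the same print.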

ODEM client instances can be executed periodically, triggered by cron job entries on the server. Assuming a local installation in /home/ocr/odem with custom configuration files located in <PROJECT>/resources/, the setup may look like this:

Start server process:

cd /home/ocr/odem
python cli_record_server.py resources/odem.ocrd.tesseract.ini

Crontab entry for executing the actual worker:

PYTHON_BIN=/home/ocr/odem/venv/bin/python3
PROJECT=/home/ocr/odem
RECORD_LIST=oai-records-opendata-vd18-odem

*/5  08-23  * * *  ${PYTHON_BIN} ${PROJECT}/cli_record_server_client.py ${RECORD_LIST} -c ${PROJECT}/resources/odem.ocr-worker01.ini -l

License

This project's source code is licensed under the terms of the MIT license.
