ulb-sachsen-anhalt/ocrd-odem

ULB ODEM

A project of the University and State Library Sachsen-Anhalt (ULB Sachsen-Anhalt) for OCR-D Phase III, funded by the DFG from 2021 to 2024, to implement an OCR-D-based workflow that generates full text for existing digitized prints of "Drucke des 18. Jahrhunderts (VD18)".

Digitized prints are accessed as records via OAI-PMH from a record list which, at project start, contained about 40,000 prints (monographs and multivolume works) with about 6 million pages in total. The corresponding images are loaded onto a local worker machine, and each page is then processed individually with a complete OCR-D workflow. Afterwards, the results are transformed into ALTO-OCR, and an archive file is generated that contains a new complete PDF of the print with a text layer. The resulting archive file complies with the SAF file format used by DSpace systems such as Share_it.
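As a rough illustration of the record access step, the following sketch builds a standard OAI-PMH `GetRecord` request URL. The base URL and record identifier are placeholders, and the helper name `build_get_record_url` is hypothetical; it is not taken from the ODEM code base.

```python
from urllib.parse import urlencode


def build_get_record_url(base_url: str, identifier: str,
                         metadata_prefix: str = "mets") -> str:
    """Build an OAI-PMH GetRecord request URL for a single record."""
    query = urlencode({
        "verb": "GetRecord",            # OAI-PMH verb for fetching one record
        "metadataPrefix": metadata_prefix,
        "identifier": identifier,       # percent-encoded by urlencode
    })
    return f"{base_url}?{query}"


# Placeholder endpoint and identifier, for illustration only.
url = build_get_record_url("https://example.org/oai", "oai:example.org:1234")
print(url)
```

The response to such a request would carry the METS/MODS metadata from which image locations and structure information can be read.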

Features

  • Monitors computing resources (RAM / disk space)
  • Runs either in a virtual environment using local mount points or on isolated server machines
  • Processes prints at page level: if an error or problem occurs, only the affected page is lost
  • Uses print metadata (MODS) to select the matching OCR model configuration
  • Uses print metadata (METS) to exclude pages from OCR by blacklisting logical or physical structures
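The METS-based page filter can be pictured roughly as follows. This is a minimal sketch, not the project's implementation: it assumes a METS document with a logical and a physical `structMap` connected via `structLink`, and the function name `pages_to_ocr` as well as the structure types in the sample (`cover_front`, `section`) are illustrative. Only blacklisting by logical structure is shown.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
METS_DIV = "{http://www.loc.gov/METS/}div"
XLINK_FROM = "{http://www.w3.org/1999/xlink}from"
XLINK_TO = "{http://www.w3.org/1999/xlink}to"


def pages_to_ocr(mets_xml: str, blacklisted_types: set[str]) -> list[str]:
    """Return IDs of physical pages not linked to a blacklisted logical type."""
    root = ET.fromstring(mets_xml)
    logical = root.find("mets:structMap[@TYPE='LOGICAL']", NS)
    physical = root.find("mets:structMap[@TYPE='PHYSICAL']", NS)
    # Logical divisions whose TYPE is on the blacklist
    blocked_logical = {div.get("ID") for div in logical.iter(METS_DIV)
                       if div.get("TYPE") in blacklisted_types}
    # Physical pages linked from those logical divisions
    blocked_pages = {link.get(XLINK_TO)
                     for link in root.iterfind("mets:structLink/mets:smLink", NS)
                     if link.get(XLINK_FROM) in blocked_logical}
    return [div.get("ID") for div in physical.iter(METS_DIV)
            if div.get("TYPE") == "page" and div.get("ID") not in blocked_pages]


SAMPLE_METS = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:structMap TYPE="LOGICAL">
    <mets:div ID="LOG_0000" TYPE="monograph">
      <mets:div ID="LOG_0001" TYPE="cover_front"/>
      <mets:div ID="LOG_0002" TYPE="section"/>
    </mets:div>
  </mets:structMap>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div ID="PHYS_0000" TYPE="physSequence">
      <mets:div ID="PHYS_0001" TYPE="page"/>
      <mets:div ID="PHYS_0002" TYPE="page"/>
    </mets:div>
  </mets:structMap>
  <mets:structLink>
    <mets:smLink xlink:from="LOG_0001" xlink:to="PHYS_0001"/>
    <mets:smLink xlink:from="LOG_0002" xlink:to="PHYS_0002"/>
  </mets:structLink>
</mets:mets>
""".strip()

remaining = pages_to_ocr(SAMPLE_METS, {"cover_front"})
print(remaining)
```

In this sample, the page linked to the front cover is dropped and only the page belonging to the section remains for OCR.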

Runtime Requirements

  • Minimum: Ubuntu Linux Server 20.04 LTS with 12 GB RAM / 8 CPUs (Recommended: 24 GB RAM / 12 CPUs)
  • Docker CE 19.03.13
  • Python 3.10
  • git, zip

Installation

# clone
git clone <repo-url> <local-dir>
cd <local-dir>

# setup python venv and activate it
python3 -m venv venv
. venv/bin/activate
pip install -U pip
pip install -r requirements.txt

# run tests
python -m pip install pytest-cov
python -m pytest --cov=lib tests/ -v

Configuration

Options can be found in the following sections:

  • [resource-monitoring] : monitor limits for disk and virtual memory usage
  • [mets] : blacklists for pages/logical sections
  • [ocr] : OCR-D-Container image, language model configuration mappings
  • [derivans] : Derivans container image and configuration

See for example resources/odem.ocrd.tesseract.ini.
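To give an idea of how such a sectioned INI file can be consumed, here is a small sketch using Python's standard `configparser`. The section names come from the list above; every option name and value in the sample (`min_free_disk_bytes`, `blacklist_logical_types`, `image`) is an assumption made for illustration and may not match the real configuration file.

```python
import configparser
import shutil

# Hypothetical sample; the option names are NOT taken from the project.
SAMPLE_INI = """
[resource-monitoring]
min_free_disk_bytes = 1

[mets]
blacklist_logical_types = cover_front, cover_back

[ocr]
image = example/ocrd-image:latest

[derivans]
image = example/derivans:latest
"""


def load_settings(ini_text: str) -> configparser.ConfigParser:
    """Parse INI text into a ConfigParser instance."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser


def blacklisted_types(parser: configparser.ConfigParser) -> set[str]:
    """Split a comma-separated option into a set of structure types."""
    raw = parser.get("mets", "blacklist_logical_types", fallback="")
    return {item.strip() for item in raw.split(",") if item.strip()}


def enough_disk_space(parser: configparser.ConfigParser, path: str = ".") -> bool:
    """Compare free disk space at `path` with the configured minimum."""
    required = parser.getint("resource-monitoring", "min_free_disk_bytes",
                             fallback=0)
    return shutil.disk_usage(path).free >= required


settings = load_settings(SAMPLE_INI)
print(settings.sections())
print(sorted(blacklisted_types(settings)))
print(enough_disk_space(settings))
```

Using `fallback` values keeps the sketch tolerant of missing options, which is convenient when several worker-specific configuration files share the same code.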

Trigger Workflow via Crontab

Usually there is a record list (a simple CSV file) in the backend, managed by the cli_record_server.py module, which needs to be started first. Please note that no authentication restrictions are included. Make sure it runs only in closed network environments.

ODEM client instances can be executed periodically, triggered by cron job entries on the server. Assuming there is a local installation in /home/ocr/odem and custom configurations are located in <PROJECT>/resources/, it may look like this:

Start server process:

cd /home/ocr/odem
python cli_record_server.py resources/odem.ocrd.tesseract.ini

Crontab entry for executing the actual worker:

PYTHON_BIN=/home/ocr/odem/venv/bin/python3
PROJECT=/home/ocr/odem
RECORD_LIST=oai-records-opendata-vd18-odem

*/5  08-23  * * *  ${PYTHON_BIN} ${PROJECT}/cli_record_server_client.py ${RECORD_LIST} -c ${PROJECT}/resources/odem.ocr-worker01.ini -l

License

This project's source code is licensed under the terms of the MIT license.
