This repository contains annotations for our mixed-typeface, mixed-context (body text and table text) OCR training corpus, built by randomly sampling more than 1000 pages from the Berliner Börsen-Zeitung. We tried to balance the number of pages set in Fraktur and in Antiqua, as well as the number of lines drawn from each year.
To view/edit annotations you need the following artefacts:
- images of full pages
- line segmentation for each page
- annotations for selected lines (annotations.db)
To obtain the page images and the line segmentation, use one of these two:
- Mini Version (1 GB). Intended for viewing text annotations and for trying things out.
- Full Version (6.8 GB). Intended for editing annotations and exporting training data.
Now unzip annotate.db.zip from this repository, put annotate.db inside the images folder, then run:
python -m origami.tool.annotate /path/to/images
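The unzip, copy, and launch steps above can also be scripted. The following is a minimal sketch, not part of this repository: the helper names `install_annotations` and `annotate_command` are hypothetical, and it assumes that the archive contains a member whose file name is annotate.db. Only the module name `origami.tool.annotate` and the file names are taken from the instructions above.

```python
import subprocess
import sys
import zipfile
from pathlib import Path


def install_annotations(archive: Path, images_dir: Path,
                        db_name: str = "annotate.db") -> Path:
    """Extract db_name from the zip archive into images_dir and return its path."""
    if not images_dir.is_dir():
        raise NotADirectoryError(f"images folder not found: {images_dir}")
    with zipfile.ZipFile(archive) as zf:
        # Assumption: the database is stored in the archive under its plain
        # file name, possibly inside a subfolder.
        members = [m for m in zf.namelist() if Path(m).name == db_name]
        if not members:
            raise FileNotFoundError(f"{db_name} not found in {archive}")
        target = images_dir / db_name
        # Write the member's bytes directly so that any folder prefix
        # inside the archive is dropped.
        target.write_bytes(zf.read(members[0]))
    return target


def annotate_command(images_dir: Path) -> list[str]:
    """Build the launch command shown above, using the current interpreter."""
    return [sys.executable, "-m", "origami.tool.annotate", str(images_dir)]


def install_and_launch(archive: Path, images_dir: Path) -> None:
    """Place the database in the images folder, then start the annotation tool."""
    install_annotations(archive, images_dir)
    subprocess.run(annotate_command(images_dir), check=True)
```

Calling `install_and_launch(Path("annotate.db.zip"), Path("/path/to/images"))` would then perform all three steps. Using `sys.executable` rather than a bare `python` keeps the tool running in the same environment in which origami is installed.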