GT4HistOCR
This is an intermediate report on training for Fraktur which was done at Mannheim University Library (@UB-Mannheim). It is still unfinished, so new results will be added in the future. See the latest results (2019-10-11).
GT4HistOCR is ground truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin. See this publication for details:
Springmann, Uwe, Reul, Christian, Dipper, Stefanie, & Baiter, Johannes. (2018). GT4HistOCR: Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin (Version 1.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.1344132 (see fulltext)
Use a recent Linux distribution. Newer distributions like Debian Buster provide Tesseract, so it is not necessary to build your own Leptonica or Tesseract.
Training requires a lot of disk space, so use a working directory with at least 24 GiB of free space.
Training is also CPU-intensive. At least 4 CPU cores must be available. A fast CPU with AVX support is highly recommended.
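The requirements above can be checked before starting. The following is a minimal sketch, assuming a Linux system that provides `df`, `nproc` and `/proc/cpuinfo`. The function names are made up for illustration, and the thresholds are the ones given above.

```shell
# Free space (in GiB) of the file system that holds directory $1.
free_gib() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1048576) }'
}

# Succeeds if the CPU flags in /proc/cpuinfo mention AVX (Linux only).
has_avx() {
    grep -qw avx /proc/cpuinfo
}

# Check the working directory $1 (default: current directory).
# Returns 1 if disk space or core count is below the stated minimum.
preflight() {
    dir=${1:-.}
    status=0
    [ "$(free_gib "$dir")" -ge 24 ] || { echo "less than 24 GiB free in $dir"; status=1; }
    [ "$(nproc)" -ge 4 ] || { echo "fewer than 4 CPU cores"; status=1; }
    has_avx || echo "no AVX support found, training will be slow"
    return $status
}
```

Calling `preflight .` in the intended working directory prints one line per problem found. Missing AVX support is only reported and does not cause a failure, because AVX is recommended rather than required.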
The training data is stored in subdirectories as PNG images, while tesstrain expects a flat directory with TIFF images, so some preparation is needed. TODO: This was fixed in the latest version, so the conversion described here may no longer be necessary.
# Clone tesstrain.
git clone https://github.com/tesseract-ocr/tesstrain.git
cd tesstrain
# Get unicharset files for some scripts (needed for character properties).
cd data
wget https://github.com/tesseract-ocr/langdata_lstm/raw/master/Cyrillic.unicharset
wget https://github.com/tesseract-ocr/langdata_lstm/raw/master/Greek.unicharset
wget https://github.com/tesseract-ocr/langdata_lstm/raw/master/Latin.unicharset
# Get the GT4HistOCR data.
mkdir GT4HistOCR
cd GT4HistOCR
wget https://zenodo.org/record/1344132/files/GT4HistOCR.tar
# Unpack the data.
tar xf GT4HistOCR.tar
for f in corpus/*.tar.bz2; do echo "$f" && tar xjf "$f"; done
# Optionally remove the tar archives which are no longer needed.
rm -r GT4HistOCR.tar corpus
# Remove BOM from some GT texts.
perl -pi -e s/$'\xEF\xBB\xBF'//g $(find Kallimachos/149* -name "*.gt.txt"|xargs grep -rl $'\xEF\xBB\xBF')
# Link all ground truth texts into one directory (can take more than 60 min).
cd ../ground-truth
for t in $(find ../GT4HistOCR -name "*.txt"); do ln -sf $t $(echo $t|perl -pe "s:.*GT4HistOCR/[^/]*/::; s:/:_:g"); done
# Convert PNG images to TIFF images in one directory (can take more than 120 min).
for i in $(find ../GT4HistOCR -name "*.bin.png"); do echo convert $i $(echo $i|perl -pe "s:.*GT4HistOCR/[^/]*/::; s:/:_:g; s:bin.png:tif:"); done|sh -x
for i in $(find ../GT4HistOCR -name "*.nrm.png"); do echo convert $i $(echo $i|perl -pe "s:.*GT4HistOCR/[^/]*/::; s:/:_:g; s:nrm.png:tif:"); done|sh -x
# Now go back to the base directory. Training can start.
cd ../..
Training starts with generating box files, which also takes a lot of time.
Training from scratch was run in August 2019 with the latest Tesseract (Git master) and 10000, 100000, 300000 and 900000 iterations. The generated data is available online.
At iteration 9725/10000/10000, Mean rms=1.058%, delta=4.77%, char train=17.929%, word train=34.821%, skip ratio=0%, New worst char error = 17.929 wrote checkpoint.
Finished! Error rate = 17.396
[...]
real 83m31.333s
user 100m45.497s
sys 8m41.462s
At iteration 49622/100000/100001, Mean rms=0.395%, delta=1.033%, char train=3.344%, word train=9.289%, skip ratio=0%, wrote checkpoint.
Finished! Error rate = 3.041
[...]
real 366m7.096s
user 646m14.635s
sys 3m19.165s
At iteration 102549/300000/300002, Mean rms=0.311%, delta=0.799%, char train=2.364%, word train=6.189%, skip ratio=0%, wrote checkpoint.
Finished! Error rate = 1.747
[...]
real 780m7.610s
user 3746m24.974s
sys 68m52.158s
At iteration 237920/900000/900006, Mean rms=0.187%, delta=0.316%, char train=0.946%, word train=3.094%, skip ratio=0%, wrote checkpoint.
Finished! Error rate = 0.898
[...]
real 3200m2.210s
user 11396m12.393s
sys 164m28.908s
This training was run on two different machines, one using Tesseract 4.0.0 from Debian buster, one using Tesseract built from Git master. Despite the latest modifications to the tesstrain code, the two training runs showed very different results. Tesseract 4.0.0 reported lots of encoding failures (caused by unnormalized ground truth texts). The achieved CER values differ, but are in a similar range.
Intermediate values:
- CER 0.561 at 368648 iterations with Tesseract Git master (433 h CPU time)
- CER 0.606 at 254680 iterations with Tesseract 4.0.0 (350 h CPU time).
$ time make -r training MAX_ITERATIONS=2000000 MODEL_NAME=GT4HistOCR_2000000 RATIO_TRAIN=0.99
[...]
At iteration 395372/2000000/2000012, Mean rms=0.167%, delta=0.323%, char train=0.901%, word train=2.741%, skip ratio=0%, wrote checkpoint.
Finished! Error rate = 0.528
[...]
real 7697m12.388s
user 26882m51.379s
sys 353m35.842s
time make -r training MAX_ITERATIONS=2000000 MODEL_NAME=GT4HistOCR_2000000 RATIO_TRAIN=0.99
[...]
At iteration 357313/2000000/2004959, Mean rms=0.167%, delta=0.298%, char train=0.888%, word train=3.022%, skip ratio=0.3%, wrote checkpoint.
Finished! Error rate = 0.58
[...]
real 10690m43,457s
user 34225m45,698s
sys 396m18,729s
More training is currently (September/October 2019) running on a virtual machine in the bwCloud. It is based on Git master (7a7704bc94e1942ee10047970b6c93e4871b2cd8) which can directly handle the images from GT4HistOCR, so no conversion to TIFF is needed. Up to now, each of the training runs has consumed about 3200 hours of CPU time.
For this training, all known problems in the GT4HistOCR ground truth texts were fixed. In addition, upper case "J" in Roman numerals and before lower case consonants was replaced by "I".
make -r training MAX_ITERATIONS=5000000 MODEL_NAME=GT4HistOCR_5000000 RATIO_TRAIN=0.99
Current best CER: 0.70 % at iteration 354676
make -r training NET_SPEC=[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c###] MAX_ITERATIONS=5000000 MODEL_NAME=GT4HistOCR_5000000-2 RATIO_TRAIN=0.99
Current best CER: 0.95 % at iteration 147211
make -r training START_MODEL=Fraktur TESSDATA=/usr/local/share/tessdata/tessdata_best/script MAX_ITERATIONS=5000000 MODEL_NAME=Fraktur_5000000 RATIO_TRAIN=0.99
Current best CER: 0.529 % at iteration 170345
This is the best CER achieved so far.
Some OCR results for a double page from a historic newspaper (Deutscher Reichsanzeiger) which were produced with existing models from Google and new models from the training above are also available online.
The current best models from fine tuning of Tesseract's script/Fraktur model show a significantly better CER for selected samples. Using the new best model reduces the CER from 3.8 % to 2.4 %. The combination of all three best models achieves a CER of 2.2 %.
Starting with an existing traineddata model requires far fewer iterations to achieve low error rates. The following models are candidates as a starting point:
- eng.traineddata (English, mainly Antiqua but also some Fraktur)
- frk.traineddata (German, Fraktur and some Antiqua)
- script/Latin.traineddata (Western Europe, mainly Antiqua but also some Fraktur)
- script/Fraktur.traineddata (Western Europe, Fraktur and some Antiqua)
- There exist *.bin.png and *.nrm.png images. Why are there these two variants? All nrm.png files are 8-bit grayscale. So are most bin.png files; only a few of them are 1-bit grayscale.
- Broken image: dta19/1882-keller_sinngedicht/04970.nrm.png.
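The bit depth observation above can be reproduced with the `file` utility, which prints descriptions such as "8-bit grayscale" for PNG files. This is a minimal sketch. The function name is made up, and it assumes that `file` reports PNG images in its usual "PNG image data, W x H, TYPE, ..." form.

```shell
# Count PNG types below directory $1 for files matching the glob $2,
# using the description printed by file(1), e.g. "8-bit grayscale".
png_types() {
    find "$1" -type f -name "$2" -exec file {} + |
        sed -n 's/.*PNG image data, [^,]*, \([^,]*\),.*/\1/p' |
        sort | uniq -c
}
```

For example, `png_types data/GT4HistOCR '*.bin.png'` prints one line per image type together with the number of files of that type.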
unicharset_extractor --output_unicharset "data/foo/unicharset" --norm_mode 1 "data/foo/all-boxes"
Bad box coordinates in boxfile string! 0 0 1044 71 0
It looks like the box data belongs to Kallimachos/1497-StultiferaNauis-GW5056/00401.gt.txt (see image). That text file looks strange in the vi editor (S<feff>ãcta dei ſpernãt poſita decreta pauoꝛe:). Removing the <feff> fixes it, but then other similar bad box coordinates are found. So there exist several ground truth text files which cause similar errors. Here is a list of all files with <feff>:
modified: Kallimachos/1495-DasNeuNarrenschiff-GW5049/00470.gt.txt
modified: Kallimachos/1495-DasNeuNarrenschiff-GW5049/01255.gt.txt
modified: Kallimachos/1497-StultiferaNauis-GW5056/00134.gt.txt
modified: Kallimachos/1497-StultiferaNauis-GW5056/00401.gt.txt
modified: Kallimachos/1497-StultiferaNauis-GW5056/00475.gt.txt
modified: Kallimachos/1497-StultiferaNauis-GW5056/00593.gt.txt
modified: Kallimachos/1497-StultiferaNauis-GW5056/00808.gt.txt
modified: Kallimachos/1497-StultiferaNauis-GW5056/01311.gt.txt
modified: Kallimachos/1497-StultiferaNauis-GW5056/01323.gt.txt
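A list like the one above can be generated by searching for the three bytes of the UTF-8 encoded byte order mark (EF BB BF). This is a minimal sketch with a made-up function name. It only lists the affected files and does not modify them.

```shell
# List ground truth files below directory $1 that contain a UTF-8 encoded BOM (EF BB BF).
list_bom_files() {
    bom=$(printf '\357\273\277')
    find "$1" -type f -name '*.gt.txt' -exec grep -l -- "$bom" {} +
}
```

Run as `list_bom_files data/GT4HistOCR` from the tesstrain base directory. The search covers the whole line, not only its start, because the mark can also occur in the middle of a line as in the example shown.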
Tesseract fails to encode these ground truth texts:
- dta19/1853-rosenkranz_aesthetik/03221.gt.txt: ἑαυτῳ. Daß die aus der Einheit, Verſchiedenheit, Regula— (image)
- dta19/1879-vischer_auch02/03739.gt.txt: Ἰάϰχε, Ἰάϰχε! Wie blitzen ihre großen Augen! Noch (image)
This is caused by a mismatch between the Unicode characters in those ground truth texts and the generated unicharset. The Unicode characters in the ground truth are not normalized, and neither are those in the derived box and lstmf files, while those in the unicharset are normalized.
There are 771 unnormalized ground truth texts in the GT4HistOCR data set:
- dta19/1853-rosenkranz_aesthetik/03221.gt.txt
- dta19/1879-vischer_auch02/03739.gt.txt
- Kallimachos/1488-Heiligenleben-GWM11407/ (many)
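Unnormalized files can be located by comparing each line with its normalized form. The sketch below uses Perl's core module Unicode::Normalize and assumes that NFC is the relevant normalization form. Which form Tesseract applies internally is not stated here, so the resulting count may differ from the number given above. The function name is made up.

```shell
# List ground truth files below directory $1 with at least one line that is not in NFC.
unnormalized_gt() {
    find "$1" -type f -name '*.gt.txt' -exec perl -CSD -MUnicode::Normalize -ne '
        # Report the file once and skip to the next file at the first deviating line.
        if ($_ ne NFC($_)) { print "$ARGV\n"; close ARGV }
    ' {} +
}
```

`unnormalized_gt data/GT4HistOCR | wc -l` gives the number of affected files under this assumption.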
There is only an empty ground truth text file for dta19/1819-goerres_revolution/01305. Either remove that data or enter the missing text "Welt und Leben zu beherrschen wissen ; nicht Feldher⸗".
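Other empty ground truth files, if any, can be found with a simple scan. This is a minimal sketch with a made-up function name. It treats files that contain only white space as empty, which is an assumption that goes slightly beyond the case described above.

```shell
# List ground truth files below directory $1 that contain no visible text.
empty_gt() {
    find "$1" -type f -name '*.gt.txt' | while IFS= read -r f; do
        grep -q '[^[:space:]]' "$f" || echo "$f"
    done
}
```

Run it on the unpacked data, for example `empty_gt data/GT4HistOCR`, rather than on the directory with the symbolic links, because `-type f` skips links.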
Tesseract 4.0.0 does not create an lstmf file for 363 images when training runs with the default settings. Example: EarlyModernLatin/1668-Leviathan-Hobbes/00045. There is no error message, so the training silently ignores those ground truth files. Using PSM=13 allows building the missing files.
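Because the missing files are not reported, it helps to compare the line images with the generated lstmf files. The following is a minimal sketch. The function name is made up, and it assumes that images and lstmf files share the same directory and base name.

```shell
# Print every line image in directory $1 (extension $2, default tif)
# for which no .lstmf file with the same base name exists.
missing_lstmf() {
    ext=${2:-tif}
    for img in "$1"/*."$ext"; do
        [ -e "$img" ] || continue
        [ -e "${img%."$ext"}.lstmf" ] || echo "$img"
    done
}
```

For example, `missing_lstmf data/ground-truth | wc -l` counts the images that were skipped silently.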
Inherited characters are used for combining characters. They are used in some ground truth texts. Maybe Tesseract does not handle them correctly. At least it complains about a missing file Inherited.unicharset:
$ combine_lang_model --input_unicharset data/foo/unicharset --script_dir data --output_dir data --lang foo
Loaded unicharset of size 305 from file data/foo/unicharset
Setting unichar properties
Other case T᷑ of t᷑ is not in unicharset
Other case Õ of õ is not in unicharset
Other case Ẽ of ẽ is not in unicharset
[...]
Other case H̃ of h̃ is not in unicharset
Setting script properties
Failed to load script unicharset from:data/Inherited.unicharset
Warning: properties incomplete for index 14 = ꝙ
Warning: properties incomplete for index 23 = t᷑
Warning: properties incomplete for index 39 = qᷓ
Warning: properties incomplete for index 51 = ꝗ
Warning: properties incomplete for index 54 = ꝗᷓ
Warning: properties incomplete for index 57 = ꝶ
[...]
The transcription for EarlyModernLatin/1476-SpeculumNaturale-Beauvais/01737.bin.png contains an inherited character ̃ which can cause a wrong unicharset for Tesseract. It also looks like a wrong transcription, so removing that character is suggested.
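To see which inherited characters occur in the ground truth at all, the texts can be scanned with Perl's Unicode script property. This is a minimal sketch with a made-up function name. It assumes that the characters meant above are those with the Unicode script value "Inherited".

```shell
# Count the characters with Unicode script "Inherited" in all ground truth
# texts below directory $1, one output line per code point.
inherited_chars() {
    find "$1" -type f -name '*.gt.txt' -exec cat {} + |
        perl -CSD -ne 'printf "U+%04X\n", ord($_) for /\p{Script=Inherited}/g' |
        sort | uniq -c
}
```

`inherited_chars data/GT4HistOCR` prints each code point together with its number of occurrences, which can help to decide whether rare cases are transcription errors.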
- dta19/1879-vischer_auch02/03739.gt.txt contains an accent where the line image doesn't. The original image shows the accent, so this mismatch is a binarization artifact.
It might be possible to evaluate the existing ground truth by manual inspection of ground truth text lines which differ much from the corresponding OCR result.
dta uses — extensively as separator, for example at line endings. Others use - or ⸗. So currently the same glyph ⸗ gets at least three different transcriptions.
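How often each variant is used at line endings can be estimated by counting the ground truth files that end with the respective character. This is a minimal sketch with a made-up function name. It counts files rather than lines, assuming one text line per ground truth file.

```shell
# Count ground truth files below directory $1 that have a line ending with
# character $2 (optionally followed by white space).
count_line_end() {
    find "$1" -type f -name '*.gt.txt' -exec grep -l -- "$2[[:space:]]*\$" {} + | wc -l
}
```

Calling it per subdirectory, for example `count_line_end data/GT4HistOCR/dta19 '—'` and the same with `'-'` and `'⸗'`, shows which transcription each part of the data set prefers.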
The ground truth texts of GT4HistOCR are partially harmonized, for example all I were replaced by J. Are we sure that is required for OCR of Fraktur texts? It might be bad for those parts which use Antiqua letters where I and J are clearly distinct, and it is also unwanted for Roman numerals. Maybe this should be reverted.
GT4HistOCR was published using CC-BY 4.0 (see paper) or CC-BY-SA 4.0 (see README). It is unclear what this implies for any OCR model which was trained using that data set or for text which was recognized using such models.