Training Fraktur
Tesseract comes with several models which are specialized for Fraktur texts.
- dan_frak (Danish Fraktur)
- deu_frak (German Fraktur)
- slk_frak (Slovakian Fraktur)
- frk (German Fraktur)
- script/Fraktur
Those models also support historic and modern Antiqua scripts.
Some other models are primarily made for modern Antiqua scripts, but also have a very limited ability to recognize Fraktur and historic Antiqua.
- deu (German)
- script/Latin
None of the above models is really good as a general model for Fraktur and historic Antiqua texts because each of them has specific problems.
dan_frak, deu_frak and slk_frak are language specific, so they only support a limited set of characters.
They can only be used with the old legacy recognizer, not with the newer LSTM (neural network) recognizer.
Typically (not always!) the results from the legacy recognizer are worse than those from the LSTM recognizer.
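The choice between the two recognizers is made on the Tesseract command line with the --oem option (0 selects the legacy engine, 1 the LSTM engine). The following is a minimal sketch of how a script might pick the engine based on the model name. The helper name tesseract_command and the file names are hypothetical, and the set of legacy-only models simply mirrors the three models named above.

```python
# Models that, according to the text above, only work with the legacy recognizer.
LEGACY_ONLY_MODELS = {"dan_frak", "deu_frak", "slk_frak"}


def tesseract_command(image, output_base, model):
    """Build a tesseract command line that selects a matching OCR engine mode.

    --oem 0 selects the legacy recognizer, --oem 1 the LSTM recognizer.
    Treating every other model as an LSTM model is an assumption of this sketch.
    """
    oem = "0" if model in LEGACY_ONLY_MODELS else "1"
    return ["tesseract", image, output_base, "-l", model, "--oem", oem]


# Example: build the command for a page image (file names are placeholders).
command = tesseract_command("page.png", "page", "deu_frak")
# To actually run it, pass the list to subprocess.run(command, check=True).
```

Building the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.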
frk supports the German character set, but important characters such as § are missing and will never be recognized. In addition, some ligatures like ch and ck were trained wrongly and will therefore be recognized as < and >.
script/Fraktur supports a larger international character set, but otherwise has the same issues as frk.
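Whether a missing character such as § matters for a given text can be checked before choosing a model. The sketch below lists the characters of a text that are not in a model's character set. The function name unsupported_characters and the sample character set are hypothetical, and extracting the real character set from a model file is outside the scope of this example.

```python
def unsupported_characters(text, charset):
    """Return the sorted list of non-whitespace characters in text
    that are not contained in charset."""
    return sorted({ch for ch in text if not ch.isspace() and ch not in charset})


# A deliberately tiny, made-up character set without the section sign.
sample_charset = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,")
missing = unsupported_characters("Nach § 12 Abs. 3", sample_charset)
```

A non-empty result means the model can never produce those characters, no matter how good the image is.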
So to summarize, other models are needed for Fraktur and historic Antiqua. Such models can be trained either from scratch or based on one of the existing standard models.
This is a collection of sources for training OCR models which can be used to recognize Fraktur. A more complete list which is not restricted to Fraktur only can be found at https://github.com/cneud/ocr-gt.
Austrian Newspapers is a ground truth data set created from Austrian newspapers by the Austrian National Library (Österreichische Nationalbibliothek). See https://github.com/tesseract-ocr/tesstrain/wiki/AustrianNewspapers for more information.
GT4HistOCR is ground truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin. See https://github.com/tesseract-ocr/tesstrain/wiki/GT4HistOCR for more information.
https://github.com/jze/ocropus-model_fraktur provides ground truth data, 3852 lines for training and 414 lines for testing, both of good quality.
- Some umlauts might be replaced by aͤ, oͤ, uͤ.
- It uses the minus sign instead of ⸗.
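When this data is mixed with other ground truth, the two transcription differences listed above may need to be harmonized first. The following is a minimal sketch, not part of the data set or of tesstrain: the function name normalize_line and its flags are hypothetical, it assumes that "minus sign" means the ASCII hyphen-minus (U+002D), and which direction to normalize in depends on the transcription rules of the target model.

```python
import re
import unicodedata

COMBINING_SMALL_E = "\u0364"      # the small e written above a, o, u
COMBINING_DIAERESIS = "\u0308"    # the two dots of a modern umlaut
DOUBLE_OBLIQUE_HYPHEN = "\u2e17"  # the Fraktur hyphen


def normalize_line(line, modern_umlauts=True, fraktur_hyphen=False):
    """Harmonize one ground truth line.

    modern_umlauts: rewrite a, o, u (either case) followed by a combining
        small e as the modern umlaut and compose the result with NFC.
        Note that NFC is applied to the whole line.
    fraktur_hyphen: replace every hyphen-minus by the double oblique hyphen.
    """
    if modern_umlauts:
        line = re.sub(
            "([aouAOU])" + COMBINING_SMALL_E,
            r"\1" + COMBINING_DIAERESIS,
            line,
        )
        line = unicodedata.normalize("NFC", line)
    if fraktur_hyphen:
        line = line.replace("-", DOUBLE_OBLIQUE_HYPHEN)
    return line
```

Both steps are optional because a model that is meant to reproduce the historic spelling would keep the combining e, while a model with modernized output would keep the plain hyphen.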