Tesseract Legacy does not correctly identify abbreviations with periods #24

Balearica · 2022-05-31T00:05:30Z

Terms such as U.S., e.g., and i.e. are consistently misrecognized by Tesseract Legacy, usually as US., eg., and ie., respectively. This appears to be because Tesseract's language model does not expect punctuation to occur within words. Certain exceptions exist (notably, apostrophes are allowed), but a mid-word period does not appear to be among them.

https://github.com/tesseract-ocr/tesseract/blob/706d3bac62954212e5236d91b3bff8e91cf7a3cc/src/wordrec/language_model.cpp#L1043-L1047
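To illustrate the suspected mechanism, here is a toy sketch in Python. It is not Tesseract's code: the function names, the penalty value, and the recognition costs are all made up for illustration. It only shows how a rule that penalizes punctuation inside a word (with an apostrophe exception) could make "US." beat "U.S." even when the classifier slightly prefers the latter.

```python
import string

# Hypothetical exception list: per the issue text, apostrophes are tolerated.
ALLOWED_MID_WORD = {"'"}


def mid_word_punct_count(word: str) -> int:
    """Count punctuation marks with a letter or digit on both sides of them
    (anywhere earlier and anywhere later in the word)."""
    count = 0
    for i, ch in enumerate(word):
        if ch in string.punctuation and ch not in ALLOWED_MID_WORD:
            has_alnum_before = any(c.isalnum() for c in word[:i])
            has_alnum_after = any(c.isalnum() for c in word[i + 1:])
            if has_alnum_before and has_alnum_after:
                count += 1
    return count


def toy_score(word: str, recognition_cost: float, penalty: float) -> float:
    """Lower is better: classifier cost plus an assumed mid-word penalty."""
    return recognition_cost + penalty * mid_word_punct_count(word)


def pick_best(candidates: dict, penalty: float = 1.0) -> str:
    """Return the candidate word with the lowest toy score."""
    return min(candidates, key=lambda w: toy_score(w, candidates[w], penalty))


# Made-up costs: the classifier slightly prefers the correct reading.
candidates = {"U.S.": 0.2, "US.": 0.5}
best_with_penalty = pick_best(candidates, penalty=1.0)
best_without_penalty = pick_best(candidates, penalty=0.0)
```

With the assumed penalty the mid-word period costs more than the classifier's preference is worth, so the period-free reading is chosen; with the penalty set to zero the correct reading is kept. Whether Tesseract's real scoring behaves this way at the linked lines is exactly what this issue is speculating about.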

Balearica transferred this issue from scribeocr/scribeocr Nov 26, 2024
