Tesseract Legacy does not correctly identify abbreviations with periods #24

Open
Balearica opened this issue May 31, 2022 · 0 comments

Terms such as U.S., e.g. and i.e. are consistently misidentified by Tesseract Legacy, usually as US., eg. and ie., respectively. This appears to be because Tesseract's language model does not expect punctuation to occur within words. While certain exceptions are made (notably, apostrophes are allowed), a mid-word period does not appear to be among them.

https://github.com/tesseract-ocr/tesseract/blob/706d3bac62954212e5236d91b3bff8e91cf7a3cc/src/wordrec/language_model.cpp#L1043-L1047
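
To make the suspected behavior concrete, here is a toy model of the kind of rule described above. It is not Tesseract's code and does not reproduce the linked lines; the function name `has_disallowed_mid_word_punct` and the allow-list `ALLOWED_MID_WORD` are hypothetical names made up for illustration. The sketch assumes a check that rejects any punctuation sitting between two alphanumeric characters unless it is explicitly allowed.

```python
import string

# Hypothetical allow-list (an assumption for illustration, not taken from
# Tesseract): punctuation tolerated between two alphanumeric characters.
ALLOWED_MID_WORD = {"'"}


def has_disallowed_mid_word_punct(word, allowed=ALLOWED_MID_WORD):
    """Return True if punctuation outside `allowed` sits between two alphanumerics."""
    # Only interior positions can have an alphanumeric neighbor on both sides.
    for i in range(1, len(word) - 1):
        ch = word[i]
        if (
            ch in string.punctuation
            and ch not in allowed
            and word[i - 1].isalnum()
            and word[i + 1].isalnum()
        ):
            return True
    return False


# Correct spellings alongside the misreadings reported in this issue.
candidates = ["U.S.", "US.", "e.g.", "eg.", "i.e.", "ie.", "don't"]
flagged = {w: has_disallowed_mid_word_punct(w) for w in candidates}

for word, is_flagged in flagged.items():
    print(f"{word!r}: {'penalized' if is_flagged else 'accepted'}")
```

Under this assumed rule, the correct forms U.S., e.g. and i.e. are the ones that get penalized, while the misread forms US., eg. and ie. pass because their only period is word-final. That would be consistent with the output described above, although the actual scoring in Tesseract is more involved than a single yes/no check.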

@Balearica Balearica transferred this issue from scribeocr/scribeocr Nov 26, 2024