English traineddata file does not contain the '±' character? #48

Open
Furtifk opened this issue Oct 26, 2022 · 7 comments

Comments

@Furtifk
Contributor

Furtifk commented Oct 26, 2022

English traineddata file does not contain the '±' character?

Environment
Tesseract Version: 5.00
Downloaded from: https://github.com/UB-Mannheim/tesseract/wiki
Platform: Windows 10 64bit

I am trying to run OCR using the English traineddata file found at:
https://tesseract-ocr.github.io/tessdoc/Data-Files
I notice the character is not included here either:
https://github.com/tesseract-ocr/langdata_lstm/blob/main/eng/eng.unicharset

Are there any plans to add it? Are there any language files that contain this character, or that can successfully OCR it?

Many thanks to whoever can assist here. I am attaching the file I used to test this behavior for this character here: (https://github.com/tesseract-ocr/langdata_lstm/files/9870674/Special.Symbols.pdf)
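
For anyone wanting to repeat the unicharset check described above, here is a minimal Python sketch. It assumes the usual unicharset layout: the first line holds the entry count, and every later line starts with the symbol followed by space-separated properties. The `SAMPLE` text and the function name `unicharset_symbols` are made up for illustration and are not the real eng.unicharset.

```python
from typing import Set


def unicharset_symbols(text: str) -> Set[str]:
    """Return the symbols listed in unicharset-formatted text."""
    symbols = set()
    # Skip the first line (entry count); each remaining line starts with the symbol.
    for line in text.splitlines()[1:]:
        first_field = line.split(" ")[0]
        if first_field:
            symbols.add(first_field)
    return symbols


# Hypothetical, simplified sample (assumption): three entries, none of them '±'.
SAMPLE = (
    "3\n"
    "NULL 0 Common 0\n"
    "a 3 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 a\n"
    "+ 0 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 +\n"
)

symbols = unicharset_symbols(SAMPLE)
print("contains ±:", "±" in symbols)
```

To run it against a real file, read the downloaded unicharset with `open(path, encoding="utf-8").read()` and pass the text to the function.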

@amitdo

amitdo commented Oct 26, 2022

> Are there any plans to add it?

The best/fast models were uploaded 5 years ago. AFAIK, no one is working on updating them.

@Furtifk
Contributor Author

Furtifk commented Oct 26, 2022

Thanks for the information and the fast reply. Would you know of any workaround I could use to OCR this character?

Many thanks ahead of time ^^

@stweil
Member

stweil commented Oct 26, 2022

The official script/Latin model includes ±. You could also try any of my models from https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/, for example https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021_09/tessdata_fast/frak2021-09.traineddata.

@Furtifk
Contributor Author

Furtifk commented Oct 26, 2022

> The official script/Latin model includes ±. You could also try any of my models from https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/, for example https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/frak2021_09/tessdata_fast/frak2021-09.traineddata.

Thanks a lot. I will try this and let you know here if it does indeed work for us going forward.

@Furtifk
Contributor Author

Furtifk commented Oct 27, 2022

After further testing, it appears that both lat.traineddata (https://tesseract-ocr.github.io/tessdoc/Data-Files) and your own model struggle to recognize this character in my example.
Is lat.traineddata, as linked above, the Latin model you meant? If not, where could I find and download the right one to try it?

Many thanks!

@stweil
Member

stweil commented Oct 27, 2022

lat.traineddata is a different model. script/Latin is in https://github.com/tesseract-ocr/tessdata_fast/tree/main/script. Or simply re-run the installer and select it there for installation.
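
As a rough sketch of how the script model might be selected once installed, the Python snippet below only builds the command line and does not run Tesseract. It assumes Latin.traineddata sits in a `script` subfolder of the tessdata directory, so that `-l script/Latin` resolves to it. The file names `page.png` and `page` and the helper `build_tesseract_command` are hypothetical.

```python
import shlex
from typing import List, Optional


def build_tesseract_command(
    image: str,
    output_base: str,
    lang: str = "script/Latin",
    tessdata_dir: Optional[str] = None,
) -> List[str]:
    """Build a tesseract CLI call that selects a model with -l."""
    cmd = ["tesseract", image, output_base, "-l", lang]
    if tessdata_dir is not None:
        # Point Tesseract at a non-default tessdata location.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


# Hypothetical input image and output base name (assumptions).
cmd = build_tesseract_command("page.png", "page")
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would perform the actual OCR, provided the `tesseract` executable is on the PATH.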

@Furtifk
Contributor Author

Furtifk commented Oct 27, 2022

Thanks for the link. I have tried the Latin.traineddata model, but I'm still not having much luck recognizing this character in either the test file or our internal files.
I'm guessing there's not much else that can be done here? Thanks for the help and suggestions nonetheless.
