Tesseract doesn't always recognise diacritics #4276

Open
arsinclair opened this issue Jun 30, 2024 · 3 comments
arsinclair commented Jun 30, 2024

Current Behavior

I'm using Tesseract indirectly as part of OCRmyPDF and I'm coming here from this issue.

When OCR'ing English (Latin) text with diacritics it doesn't always recognise them. The diacritics in my document are part of surnames originating from Hungary and Belgium.

I've tried with English only, with the English + Hungarian dictionaries, and with the Latin script model (which has an extended character map), all to no avail.

The words: poéme, pathétique, animé are recognised.

The words: Ysaÿe, Jenő, Petőfi, etc. are not recognised.

The words csárdás, Telmányi, Dvořák are recognised only with Latin script.

Expected Behavior

The diacritics should be recognised.

Source files

000001_ocr
000004_ocr
000005_ocr
000002_ocr
000003_ocr

tesseract -v

tesseract 5.3.4
leptonica-1.82.0
libgif 5.2.2 : libjpeg 6b (libjpeg-turbo 2.1.5) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.3 : libwebp 1.4.0 : libopenjp2 2.5.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.7.2 zlib/1.3.1 liblzma/5.4.5 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5
Found libcurl/8.8.0 OpenSSL/3.2.2 zlib/1.3.1 brotli/1.1.0 zstd/1.5.5 libidn2/2.3.7 libpsl/0.21.2 libssh2/1.11.0 nghttp2/1.62.1 librtmp/2.3 OpenLDAP/2.5.18

Operating System

Debian Testing (Bookworm)

uname -a

Linux jrm-ws 6.8.12-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.8.12-1 (2024-05-31) x86_64 GNU/Linux

Member

stweil commented Jun 30, 2024

eng.traineddata was not trained with diacritics (see https://github.com/tesseract-ocr/langdata_lstm/blob/main/eng/eng.unicharset) and therefore cannot recognize them.

Latin.traineddata was trained with some diacritics (see https://github.com/tesseract-ocr/langdata_lstm/blob/main/script/Latin/Latin.unicharset) and therefore works better with your text. As far as I can see, "ő" is missing from its supported characters.

So your results are expected with the given models, and it's not a Tesseract issue.
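The unicharset check described above can be sketched in Python. This is a minimal illustration, not part of Tesseract: the function names are hypothetical, the sample data is made up (it is not the real Latin.unicharset), and it assumes the file layout is a count line followed by one entry per line, with the character as the first space-separated field.

```python
import unicodedata


def load_unichars(unicharset_text):
    """Return the set of unichar strings listed in a unicharset file.

    Assumed layout: the first line is the entry count, and every following
    line starts with the unichar, followed by space-separated properties.
    """
    entries = unicharset_text.splitlines()[1:]  # skip the count line
    return {line.split(" ", 1)[0] for line in entries if line.strip()}


def missing_chars(text, unichars):
    """Return the non-space characters of `text` that the unicharset lacks."""
    # Normalise to NFC so "o" + combining double acute compares as "ő".
    text = unicodedata.normalize("NFC", text)
    return sorted({ch for ch in text if not ch.isspace() and ch not in unichars})


# Tiny made-up sample for illustration only; NOT the real Latin.unicharset.
SAMPLE = "\n".join([
    "5",
    "NULL 0 Common 0",
    "a 3 Latin 1",
    "á 3 Latin 2",
    "o 3 Latin 3",
    "ř 3 Latin 4",
])

if __name__ == "__main__":
    chars = load_unichars(SAMPLE)
    print(missing_chars("áoő", chars))  # characters the sample set lacks
```

To run it against a real model, read the downloaded file with `open(path, encoding="utf-8").read()` and pass the words from your document as `text`. Entries that span several code points would need a smarter comparison than this per-character one.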

"ő" is included in hun.traineddata, so you could try Latin+hun, but training a new model would be better.
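Combining models as suggested could look like the sketch below. The helper name and the file names `page.png` / `page` are placeholders. The exact model name is an assumption too: depending on where Latin.traineddata is installed, it may have to be given as `script/Latin` rather than `Latin`.

```python
import shlex


def tesseract_command(image, output_base, languages):
    """Build a `tesseract` invocation that loads several models at once."""
    if not languages:
        raise ValueError("at least one language/script model is required")
    # Tesseract accepts multiple models joined with '+' in the -l option.
    return ["tesseract", str(image), str(output_base), "-l", "+".join(languages)]


if __name__ == "__main__":
    cmd = tesseract_command("page.png", "page", ["Latin", "hun"])
    print(shlex.join(cmd))  # prints: tesseract page.png page -l Latin+hun
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would run the OCR, provided both traineddata files are installed.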

Author

arsinclair commented Jul 1, 2024

> "ő" is included in hun.traineddata, so you could try Latin+hun, but training a new model would be better.

I tried Hungarian and Latin together too, and it didn't always work. If training a new model is the only way forward, will I have to do it myself, or can the missing characters be added to the existing Tesseract models?


nkrot commented Sep 3, 2024

@arsinclair, your question is a bit vague.

You can always take an official model (the eng.traineddata file) and fine-tune it for new characters. You can then tell Tesseract to use your model instead of the standard one. This is what I do.
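Pointing Tesseract at a custom model could be sketched as follows. The helper name, the directory, and the model name `eng_diacritics` are hypothetical. The sketch assumes the fine-tuned file is stored as `<model>.traineddata` in a directory of your choosing, which is passed through the `--tessdata-dir` option.

```python
from pathlib import Path


def custom_model_command(image, output_base, tessdata_dir, model):
    """Build a `tesseract` call that loads <model>.traineddata from tessdata_dir."""
    model_file = Path(tessdata_dir) / f"{model}.traineddata"
    if not model_file.is_file():
        # Fail early with a clear message instead of letting tesseract error out.
        raise FileNotFoundError(f"model not found: {model_file}")
    return [
        "tesseract", str(image), str(output_base),
        "--tessdata-dir", str(tessdata_dir),
        "-l", model,
    ]
```

For example, `custom_model_command("page.png", "page", "/path/to/my_tessdata", "eng_diacritics")` returns a list that can be handed to `subprocess.run(..., check=True)`.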

If you want to publish your model for the world, you will have to talk to the Tesseract maintainers.

Tesseract for English is aware of only a few diacritics. I find the choice of which ones weird, but this is the reality.
