Tesseract inserting additional alternative characters #1465

Open
jghare opened this issue Apr 10, 2018 · 18 comments

Comments

@jghare

jghare commented Apr 10, 2018

Environment

  • Tesseract Version: 3.x stable and 4.0 alpha/beta, command line, for English-language text (using the Fast and Best trained data)

  • Platform: Windows (64-bit) and Linux (Ubuntu/CentOS)

Current Behavior:

All versions of Tesseract mentioned above tend to insert additional alternative characters, probably whenever the engine is not very confident. For example, if there is a "#" in the image file, it often outputs "#H", "A#", or even "AH": two characters for one. Another example: if there is a "$" in the image, it gives "S$", "$s", etc. This happens very often for other characters such as 0, O, !, %, and ^.
My application is very sensitive to length of the string hence an extra character throws many things off.
I am currently a command-line user and may later use it in Java whenever a wrapper for 4.0 becomes available.

Expected Behavior:

I expect Tesseract to output exactly one character for each character in the image. I should be able to control this behaviour using command-line parameters (assuming there isn't one yet). I have looked into the parameters, but there are hundreds of them and most are not self-explanatory, hence raising this as an issue. Also is it possible to get a "Character-level" HOCR output - current one is at word level granularity.

Suggested Fix:

@Shreeshrii
Collaborator

Shreeshrii commented Apr 10, 2018 via email

@jghare
Author

jghare commented Apr 10, 2018

Hi Shreeshrii
When I say English I mean the English alphabet and special characters. The words themselves are not dictionary words and are cryptic and long sequences of mixed characters... 20-30 characters long.
The 4.0 alpha and beta give me far better OCR results on my images than the legacy engine. Is there no way to tell Tesseract 4.0 not to insert extra alternatives?
It would also be good to be able to give it a whitelist of characters. I see that issue is also open.

@mkrolready

Please fix this. It's a big problem.

@zdenop
Contributor

zdenop commented Apr 20, 2018

If it is a big problem, then provide a use case. A description in words is difficult to test, and developers are forced to spend time working out what your problem is instead of solving problems.

@vidiecan

#1011 might be related

@Shreeshrii
Collaborator

There are a number of issues regarding this, for different languages etc. Listing them below.

Incorrect recognotion of specific words - additional letters inserted #1011

tesseract add similar characters in Japanese text (ambiguity management?) #1063

German - Characters added to result multiple times (aä / AÄ) #1060

Tesseract LSTM 4.0: letters repeat in recognized text #884

@Shreeshrii
Collaborator

Possibly related:

recognizes more characters than present #1362

@talentoscope

talentoscope commented Sep 16, 2018

This is still present in the latest master branch. It seems to happen after retraining (finetuning) the original tessdata files - in my case eng - and appears to be a result of ambiguous output from the LSTM, where it is providing more than one character for a bounding box (or at least that's how it appears without actually checking) - i.e. it is giving its possible or "unconfident" characters as well. More training does seem to balance this out slightly, but it's very hit or miss.

@amitdo
Collaborator

amitdo commented Sep 17, 2018

When I say English I mean the English alphabet and special characters. The words themselves are not dictionary words and are cryptic and long sequences of mixed characters... 20-30 characters long.

In that case try to disable the dictionary.
Also try to fine tune the model.

@Shreeshrii
Collaborator

Also is it possible to get a "Character-level" HOCR output - current one is at word level granularity.

Yes. It is. Use -c hocr_char_boxes=1 hocr in your command line. Output is of the format:

<span class='ocrx_word' id='word_1_1' title='bbox 16 18 206 71; x_wconf 42'>
             <span class='ocrx_cinfo' title='x_bboxes 16 19 42 71; x_conf 99.041275'>B</span>
             <span class='ocrx_cinfo' title='x_bboxes 49 20 76 71; x_conf 99.038635'>A</span>
             <span class='ocrx_cinfo' title='x_bboxes 84 19 107 70; x_conf 98.950821'>S</span>
             <span class='ocrx_cinfo' title='x_bboxes 117 19 139 69; x_conf 91.848969'>O</span>
             <span class='ocrx_cinfo' title='x_bboxes 148 19 174 70; x_conf 99.027092'>B</span>
             <span class='ocrx_cinfo' title='x_bboxes 181 18 206 69; x_conf 98.989304'>C</span>
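
For an application that is sensitive to string length, the character-level spans can be read back into per-character records. The following is a minimal Python sketch that assumes the output looks exactly like the sample above (single-quoted attributes, `x_bboxes` followed by `x_conf`); the function name `parse_char_boxes` is made up for illustration, and a real hOCR file may be better handled with a proper HTML parser.

```python
import html
import re

# Matches one character-level span in the layout shown in the sample above.
CINFO_RE = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+); "
    r"x_conf ([0-9.]+)'>(.*?)</span>"
)

# Two spans copied from the sample output.
SAMPLE_HOCR = """
<span class='ocrx_cinfo' title='x_bboxes 16 19 42 71; x_conf 99.041275'>B</span>
<span class='ocrx_cinfo' title='x_bboxes 49 20 76 71; x_conf 99.038635'>A</span>
"""


def parse_char_boxes(hocr_fragment):
    """Return a list of (char, (x0, y0, x1, y1), confidence) tuples."""
    chars = []
    for m in CINFO_RE.finditer(hocr_fragment):
        bbox = tuple(int(v) for v in m.group(1, 2, 3, 4))
        # Characters such as & or < are escaped in hOCR, so unescape them.
        chars.append((html.unescape(m.group(6)), bbox, float(m.group(5))))
    return chars
```

With records like these, the number of recognized characters and the confidence of each one can be checked before the string is passed on.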

@eravallirao

Hi,
I tried to use it, but it is not working for me. Any idea?

@Togame-san

C:\Program Files (x86)\Tesseract-OCR>tesseract testImage.PNG out -l check -c hocr_char_boxes=1 hocr
Could not set option: hocr_char_boxes=1
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
OSD: Weak margin (0.63), horiz textlines, not CJK: Don't rotate.
Detected 3 diacritics

It looks like this config option is no longer there. I want character-level output, but it does not seem possible.

@stweil
Member

stweil commented Nov 9, 2020

@jghare, can you provide some simple images which show this issue? That would help testing new code which tries to fix it.

@Petru-design

Petru-design commented Feb 16, 2021

@stweil I encounter this issue on a nearly daily basis, in about 5% of cases (around 2 per day on average). I will try to save the problematic files and the settings if that will be helpful to you.
In the meantime, here are some examples (called from Python):

stelnum

--

file: stelnum -> text: 1C4BUOOOOKPJ60479 -> extracted text: 1Cc4dBUOOOOKPJ60479 -> config: --oem 1 --psm 1 -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQSRTUVWXYZ1234567890 lang=dan

--

regnumber

file: regnumber -> text: AJ38906 -> extracted text: AIJ38906 -> config: --oem 1 --psm 1 -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQSRTUVWXY lang=dan

--

The images are attached in an archive; hopefully they will prove useful as test material. Please tell me if you would like me to send more.
images.zip
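
To make reports like the two above easier to compare, the inserted characters can be located automatically when the true text is known. This is a small sketch using Python's standard `difflib`; the helper name `inserted_chars` is hypothetical, and the alignment is a generic diff, not anything Tesseract produces.

```python
from difflib import SequenceMatcher


def inserted_chars(expected, extracted):
    """Return (index_in_extracted, text) for every run of extra characters."""
    matcher = SequenceMatcher(None, expected, extracted, autojunk=False)
    return [
        (j1, extracted[j1:j2])
        for tag, _i1, _i2, j1, j2 in matcher.get_opcodes()
        if tag == "insert"
    ]


# The two examples from this comment.
REG_EXTRA = inserted_chars("AJ38906", "AIJ38906")
VIN_EXTRA = inserted_chars("1C4BUOOOOKPJ60479", "1Cc4dBUOOOOKPJ60479")
```

For the registration number this reports the extra "I" at index 1, and for the longer string the extra "c" and "d" at indices 2 and 4 of the extracted text.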

--

@amitdo
Collaborator

amitdo commented Feb 16, 2021

For random sequences of characters you'll need to:

  1. Disable the dictionary.
  2. Fine tune the eng model with similar images.

If you have questions about fine tuning, use the forum.
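
As a rough illustration of step 1, the sketch below builds a command line that turns off the word lists through the `load_system_dawg` and `load_freq_dawg` config variables. Whether these two variables remove all dictionary influence for a given LSTM model and Tesseract version is an assumption to verify locally; the function name `build_tesseract_args` and its parameters are made up for this example.

```python
def build_tesseract_args(image_path, output_base, lang="eng", whitelist=None):
    """Build a tesseract argument list with the dictionaries switched off."""
    args = [
        "tesseract", image_path, output_base,
        "-l", lang,
        "--oem", "1",
        # Assumption: these two variables disable the system and
        # frequent-word dictionaries for the model in use.
        "-c", "load_system_dawg=0",
        "-c", "load_freq_dawg=0",
    ]
    if whitelist:
        args += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return args
```

The returned list can be passed to `subprocess.run(args, check=True)` from Python, which avoids shell quoting problems with whitelist strings.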

@bertsky
Contributor

bertsky commented Mar 16, 2021

@Shreeshrii

There are a number of issues regarding this, for different languages etc. Listing them below.

Incorrect recognotion of specific words - additional letters inserted #1011

tesseract add similar characters in Japanese text (ambiguity management?) #1063

German - Characters added to result multiple times (aä / AÄ) #1060

Tesseract LSTM 4.0: letters repeat in recognized text #884

recognizes more characters than present #1362

allow me to add #1465 and #2738.

My diagnosis for this bug was that it is specific to the Tesseract CTC implementation (with its NodeContinuation trick conflating paths to avoid the combinatorial explosion but creating an additional ambiguity of two adjacent nulls). I called these fake CTC duplicates diplopia. Someone definitely needs to work on this.
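
For readers unfamiliar with CTC, here is a minimal sketch of the textbook best-path collapse rule (merge repeated labels, then drop nulls). It is not Tesseract's decoder and does not model the NodeContinuation handling described above; it only shows why a brief second label near a glyph ends up as an extra output character.

```python
from itertools import groupby

NULL = None  # stands in for the CTC null (blank) label


def ctc_collapse(frame_labels):
    """Textbook CTC collapse: merge adjacent repeats, then remove nulls."""
    merged = [label for label, _group in groupby(frame_labels)]
    return "".join(label for label in merged if label is not NULL)


# One glyph seen over several timesteps collapses to one character.
ONE_GLYPH = ctc_collapse(["#", "#", NULL, NULL])
# If an alternative label fires briefly next to it, both survive the collapse.
DOUBLE_VISION = ctc_collapse(["#", NULL, "H", NULL])
# A genuine double letter needs a null between its two runs.
REAL_DOUBLE = ctc_collapse(["O", "O", NULL, "O"])
```

In this simplified picture, the first sequence yields "#", the second yields "#H", and the third yields "OO", so a decoder that loses track of nulls between paths can turn one glyph into two characters.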

@seltix5

seltix5 commented Feb 7, 2022

Hello,
I have this problem too. Any idea how I can help fix it?
Here is a simple example:
191 14K
OCR result: 1921.14K
Judging from other tests, the problem is probably in the 9, because sometimes I get wrong results with 2s instead of 9s. In this case I got both.
I'm using this .NET wrapper (https://github.com/charlesw/tesseract/tree/feature/321-Tesseract-4), but I built and updated the Tesseract and Leptonica DLLs to the latest ones (leptonica-1.83.0 and tesseract50), using the "best" eng.traineddata and the character whitelist "0123456789.,KMB".

@woodjohndavid

I have just created pull request #4211 which I consider to be an improved solution for diplopia.

I encourage everyone on this trail to try this out and test it with as broad a range of cases as possible.

Note by the way, there are some new configuration values that can only be set in code as things stand. These configuration values are:

bool kRemoveDiplopia - if true, enables diplopia removal functionality. If false, my changes have no effect
int kMaxDiplopiaGap - maximum number of timesteps apart to be considered diplopia, default 2

Obviously if my diplopia change is of value, then these configuration items should be made into settings.
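
To give a feel for what a switch and a timestep gap of this kind could control, here is a purely illustrative Python sketch. It is not the code from the pull request and makes no claim about how that change works: the function, its inputs, and the rule "keep the more confident of two different characters emitted close together" are all assumptions; only the default gap of 2 and the on/off switch mirror the values described above.

```python
def remove_diplopia(emissions, enabled=True, max_gap=2):
    """Hypothetical filter over (timestep, char, confidence) emissions.

    Two different characters emitted within max_gap timesteps of each other
    are treated as two readings of one glyph; the more confident one is kept.
    """
    if not enabled:
        return list(emissions)
    kept = []
    for step, char, conf in emissions:
        if kept and step - kept[-1][0] <= max_gap and char != kept[-1][1]:
            # Assumed rule: replace the previous reading only if this one
            # is more confident, otherwise drop this reading.
            if conf > kept[-1][2]:
                kept[-1] = (step, char, conf)
            continue
        kept.append((step, char, conf))
    return kept


EXAMPLE = [(0, "#", 0.9), (1, "H", 0.4), (10, "A", 0.95)]
FILTERED = remove_diplopia(EXAMPLE)
```

A rule this simple would also merge genuinely narrow neighbouring characters, which is one reason such thresholds are better exposed as settings and tested on a broad range of images.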
