Overlapping Character Boundingboxes #2825

Open
RicketyRick opened this issue Dec 18, 2019 · 4 comments
Comments

@RicketyRick

Hi all, I hope you are having a joyful Christmas time.

Tesseract version: 4.* and 5.alpha.*
Platform: Windows
Command line: tesseract .\billion.png out -l eng -c hocr_char_boxes=1 makebox hocr pdf
On this image:
[image: billion]

Result in box file:
B 210 18 218 48 0
i 210 18 234 47 0
l 237 18 258 48 0
l 259 18 269 48 0
i 270 18 280 48 0
o 282 18 303 41 0
n 305 18 327 41 0

Same in hocr:
<span class='ocrx_cinfo' title='x_bboxes 210 22 218 52; x_conf 99.543304'>B</span> <span class='ocrx_cinfo' title='x_bboxes 210 23 234 52; x_conf 99.536743'>i</span>

In PDF:
Open it in a PDF Viewer like Acrobat and mark "thousan".
[image: marked]

Then press Ctrl-C and paste it into an editor with Ctrl-V. The result is "thousand".

Comment: We see this in German as well, always using the "best" or "fast" traineddata. Larger files contain many cases like the one above, but I wanted to keep the report as simple as possible.
We could also reproduce this when using the API directly, so I think the cause might lie deep in the system.
Versions 4.* and 5.* differ in their output: in 4.*, the letters that overlap are sometimes different from those in 5.*.

Thank you for all your work. Have a great Christmas time and a happy new year!

@RicketyRick
Author

#1192
#1712
#1015
#2024

@woodjohndavid

Refer also to the following thread:

#2738

I attempted a workaround that filters out duplicate characters by using the character-level box dimensions to identify overlaps, but it did not work because the box dimensions are invalid.
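
For reference, an overlap check of this kind might look like the following minimal Python sketch. It assumes the usual box file layout (character, left, bottom, right, top, page) and reuses the box lines from the report above. The names `CharBox`, `parse_box_line`, and `find_overlaps` are hypothetical helpers for illustration, not part of Tesseract.

```python
from typing import List, NamedTuple, Tuple


class CharBox(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> CharBox:
    # Assumed box file layout: <char> <left> <bottom> <right> <top> <page>
    char, left, bottom, right, top, page = line.rsplit(" ", 5)
    return CharBox(char, int(left), int(bottom), int(right), int(top), int(page))


def horizontal_overlap(a: CharBox, b: CharBox) -> int:
    # Width in pixels shared by the two boxes on the x-axis (0 if disjoint).
    return max(0, min(a.right, b.right) - max(a.left, b.left))


def find_overlaps(boxes: List[CharBox], min_overlap: int = 1) -> List[Tuple[str, str, int]]:
    # Compare each character only with its immediate successor on the line.
    hits = []
    for prev, cur in zip(boxes, boxes[1:]):
        shared = horizontal_overlap(prev, cur)
        if shared >= min_overlap:
            hits.append((prev.char, cur.char, shared))
    return hits


# Box lines copied from the report in this issue.
BOX_LINES = """\
B 210 18 218 48 0
i 210 18 234 47 0
l 237 18 258 48 0
l 259 18 269 48 0
i 270 18 280 48 0
o 282 18 303 41 0
n 305 18 327 41 0
"""

boxes = [parse_box_line(line) for line in BOX_LINES.strip().splitlines()]
overlaps = find_overlaps(boxes)
print(overlaps)  # [('B', 'i', 8)]
```

On the reported data this flags only the B/i pair, which shares 8 pixels horizontally. A filter like this is only as reliable as the boxes it is given, which is the limitation described above.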

@amitdo
Collaborator

amitdo commented Jan 28, 2020

I don't think there is an easy solution that will make Tesseract output accurate bounding boxes.

The neural net does not return bounding boxes. For each glyph it recognizes, it outputs just one point: an x-position on the line. Tesseract tries to build bounding boxes from these points, but in many cases the conversion won't be accurate.
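
To illustrate why a single x-position per glyph underdetermines a box, here is a toy Python sketch. It is not Tesseract's actual conversion. It assumes a hypothetical rule that cuts the line at the midpoint between neighbouring points, and the function name and coordinates are made up for illustration.

```python
from typing import List, Tuple


def spans_from_points(xs: List[float], line_left: float, line_right: float) -> List[Tuple[float, float]]:
    """Turn one x-position per glyph into (left, right) spans.

    Hypothetical rule: place a cut halfway between each pair of
    neighbouring points and use the line edges as the outer cuts.
    """
    cuts = [line_left] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [line_right]
    return list(zip(cuts, cuts[1:]))


# Made-up x-positions for three glyphs on a line spanning x = 210..258.
xs = [214, 222, 247]
spans = spans_from_points(xs, 210, 258)
print(spans)  # [(210, 218.0), (218.0, 234.5), (234.5, 258)]
```

The widths here come entirely from the spacing of the points, not from the glyph shapes, so a narrow glyph next to a wide one can easily receive a box that is too wide or too narrow.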

@amitdo
Collaborator

amitdo commented Feb 7, 2020

In PDF:
Open it in a PDF Viewer like Acrobat and mark "thousan".
marked "thousan".

There is an issue that affects Adobe Acrobat: #2879
