Overlapping Character Boundingboxes #2825
Comments
Refer also to the following thread: I attempted a workaround that filters out duplicate characters by using the character-level box dimensions to identify overlaps, but it did not work because the box dimensions themselves are invalid.
I don't think there is an easy solution that will make Tesseract output accurate bounding boxes. The neural net does not return bounding boxes. It outputs only a single x-position on the line for each glyph it recognizes. Tesseract tries to build bounding boxes from these points, but in many cases this conversion won't be accurate.
There is an issue that affects Adobe Acrobat: #2879
Hi all, hope you have a joyful Christmas time.
Tesseract version: 4.* and 5.alpha.*

Platform: Windows
Command line: tesseract .\billion.png out -l eng -c hocr_char_boxes=1 makebox hocr pdf
On this image:
Result in box file:
B 210 18 218 48 0
i 210 18 234 47 0
l 237 18 258 48 0
l 259 18 269 48 0
i 270 18 280 48 0
o 282 18 303 41 0
n 305 18 327 41 0
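The overlap in the box file above can be checked mechanically. The following is a minimal sketch, assuming the usual box file column order `<glyph> <left> <bottom> <right> <top> <page>`; the names `CharBox`, `parse_box_line` and `horizontal_overlap` are hypothetical helpers introduced for illustration and are not part of Tesseract.

```python
from typing import NamedTuple


class CharBox(NamedTuple):
    """One line of a box file (hypothetical helper type)."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> CharBox:
    # Assumed column order: <glyph> <left> <bottom> <right> <top> <page>.
    # Split from the right so the five numeric fields are always found.
    char, left, bottom, right, top, page = line.rsplit(" ", 5)
    return CharBox(char, int(left), int(bottom), int(right), int(top), int(page))


def horizontal_overlap(a: CharBox, b: CharBox) -> int:
    # Width in pixels of the shared x-range of two boxes; 0 if they are disjoint.
    return max(0, min(a.right, b.right) - max(a.left, b.left))


# Box file content copied from the report above.
BOX_TEXT = """\
B 210 18 218 48 0
i 210 18 234 47 0
l 237 18 258 48 0
l 259 18 269 48 0
i 270 18 280 48 0
o 282 18 303 41 0
n 305 18 327 41 0
"""

boxes = [parse_box_line(line) for line in BOX_TEXT.splitlines() if line.strip()]

# Compare each glyph with its right-hand neighbour and keep overlapping pairs.
overlaps = [
    (a.char, b.char, horizontal_overlap(a, b))
    for a, b in zip(boxes, boxes[1:])
    if horizontal_overlap(a, b) > 0
]
print(overlaps)
```

On this data only the first pair is reported: "B" and "i" both start at x=210, so the box for "B" lies entirely inside the box for "i".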
The same overlap appears in the hOCR output:
<span class='ocrx_cinfo' title='x_bboxes 210 22 218 52; x_conf 99.543304'>B</span> <span class='ocrx_cinfo' title='x_bboxes 210 23 234 52; x_conf 99.536743'>i</span>
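The same check can be run against the hOCR output. This is a rough sketch that uses a regular expression matching only the exact `ocrx_cinfo` span shape shown above (it is not a general hOCR parser), and it assumes `x_bboxes` holds `left top right bottom`. The names `CINFO_RE`, `chars` and `flagged` are hypothetical.

```python
import re

# Matches only the span layout quoted in the report; other hOCR variants
# (different attribute order, double quotes) would need a real HTML parser.
CINFO_RE = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+); "
    r"x_conf ([\d.]+)'>(.*?)</span>"
)

# hOCR fragment copied from the report above.
HOCR = (
    "<span class='ocrx_cinfo' title='x_bboxes 210 22 218 52; x_conf 99.543304'>B</span> "
    "<span class='ocrx_cinfo' title='x_bboxes 210 23 234 52; x_conf 99.536743'>i</span>"
)

# Each entry: (glyph, (left, top, right, bottom), confidence).
chars = [
    (m[6], tuple(int(m[i]) for i in range(1, 5)), float(m[5]))
    for m in CINFO_RE.finditer(HOCR)
]

# Flag neighbouring glyphs where the next box starts before the previous one ends.
flagged = [
    (prev[0], cur[0])
    for prev, cur in zip(chars, chars[1:])
    if cur[1][0] < prev[1][2]
]
print(flagged)
```

Both glyphs carry a confidence above 99, so the confidence value alone does not reveal that the boxes are wrong.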
In PDF:

Open it in a PDF viewer such as Acrobat and select "thousan".
Then press Ctrl-C and paste it into an editor with Ctrl-V: the result is "thousand".
Comment: We see this with German as well, always using the "best" or "fast" traineddata. In bigger files there are many cases like the one above, but I wanted to keep this report as simple as possible.
We could also reproduce this when using the API directly, so I think the cause might be deep in the system.
Versions 4.* and 5.* produce different results: the letters that overlap in 4.* are sometimes not the same ones that overlap in 5.*.
Thank you for all your work. Have a great Christmas time and a happy new year!