Overlapping Character Boundingboxes #2825

RicketyRick · 2019-12-18T07:58:56Z

Hi all, I hope you are having a joyful Christmas time.

Tesseract version: 4.* and 5.alpha.*
Platform: Windows
Command line: tesseract .\billion.png out -l eng -c hocr_char_boxes=1 makebox hocr pdf
On this image:

Result in box file:
B 210 18 218 48 0
i 210 18 234 47 0
l 237 18 258 48 0
l 259 18 269 48 0
i 270 18 280 48 0
o 282 18 303 41 0
n 305 18 327 41 0

Same in hocr:
<span class='ocrx_cinfo' title='x_bboxes 210 22 218 52; x_conf 99.543304'>B</span> <span class='ocrx_cinfo' title='x_bboxes 210 23 234 52; x_conf 99.536743'>i</span>

In PDF:
Open it in a PDF viewer such as Acrobat and select "thousan".

Then press Ctrl-C and paste into an editor with Ctrl-V: the result is "thousand".

Comment: We see this with German as well, always using either the "best" or the "fast" traineddata. Larger files contain many cases like the one above, but I wanted to keep the example as simple as possible.
We could also reproduce this when using the API directly, so I think the cause might be deep in the system.
Versions 4.* and 5.* differ in the outcome: in 4.* the overlapping letters are sometimes different from those in 5.*.

Thank you for all your work. Have a great Christmas and a happy New Year!


RicketyRick · 2019-12-18T07:59:47Z

#1192
#1712
#1015
#2024

woodjohndavid · 2019-12-19T20:36:49Z

Refer also to the following thread:

#2738

I attempted a workaround that filters out duplicate characters by using the character-level box dimensions to identify overlaps, but this did not work because the box dimensions are invalid.
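For reference, an overlap check on character boxes could look roughly like the sketch below. It is not the workaround that was actually tried. It only assumes the box-file layout shown in the original post (`<char> <left> <bottom> <right> <top> <page>`), and the names `CharBox`, `parse_box_line`, `horizontal_overlap` and `overlapping_neighbours` are made up for illustration.

```python
from typing import List, NamedTuple, Tuple


class CharBox(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_line(line: str) -> CharBox:
    # Assumed layout: <char> <left> <bottom> <right> <top> <page>
    char, left, bottom, right, top, _page = line.split()
    return CharBox(char, int(left), int(bottom), int(right), int(top))


def horizontal_overlap(a: CharBox, b: CharBox) -> int:
    # Number of x-pixels shared by the two boxes (0 if they are disjoint).
    return max(0, min(a.right, b.right) - max(a.left, b.left))


def overlapping_neighbours(boxes: List[CharBox]) -> List[Tuple[CharBox, CharBox, int]]:
    # Compare each box with the next one in reading order.
    result = []
    for a, b in zip(boxes, boxes[1:]):
        shared = horizontal_overlap(a, b)
        if shared > 0:
            result.append((a, b, shared))
    return result


# Box-file lines copied from the original post.
BOX_LINES = [
    "B 210 18 218 48 0",
    "i 210 18 234 47 0",
    "l 237 18 258 48 0",
    "l 259 18 269 48 0",
    "i 270 18 280 48 0",
    "o 282 18 303 41 0",
    "n 305 18 327 41 0",
]

if __name__ == "__main__":
    boxes = [parse_box_line(line) for line in BOX_LINES]
    for a, b, shared in overlapping_neighbours(boxes):
        print(f"{a.char!r} and {b.char!r} share {shared} px horizontally")
```

On the sample above this flags only the "B"/"i" pair. A check like this can report overlaps, but it cannot tell which of the two boxes is the wrong one, so it is of limited use for removing duplicates when the dimensions themselves are unreliable.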

amitdo · 2020-01-28T12:24:41Z

I don't think there is an easy solution that will make Tesseract output accurate bounding boxes.

The neural net does not return bounding boxes. It outputs just one x-position on the line for each glyph it recognizes. Tesseract tries to make bounding boxes from these points, but in many cases this conversion won't be accurate.
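To illustrate why a single x-position per glyph is not enough to recover an accurate box, here is a toy sketch. It is explicitly not Tesseract's conversion code: the midpoint-splitting rule, the function name `spans_from_x_positions`, and the sample numbers are all assumptions made up for this example.

```python
from typing import List, Tuple


def spans_from_x_positions(
    xs: List[int], line_left: int, line_right: int
) -> List[Tuple[int, int]]:
    """Guess a (left, right) span per glyph from one x-position each.

    Hypothetical rule: split the line at the midpoint between neighbouring
    positions. The real glyph widths are never observed, only guessed.
    """
    spans = []
    for i, x in enumerate(xs):
        left = line_left if i == 0 else (xs[i - 1] + x) // 2
        right = line_right if i == len(xs) - 1 else (x + xs[i + 1]) // 2
        spans.append((left, right))
    return spans


if __name__ == "__main__":
    # Hypothetical x-positions for three glyphs on a line spanning x=210..258.
    print(spans_from_x_positions([214, 222, 247], 210, 258))
```

With these made-up inputs, the second glyph is assigned a span that is twice as wide as the first, purely because of where its neighbours happen to sit. Any rule of this kind has to infer widths from positions alone, which is why the resulting boxes can be off or overlap.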

amitdo · 2020-02-07T14:43:47Z

In PDF:
Open it in a PDF Viewer like Acrobat and mark "thousan".
marked "thousan".

There is an issue that affects Adobe Acrobat: #2879

erikbs mentioned this issue Sep 24, 2020

Incorrect character bounding boxes #3105

Open

amitdo added the bounding box label Mar 12, 2021

tfmorris mentioned this issue Nov 13, 2023

wrong bbox #3944

Open
