wrong bbox #3944

jjoao · 2022-10-15T19:26:01Z

Environment

Tesseract Version: tesseract 5.2.0-40-g3559
Commit Number: ?
Platform: Linux zdt 5.15.0-48-generic #54-Ubuntu SMP Fri Aug 26 13:26:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Current Behavior:

Bounding boxes in the hOCR and PDF outputs are not correct.
Consider this sentence from the image: "-- Mas vamos ao principio -- respondi eu -- Quem"

Tesseract recognizes it correctly in the plain-text output, but the hOCR and PDF outputs contain bounding box errors:

<div class='ocr_carea' id='block_1_4' title="bbox 473 2435 2213 3180">
    <p class='ocr_par' id='par_1_7' lang='por' title="bbox 473 2435 2213 2805">
     <span class='ocr_line' id='line_1_20' title="bbox 593 2435 2213 2516; baseline -0.005 -17; x_size 79; x_descenders 21; x_ascenders 22">
      <span class='ocrx_word' id='word_1_154' title='bbox 593 2479 656 2485; x_wconf 92'>—</span>
      <span class='ocrx_word' id='word_1_155' title='bbox 685 2440 812 2499; x_wconf 95'>Mas</span>
      <span class='ocrx_word' id='word_1_156' title='bbox 843 2460 1047 2498; x_wconf 96'>vamos</span>
      <span class='ocrx_word' id='word_1_157' title='bbox 1076 2459 1144 2496; x_wconf 79'>ao</span>
      <span class='ocrx_word' id='word_1_158' title='bbox 1164 2435 1824 2516; x_wconf 89'>principio</span>
      <span class='ocrx_word' id='word_1_159' title='bbox 1493 2431 1524 2520; x_wconf 92'>—</span>
      <span class='ocrx_word' id='word_1_160' title='bbox 1553 2435 1821 2516; x_wconf 95'>respondi</span>
      <span class='ocrx_word' id='word_1_161' title='bbox 1857 2456 1915 2493; x_wconf 89'>cu</span>
      <span class='ocrx_word' id='word_1_162' title='bbox 1961 2471 1999 2478; x_wconf 89'>—</span>
      <span class='ocrx_word' id='word_1_163' title='bbox 2021 2435 2213 2508; x_wconf 96'>Quem</span>
     </span>

The word word_1_158 ("principio") has a wrong x2: its box spans x = 1164 to 1824, so it ends after both the start and the end of the next two words, "—" (1493 to 1524) and "respondi" (1553 to 1821).

  |principio                      |
                   |—|  |respondi|

Expected Behavior:

For all adjacent words W1, W2 in a sentence:
xend(W1) <= xstart(W2)
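
The condition above can be checked mechanically on the hOCR output. The following is a minimal sketch, not part of Tesseract: the helper name `find_overlaps` and the regular expression are assumptions made for illustration, and the pattern only matches word spans formatted exactly like the excerpt above (single-quoted attributes, `bbox` first in `title`). It also assumes left-to-right text and compares only adjacent words on one line.

```python
import re

# Matches one hOCR word span as formatted in the excerpt above and captures
# its id and bbox coordinates. This is an illustrative pattern, not a full
# hOCR parser.
WORD_RE = re.compile(
    r"<span class='ocrx_word' id='(?P<id>[^']+)' "
    r"title='bbox (?P<x1>\d+) (?P<y1>\d+) (?P<x2>\d+) (?P<y2>\d+);[^']*'>"
    r"[^<]*</span>"
)


def find_overlaps(hocr_line):
    """Return (id_a, id_b) for adjacent words where xend(a) > xstart(b)."""
    words = [
        (m["id"], int(m["x1"]), int(m["x2"]))
        for m in WORD_RE.finditer(hocr_line)
    ]
    return [
        (a[0], b[0])
        for a, b in zip(words, words[1:])
        if a[2] > b[1]  # violates xend(W1) <= xstart(W2)
    ]


# Four consecutive words copied from the hOCR excerpt in this report.
SAMPLE = """
<span class='ocrx_word' id='word_1_157' title='bbox 1076 2459 1144 2496; x_wconf 79'>ao</span>
<span class='ocrx_word' id='word_1_158' title='bbox 1164 2435 1824 2516; x_wconf 89'>principio</span>
<span class='ocrx_word' id='word_1_159' title='bbox 1493 2431 1524 2520; x_wconf 92'>—</span>
<span class='ocrx_word' id='word_1_160' title='bbox 1553 2435 1821 2516; x_wconf 95'>respondi</span>
"""

if __name__ == "__main__":
    for left, right in find_overlaps(SAMPLE):
        print(f"{left} overlaps {right}")
```

On the sample, the only adjacent pair that violates the condition is word_1_158 followed by word_1_159. Because only neighbours are compared, the overlap between word_1_158 and word_1_160 is not reported separately.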

Suggested Fix:


tfmorris · 2023-11-13T19:54:08Z

There are a number of issues related to overlapping bounding boxes, including #2825, #3611, and #3963.

amitdo added the "bounding box" and "bug" labels on Oct 15, 2022
