wrong bbox #3944

Open
jjoao opened this issue Oct 15, 2022 · 1 comment


jjoao commented Oct 15, 2022

Environment

  • Tesseract Version: tesseract 5.2.0-40-g3559
  • Commit Number: ?
  • Platform: Linux zdt 5.15.0-48-generic #54-Ubuntu SMP Fri Aug 26 13:26:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Current Behavior:

Bounding boxes in the hOCR and PDF outputs are not correct.
Consider the sentence in the image: "-- Mas vamos ao principio -- respondi eu -- Quem"

input image
Tesseract recognizes it correctly in text format, but the hOCR and PDF outputs have errors:

<div class='ocr_carea' id='block_1_4' title="bbox 473 2435 2213 3180">
    <p class='ocr_par' id='par_1_7' lang='por' title="bbox 473 2435 2213 2805">
     <span class='ocr_line' id='line_1_20' title="bbox 593 2435 2213 2516; baseline -0.005 -17; x_size 79; x_descenders 21; x_ascenders 22">
      <span class='ocrx_word' id='word_1_154' title='bbox 593 2479 656 2485; x_wconf 92'>—</span>
      <span class='ocrx_word' id='word_1_155' title='bbox 685 2440 812 2499; x_wconf 95'>Mas</span>
      <span class='ocrx_word' id='word_1_156' title='bbox 843 2460 1047 2498; x_wconf 96'>vamos</span>
      <span class='ocrx_word' id='word_1_157' title='bbox 1076 2459 1144 2496; x_wconf 79'>ao</span>
      <span class='ocrx_word' id='word_1_158' title='bbox 1164 2435 1824 2516; x_wconf 89'>principio</span>
      <span class='ocrx_word' id='word_1_159' title='bbox 1493 2431 1524 2520; x_wconf 92'>—</span>
      <span class='ocrx_word' id='word_1_160' title='bbox 1553 2435 1821 2516; x_wconf 95'>respondi</span>
      <span class='ocrx_word' id='word_1_161' title='bbox 1857 2456 1915 2493; x_wconf 89'>cu</span>
      <span class='ocrx_word' id='word_1_162' title='bbox 1961 2471 1999 2478; x_wconf 89'>—</span>
      <span class='ocrx_word' id='word_1_163' title='bbox 2021 2435 2213 2508; x_wconf 96'>Quem</span>
     </span>

The word word_1_158 ("principio") has a wrong x2: its box ends at 1824, which is past both the start and the end of the next two words, "—" (1493 to 1524) and "respondi" (1553 to 1821).

  |principio                      |
                   |—|  |respondi|

Expected Behavior:

For all adjacent words W1, W2 in a sentence:
xend(W1) <= xstart(W2)
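
A minimal sketch of a script that checks this ordering condition on hOCR output. It is only an illustration: `find_overlaps` and `WORD_RE` are hypothetical names, it uses a regex rather than a real HTML parser, and it assumes left-to-right text with single-quoted `ocrx_word` attributes, as in the fragment quoted above.

```python
import re

# Matches one hOCR word span; captures its id and the x1/x2 values of its bbox.
# Assumption: attributes are single-quoted and ordered class, id, title.
WORD_RE = re.compile(
    r"<span class='ocrx_word' id='(?P<id>[^']+)' "
    r"title='bbox (?P<x1>\d+) \d+ (?P<x2>\d+) \d+[^']*'>"
)


def find_overlaps(hocr_line):
    """Return (left_id, right_id, overlap_px) for adjacent words with x2(W1) > x1(W2)."""
    words = [(m["id"], int(m["x1"]), int(m["x2"])) for m in WORD_RE.finditer(hocr_line)]
    overlaps = []
    # Compare each word only with its immediate right-hand neighbour.
    for (left_id, _, left_x2), (right_id, right_x1, _) in zip(words, words[1:]):
        if left_x2 > right_x1:
            overlaps.append((left_id, right_id, left_x2 - right_x1))
    return overlaps


# Four of the word spans from the hOCR fragment in this report.
sample = """
<span class='ocrx_word' id='word_1_157' title='bbox 1076 2459 1144 2496; x_wconf 79'>ao</span>
<span class='ocrx_word' id='word_1_158' title='bbox 1164 2435 1824 2516; x_wconf 89'>principio</span>
<span class='ocrx_word' id='word_1_159' title='bbox 1493 2431 1524 2520; x_wconf 92'>—</span>
<span class='ocrx_word' id='word_1_160' title='bbox 1553 2435 1821 2516; x_wconf 95'>respondi</span>
"""

print(find_overlaps(sample))
```

On these four spans the check flags only the pair word_1_158 / word_1_159, where the first box extends 331 px past the start of the second.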

Suggested Fix:

tfmorris (Contributor) commented

There are a number of issues related to overlapping bounding boxes, including #2825, #3611, and #3963.
