Bounding boxes in hocr and pdf outputs are not correct. Consider the sentence in the image "-- Mas vamos ao principio -- respondi eu -- Quem"
Tesseract recognizes it correctly in text format, but hocr and pdf have errors:
<div class='ocr_carea' id='block_1_4' title="bbox 473 2435 2213 3180">
 <p class='ocr_par' id='par_1_7' lang='por' title="bbox 473 2435 2213 2805">
  <span class='ocr_line' id='line_1_20' title="bbox 593 2435 2213 2516; baseline -0.005 -17; x_size 79; x_descenders 21; x_ascenders 22">
   <span class='ocrx_word' id='word_1_154' title='bbox 593 2479 656 2485; x_wconf 92'>—</span>
   <span class='ocrx_word' id='word_1_155' title='bbox 685 2440 812 2499; x_wconf 95'>Mas</span>
   <span class='ocrx_word' id='word_1_156' title='bbox 843 2460 1047 2498; x_wconf 96'>vamos</span>
   <span class='ocrx_word' id='word_1_157' title='bbox 1076 2459 1144 2496; x_wconf 79'>ao</span>
   <span class='ocrx_word' id='word_1_158' title='bbox 1164 2435 1824 2516; x_wconf 89'>principio</span>
   <span class='ocrx_word' id='word_1_159' title='bbox 1493 2431 1524 2520; x_wconf 92'>—</span>
   <span class='ocrx_word' id='word_1_160' title='bbox 1553 2435 1821 2516; x_wconf 95'>respondi</span>
   <span class='ocrx_word' id='word_1_161' title='bbox 1857 2456 1915 2493; x_wconf 89'>cu</span>
   <span class='ocrx_word' id='word_1_162' title='bbox 1961 2471 1999 2478; x_wconf 89'>—</span>
   <span class='ocrx_word' id='word_1_163' title='bbox 2021 2435 2213 2508; x_wconf 96'>Quem</span>
  </span>
The word word_1_158 ("principio") has a wrong x2: its box spans x = 1164 to 1824, so it ends after both the start and the end of the next two words, "—" (1493 to 1524) and "respondi" (1553 to 1821).
word_1_158
|principio | |—| |respondi|
Expected behavior: for all adjacent words W1, W2 in a sentence: xend(W1) <= xstart(W2)
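The expected condition can be checked mechanically against the hOCR output. Below is a minimal sketch that does so for a few words copied from the excerpt in this report. The names `WORD_RE`, `parse_words`, and `find_overlaps` are hypothetical helpers written for this illustration and are not part of Tesseract. The regular expression assumes the exact attribute order and quoting shown in the excerpt; a real check would use a proper HTML parser.

```python
import re

# A few word spans copied from the hOCR excerpt in this report.
SAMPLE = """
<span class='ocrx_word' id='word_1_157' title='bbox 1076 2459 1144 2496; x_wconf 79'>ao</span>
<span class='ocrx_word' id='word_1_158' title='bbox 1164 2435 1824 2516; x_wconf 89'>principio</span>
<span class='ocrx_word' id='word_1_159' title='bbox 1493 2431 1524 2520; x_wconf 92'>—</span>
<span class='ocrx_word' id='word_1_160' title='bbox 1553 2435 1821 2516; x_wconf 95'>respondi</span>
<span class='ocrx_word' id='word_1_161' title='bbox 1857 2456 1915 2493; x_wconf 89'>cu</span>
"""

# Assumption: attributes appear in this order and use single quotes, as in the excerpt.
WORD_RE = re.compile(
    r"<span class='ocrx_word' id='(?P<id>[^']+)' "
    r"title='bbox (?P<x1>\d+) (?P<y1>\d+) (?P<x2>\d+) (?P<y2>\d+);[^']*'>"
    r"(?P<text>[^<]*)</span>"
)


def parse_words(hocr):
    """Return (id, text, x1, x2) for each ocrx_word span, in document order."""
    return [
        (m["id"], m["text"], int(m["x1"]), int(m["x2"]))
        for m in WORD_RE.finditer(hocr)
    ]


def find_overlaps(words):
    """Return (left_id, right_id, overlap_px) for each adjacent pair
    that violates xend(W1) <= xstart(W2)."""
    violations = []
    for (id1, _, _, xend), (id2, _, xstart, _) in zip(words, words[1:]):
        if xend > xstart:
            violations.append((id1, id2, xend - xstart))
    return violations


for left, right, px in find_overlaps(parse_words(SAMPLE)):
    print(f"{left} overlaps {right} by {px} px")
# prints: word_1_158 overlaps word_1_159 by 331 px
```

Only adjacent pairs are compared, so the overlap between word_1_158 and word_1_160 is not reported separately; comparing each word against every later word on the same line would catch it as well.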
There are a number of issues related to overlapping bounding boxes, including #2825, #3611, and #3963.