Incorrect word coordinates #1712
Comments
Seems like this could be related to the issue I'm facing.
I don't think there is a bug in this case. The LSTM network only returns xcoords. AFAIK, each xcoord is just one spot (pixel) in each recognized glyph. The spot can be at the beginning, middle or end of each glyph. Because the LSTM net does not produce bboxes, Tesseract tries to estimate the bboxes from these xcoords. This can't be done accurately. In cases where some bboxes are just complete garbage, there is certainly a bug somewhere, probably in the conversion from xcoords to bboxes. Don't mix up the 'no bug' cases and the buggy cases.
Thinking about this again. As said, the cases with complete garbage boxes are certainly bugs.
Hey there, I invested another day and got into the LSTM decoder. I found nothing wrong with it, so I tested the output of different LSTM models for the same and for different languages. It turns out that this behavior is strongly model dependent: for example, the standard and best English models as well as the fast German model give the correct bounding boxes for the "thousand Billion" example, while other models for these languages fail to do so. I will write a little function that first uses Tesseract's layout analysis to obtain word (and probably character) positions and then overwrites the LSTM results with the results of the layout analysis when possible. So the source of the bug is unfixable for me, but a workaround is possible. At the moment I'm planning to create this workaround function only for my own project, but I'm willing to integrate it into the tesseract command line tool if someone is interested in using it.
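A minimal sketch of what such a workaround could look like, since the actual code was not shared in this thread. Everything here is an assumption for illustration: the `Word` type, the `override_boxes` and `overlap_ratio` helpers, the greedy matching and the `min_overlap` threshold are hypothetical and not part of Tesseract. The idea is only to replace an LSTM word box with a layout-analysis box when the two clearly overlap, and to keep the LSTM box otherwise.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Word:
    text: str
    box: Box


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by the area of the smaller box."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    if width <= 0 or height <= 0:
        return 0.0
    smaller = min((a[2] - a[0]) * (a[3] - a[1]),
                  (b[2] - b[0]) * (b[3] - b[1]))
    return (width * height) / smaller if smaller > 0 else 0.0


def override_boxes(lstm_words: List[Word], layout_boxes: List[Box],
                   min_overlap: float = 0.5) -> List[Word]:
    """Replace each LSTM word box with the best-overlapping layout box.

    Hypothetical greedy matching: a layout box is used at most once, and
    a word without a sufficiently overlapping layout box keeps its LSTM
    box (the "when possible" part of the workaround).
    """
    used = set()
    result = []
    for word in lstm_words:
        best_index: Optional[int] = None
        best_score = 0.0
        for i, candidate in enumerate(layout_boxes):
            if i in used:
                continue
            score = overlap_ratio(word.box, candidate)
            if score >= min_overlap and score > best_score:
                best_index, best_score = i, score
        if best_index is None:
            result.append(word)
        else:
            used.add(best_index)
            result.append(Word(word.text, layout_boxes[best_index]))
    return result
```

With made-up coordinates, `override_boxes([Word("thousand", (10, 5, 95, 30)), Word("Billion", (95, 5, 170, 30))], [(10, 5, 104, 30), (112, 5, 170, 30)])` moves the boundary between the two words to where the layout boxes put it. A greedy pass like this can pick a poor match on crowded lines, so a real implementation would probably need something more careful.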
@Sintun I think this function will be useful for many Tesseract users, especially those who have been relying on the character coordinates in older versions. It would be great if you could integrate it into Tesseract.
Hi @FarhadKhalafi,
Hi @Sintun,
Testing specifically would be greatly appreciated :) Sharing the workaround code would be of little value, because in my project it solves the problem after the Tesseract API calls (even after the tess object is destroyed). I will refactor it so that this happens within Tesseract.
Even sharing that might help someone, Sintun. A person on my team is also writing such an after-API fix for Python (it corrects the hOCR). I'll share that as well when it's done.
Hey there, I just started to write my workaround and looked at the state of affairs.
Please test the current code.
When using LSTM models, the accuracy of character bounding boxes is low, with many blobs assigned to the wrong characters. This is because the LSTM model output contains only approximate character positions without boundary data. As a result, the input blobs cannot be accurately mapped to characters, which compromises the accuracy of the character bounding boxes.

Currently this problem is handled as follows. The character boundaries are computed from the character positions in the LSTM output by placing each boundary midway between two adjacent character positions. Each blob is then assigned to the character whose interval contains the center of the blob. In other words, the blobs are assigned to the nearest characters. Unfortunately this produces a lot of errors, because the character positions in the LSTM output have a tendency to drift, so the nearest character is often not the right one.

Fortunately, while the LSTM model produces only approximate positions, the blob boundaries produced by the regular segmenter are pretty good. Most of the time a single blob corresponds to a single character and vice versa. This observation is used to build an optimization algorithm that treats the output of the regular segmenter as a template to which the LSTM model output is matched. The best match is selected by assigning a cost to each unwanted property of the outcome and then minimizing the total cost of the solution.

This reliably fixes the most frequent error in the existing solution, where blobs are simply assigned to the wrong character. As a result, the new algorithm produces up to 20 times fewer errors.

Fixes tesseract-ocr#1712.
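The difference between the two approaches can be illustrated with a small one-dimensional sketch. This is not the code from the commit. The function names, the cost terms (distance, `skip_penalty` for a character without a blob, `merge_penalty` for an extra blob on the same character) and their values are assumptions chosen only to show how nearest-character assignment breaks under drift and how a cost-minimizing match can recover.

```python
import bisect
from typing import List, Sequence, Tuple

Blob = Tuple[float, float]  # (left, right) x-extent of a segmenter blob


def centers(blobs: Sequence[Blob]) -> List[float]:
    return [(left + right) / 2.0 for left, right in blobs]


def assign_nearest(blobs: Sequence[Blob], char_xs: Sequence[float]) -> List[int]:
    """Baseline: boundaries midway between character positions; each blob
    goes to the character whose interval contains the blob center."""
    boundaries = [(a + b) / 2.0 for a, b in zip(char_xs, char_xs[1:])]
    return [bisect.bisect_right(boundaries, c) for c in centers(blobs)]


def assign_min_cost(blobs: Sequence[Blob], char_xs: Sequence[float],
                    skip_penalty: float = 20.0,
                    merge_penalty: float = 20.0) -> List[int]:
    """Left-to-right assignment of blobs to characters with minimal total
    cost, found by dynamic programming (hypothetical cost model)."""
    cs = centers(blobs)
    n, m = len(cs), len(char_xs)
    if n == 0 or m == 0:
        return []
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    back = [[-1] * m for _ in range(n)]
    for j in range(m):
        # Characters before j receive no blob at all.
        cost[0][j] = abs(cs[0] - char_xs[j]) + j * skip_penalty
    for i in range(1, n):
        for j in range(m):
            for k in range(j + 1):  # k: character of the previous blob
                step = merge_penalty if k == j else (j - k - 1) * skip_penalty
                total = cost[i - 1][k] + step
                if total < cost[i][j]:
                    cost[i][j] = total
                    back[i][j] = k
            cost[i][j] += abs(cs[i] - char_xs[j])
    # Characters after the last used one are also skipped.
    last = min(range(m),
               key=lambda j: cost[n - 1][j] + (m - 1 - j) * skip_penalty)
    result = [last]
    for i in range(n - 1, 0, -1):
        last = back[i][last]
        result.append(last)
    return result[::-1]


if __name__ == "__main__":
    blobs = [(4, 16), (24, 36), (44, 56)]  # three well-separated glyphs
    drifted = [22, 41, 58]                 # LSTM positions drifted right
    print(assign_nearest(blobs, drifted))   # [0, 0, 2]
    print(assign_min_cost(blobs, drifted))  # [0, 1, 2]
```

In this made-up example the baseline gives the second blob to the first character and leaves the second character without any blob, while the cost-based match keeps the one-blob-per-character structure suggested by the segmenter.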
Environment
Tesseract Version:
(says beta.1, but beta.3 seems to be correct)
Commit Number: AppVeyor: 4.0.0-beta.3.1776
Platform: Windows 10 64bit (but tesseract running as 32bit)
Tessdata: tessdata-fast
We've integrated the engine in our (closed-source) application using the C API, so I cannot share the actual code.
What I do is basically just iterate over the result iterator using RIL_WORD, get the text, bounding box and baseline for each word, and then create a PDF out of it, with the recognized text as a red overlay and the bounding boxes drawn in green for better visibility.
But the official PDF output config has the same flaw, it's just more difficult to spot:
produces the following PDF:
andromeda.tess4cli.pdf
Files:
Behavior can be reproduced using the following PNG (which is a part of a bigger file):
Current Behavior:
The coordinates are correct for most words.
But for some words, there seems to be an error in computing the boundaries between the words.
Curiously, the wrong boundary is always before the last character of the previous word.
It almost looks like some kind of off-by-one error.
Expected Behavior:
Result with tesseract 3: