LSTM Engine Diplopia Issue and Inaccurate HOCR Character Level Box Dimensions #3477
Comments
It should be possible to filter cases of diplopia (for testing) if ground truth is available:
This is a typical one (punctuation at end of line):
Or difficult shape/separation:
|
What is the status of this pull request? In my personal opinion, Also, |
The problem seems to come from characters that are joined, but the right coordinate seems to be more reliable. In my situation I just look at the right coordinate to correct the left coordinate. |
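One possible reading of the workaround in the comment above, as a minimal sketch: assuming the character boxes of one text line arrive left to right as `(left, top, right, bottom)` tuples, and assuming the right edges are the trustworthy ones, each left edge can be clamped to the previous box's right edge. The function name and the tuple layout are hypothetical and are not part of Tesseract or of any hOCR tooling.

```python
def clamp_left_edges(boxes):
    """Return boxes whose left edge never starts before the previous right edge.

    boxes: list of (left, top, right, bottom) tuples for one text line,
    ordered left to right. Assumption: right edges are reliable.
    """
    fixed = []
    prev_right = None
    for left, top, right, bottom in boxes:
        if prev_right is not None and left < prev_right:
            # Overlap with the previous character: trust its right edge,
            # but never let the box end up with negative width.
            left = min(prev_right, right)
        fixed.append((left, top, right, bottom))
        prev_right = right
    return fixed
```

This only removes horizontal overlap between neighbours; it does nothing for boxes with a wrong height or a wrong right edge.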
With the pull request above: #3476 Some boxes still have a weird height for some reason (they span above and below the letter), but the result is about 10x better. I support merging this pull request. |
I have filtered out boxes that are mostly correct (some correct boxes with @woodjohndavid Does it give you enough info to improve or should I look myself into your pull request? Below clear test image (Czech language, process with |
Hmm, now I have built just main without the patch, and it may have been improved by other patches and that pull request actually does nothing for me. |
Can anybody point me to the code that causes these diplopias? |
@woodjohndavid Can you point me to the code where the LSTM engine returns its values? |
@exander77 glad you are interested in addressing this issue. I would encourage you to re-read the starting entry on this thread, which, among other things, explains my understanding of why the bounding boxes are inaccurate. This is in a way related to the diplopia issue, in that the ultimate fix for both lies in the same area I believe. The code you are looking for is found in recodebeam.cpp. Method ComputeTopN is the start where the LSTM engine incoming results are first processed. |
@woodjohndavid Yes, I am interested. I can't get the hang of how the results are obtained from the network. Where is the information about the position of the character and its width available? I see that each output contains a number of floats and that is passed to |
Interesting example. The text recognition with CTC/LSTM seems very accurate and has no diplopia with your sample. It's a character bounding box problem (with some influence of a not so perfect training model for CES). I tried to apply my script hacked together for PR#3599 and #3787. Find results here https://github.com/wollmers/ocr-bbox-gt/tree/main/data/issue_3477. Read #3599 (comment) for an explanation of how it works. What I didn't take into account are the 3 different fonts (bold, regular, italic) in your example. If we want to measure statistically the "best" width of a character, we must do this per font. This is a chicken-and-egg problem: we need correct bounding boxes to identify the font, but we also need the font identified to get correct bounding boxes. Also, even with I still have a solution for font classification, which needs good quality bounding boxes, to measure features like width, height, aspect, density, vertical position, ascender, descender. Thanks to your input I got new ideas for improvements. |
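The per-font width statistic mentioned in the comment above could be sketched as follows. This is only an illustration under stated assumptions: the input is a list of `(font, char, left, right)` samples, the font label is already known (which is exactly the chicken-and-egg part), and the median stands in for the "best" width. None of these names come from the linked script.

```python
from collections import defaultdict
from statistics import median


def median_widths(samples):
    """Median box width per (font, char) pair.

    samples: iterable of (font, char, left, right) tuples.
    Assumption: the font label of every sample is already known.
    """
    widths = defaultdict(list)
    for font, char, left, right in samples:
        widths[(font, char)].append(right - left)
    # The median is less sensitive than the mean to a few badly wrong boxes.
    return {key: median(values) for key, values in widths.items()}
```

A box whose width differs strongly from the median for its `(font, char)` pair would then be a candidate for correction.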
@wollmers Yes, the diplopia is not an issue for me, the characters appear only once in the output stream. The character wrongly appears in two bounding boxes, or the bounding boxes are generally inaccurate. Also, I think there actually is a recognition problem as well. Word: Word Word Also: Interestingly, what I wanted to do is to identify bold text, and I was unable to do so, because the bounding boxes are not correct. If there is a solution for font classification, I would have use for it. Just compiling with the PRs you mentioned (and the diplopia one as well). See how that behaves. |
Also, the symbol |
Which version are you using? With Tesseract 5.1.0 on Intel Mac I get:
With
I get nice bounding boxes (few misinterpretations like speckles, split of Háček, overlaps because italic/kerning): |
@wollmers I built head of master (
|
Attached as zip for the possibility it gets altered. |
There are still the typical bbox errors: My observation is, that the number of boxes is correct (each recognised character has a box), but many boxes have wrong positions and/or width. |
@wollmers Yes, it is in no way perfect. Legacy is far superior to this. Except for |
FYI I've updated #3787 with some bug fixes that I found since the initial implementation. |
Environment: Tesseract Latest Master from GitHub, Ubuntu 20.04.2
User References: @bertsky @stweil
Background
The problem named Diplopia (courtesy of @bertsky) occurs when more than one character appears in the LSTM output character stream for the same physical area of the original image.
I encountered this issue early on in my use of Tesseract, and reported it on earlier thread #2738. It has also been reported by many others. I then attempted to implement a workaround outside of the Tesseract code itself, using the HOCR output format character level box dimensions to try to identify overlapping characters. This was unsuccessful because, as it turns out, the character level box dimensions are inaccurate for LSTM generally, and are in fact guaranteed to be inaccurate when diplopia occurs.
So I then downloaded the latest Tesseract Master code and embarked on an expedition to try to understand how it works and see if I could come up with a fix for diplopia. The rest of this post documents the key results of my investigation.
Initial Diplopia Fix
I have just now created Pull Request #3476 which I hope is an adequate fix for most diplopia cases. See the PR for more details.
This fix generally follows the current style of the RecodeBeamSearch, which attempts to assemble the character level output stream from the lower level LSTM NetworkIO matrix output. This matrix output delivers a set of entries for each timestep in the LSTM process, each entry consisting of a potential matching character and a likelihood score (key) in the range from 0.0 to 1.0.
There is nothing in the current matrix output that identifies the physical location of the possible match in the source image. Consequently, my fix attempts to identify possible diplopia by looking for two matrix output entries in a given timestep which have what could be called a 'meaningful' score, that is, a score that is high enough to indicate it is likely a 'real' match. If two such entries are found in the same timestep, then the fix tries to prevent any beam from subsequently containing both.
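The detection idea in the paragraph above can be illustrated with a small sketch. It is not the code of the pull request: the matrix is modelled as a plain list of per-timestep score rows, and the threshold value, the function name and the optional `null_label` (to skip a blank/null class, if the model has one) are all assumptions made for the example.

```python
def find_diplopia_timesteps(scores, labels, threshold=0.25, null_label=None):
    """Return {timestep: [labels]} for timesteps with two or more strong scores.

    scores: list of rows, one per timestep; row[i] is the score (0.0 to 1.0)
            of labels[i] at that timestep.
    threshold: hypothetical cut-off for a 'meaningful' score.
    null_label: optional label to ignore (assumed blank/null class).
    """
    flagged = {}
    for t, row in enumerate(scores):
        strong = [
            labels[i]
            for i, score in enumerate(row)
            if score >= threshold and labels[i] != null_label
        ]
        if len(strong) >= 2:
            flagged[t] = strong
    return flagged
```

A beam search could then refuse to extend a beam with one of the flagged characters if the beam already contains the other one from the same timestep, which is the behaviour the fix aims for.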
Inaccurate LSTM HOCR Character Level Box Dimensions
I had originally tried to use the HOCR dimensions as a workaround to fix diplopia, but found them inaccurate. I then pursued the diplopia fix above separately from this issue, but I have looked at how these dimensions are created and am of the opinion that the current implementation cannot ever be successful. What it does now consists of three sequential stages:
Long Term Solution to LSTM Diplopia and Character Box Dimensions
So as it turns out, these issues are in fact related, or at least the solution for both is. What both of them really need is the precise physical image location of the character match being attempted. If the character box dimensions were accurate, then diplopia could be solved either during the RecodeBeamSearch, or after the LSTM engine has done its thing. It would have to be determined how much of a physical overlap would mean we have diplopia, but that could be an easy configuration setting.
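If accurate boxes were available, the overlap check described above might look like the following sketch. The box layout `(left, top, right, bottom)`, the function names and the `min_overlap` setting (standing in for the configuration value mentioned above) are assumptions for illustration only.

```python
def horizontal_overlap_ratio(a, b):
    """Overlap width of two boxes divided by the width of the narrower box."""
    overlap = min(a[2], b[2]) - max(a[0], b[0])
    narrower = min(a[2] - a[0], b[2] - b[0])
    if overlap <= 0 or narrower <= 0:
        return 0.0
    return overlap / narrower


def find_diplopia_pairs(chars, min_overlap=0.5):
    """Return neighbouring character pairs whose boxes overlap too much.

    chars: list of (text, (left, top, right, bottom)) in reading order.
    min_overlap: hypothetical configurable threshold between 0.0 and 1.0.
    """
    pairs = []
    for (text1, box1), (text2, box2) in zip(chars, chars[1:]):
        if horizontal_overlap_ratio(box1, box2) >= min_overlap:
            pairs.append((text1, text2))
    return pairs
```

Dividing by the narrower box means a thin character fully covered by a wide one still counts as a complete overlap.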
As I see it, therefore, the LSTM matrix processing that uses the NetworkIO interface needs to add two things to its return values (in addition to the possible character and the likelihood score): the starting pixel location of the possible match, and the horizontal size of the potential match image from the training data. Once that is done, the rest should be relatively straightforward.
Having said that, I have spent a fair bit of time trying to understand the matrix operations, but so far have not worked out how to accomplish the above suggestion. It MUST be the case that somewhere down in there the location information can be retrieved, and I intend to continue to look. But if anybody can give me some hints, it would be appreciated.