Duplicate Characters in Output Stream #2738
IMO it's perfectly legitimate to raise this issue again here. It has already surfaced several times under different names and descriptions, e.g. #1465. The usual recommendation is to improve the model quality, and this does of course reduce the likelihood of this happening. Nevertheless, the underlying flaw (you could also call it a bug) in the basic CTC implementation is still there, and it is more likely to surface when decoding less probable output segments (as happens with [...]). I have (tentatively) termed the phenomenon of fake CTC duplicates diplopia, and recommended using Equal Spacing CTC or similar as a mitigation. |
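For readers unfamiliar with CTC, the following is a minimal sketch of textbook best-path (greedy) CTC decoding. It is not Tesseract's decoder, which uses a beam search; the `BLANK` symbol and the frame sequences are made up for illustration. It shows one plausible way a duplicate can appear: runs of the same label are merged only when they are adjacent, so a single spurious blank inside the frames of one glyph yields that glyph twice.

```python
from itertools import groupby

BLANK = "_"  # hypothetical stand-in for the CTC blank (null) label


def ctc_greedy_collapse(frame_labels):
    """Merge runs of identical labels, then drop the blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)


# One glyph spread over several consecutive frames collapses to one character.
clean = ctc_greedy_collapse(list("__CCC__ee_l_l__"))

# A spurious blank in the middle of the 'C' frames splits the run in two,
# so the same glyph is emitted twice.
split = ctc_greedy_collapse(list("__C_C__ee_l_l__"))

print(clean, split)
```

The blank between the two `l` runs is what legitimately allows a double letter; the same mechanism produces the unwanted second `C` in the second sequence.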
Thanks for the response, Bertsky. Hopefully someone will take a look at fixing this issue. In the meantime, I have implemented a workaround: using the character-level hOCR output, I scan for characters whose box dimensions overlap 'significantly' and then keep only the highest-confidence character from each group of duplicates. Another small question: could you please tell me where to post issues (not just questions) about Tesseract? Is the Google tesseract-dev group active? My posting there received no response. Is this GitHub Issues section the right place? |
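The workaround described above might look roughly like the sketch below. The poster's actual code is not shown in the thread, so everything here is an assumption: the `CharBox` type, the overlap measure (intersection area relative to the smaller box), the `min_overlap` threshold of 0.8, and the choice to compare only neighbouring characters in reading order. The sample values are invented.

```python
from dataclasses import dataclass


@dataclass
class CharBox:
    text: str
    x0: int
    y0: int
    x1: int
    y1: int
    conf: float


def overlap_ratio(a, b):
    """Intersection area divided by the area of the smaller box."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    if w <= 0 or h <= 0:
        return 0.0
    smaller = min((a.x1 - a.x0) * (a.y1 - a.y0),
                  (b.x1 - b.x0) * (b.y1 - b.y0))
    return (w * h) / smaller if smaller > 0 else 0.0


def drop_overlapping_duplicates(chars, min_overlap=0.8):
    """Among neighbouring characters whose boxes overlap, keep the most confident."""
    kept = []
    for ch in chars:
        if kept and overlap_ratio(kept[-1], ch) >= min_overlap:
            if ch.conf > kept[-1].conf:
                kept[-1] = ch  # replace the weaker alternative
        else:
            kept.append(ch)
    return kept


# Invented example: "1" and "l" compete for the same glyph position.
chars = [
    CharBox("1", 10, 10, 20, 30, 97.0),
    CharBox("l", 11, 10, 20, 30, 62.0),
    CharBox("0", 24, 10, 36, 30, 95.0),
]
result = "".join(c.text for c in drop_overlapping_duplicates(chars))
print(result)
```

A filter like this can only be as reliable as the reported character boxes, and a suitable threshold would have to be found empirically per language or script.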
That's a very good workaround, and it would also work inside the beam decoder. It's only a question of finding the best parameter set (maximum confidence, minimum overlap absolute/relative) for different languages/scripts objectively (i.e. on large corpora)... But then again, if we had such a test system, we could quickly evaluate the impact of equal spacing CTC as well.
You are already in the right place for (possible) bugs and feature requests. As for mailing groups, I'm not qualified to answer that. |
Hello again. Well, it turns out that my workaround is not a good solution after all, because the character-level box dimensions are not accurate in some cases. So this really needs to be promoted to a bug of some kind, at least as far as the determination of the character-level box dimensions is concerned. Attached is definitive proof of one case, although I have encountered many of them. It concerns the word "Cell" in the following sample image run through Tesseract. Attached are the following related files: Sample Boxes Original.png (the original image fed into Tesseract). Following is the snippet from the hOCR for the word "Cell", which is on its own near the center of the original image.
If you examine this case, you will see that the box dimensions for the letters 'C' and 'e' overlap significantly, so my attempted workaround for removing duplicates removes the letter 'C' from my output. However, if you actually look at the boxes on the source image (see my paint.net screenshots), you will see that the box for the letter 'e' simply makes no sense and cannot possibly be what Tesseract used to extract the letter 'e' with a confidence level of 99.56. I have encountered many such examples, a lot of them where the box used to correctly select a particular character covers an area that includes the previous or next character as well. |
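To check cases like this one programmatically, the character boxes can be pulled out of the hOCR and compared. The sketch below assumes that character-level hOCR output marks each character as an `ocrx_cinfo` span whose `title` carries `x_bboxes` and `x_conf`; verify this against your own output before relying on it. The sample markup is invented and is not the "Cell" snippet from this thread.

```python
import re
from html import unescape

# Assumed shape of a character-level hOCR span (check against real output).
CINFO = re.compile(
    r"<span class=['\"]ocrx_cinfo['\"] title=['\"]"
    r"x_bboxes (\d+) (\d+) (\d+) (\d+); x_conf ([\d.]+)['\"]>(.*?)</span>"
)


def parse_char_boxes(hocr):
    """Return (text, x0, y0, x1, y1, conf) tuples in document order."""
    return [
        (unescape(text), int(x0), int(y0), int(x1), int(y1), float(conf))
        for x0, y0, x1, y1, conf, text in CINFO.findall(hocr)
    ]


def horizontal_overlap(a, b):
    """Width in pixels shared by two boxes along the x axis (0 if disjoint)."""
    return max(0, min(a[3], b[3]) - max(a[1], b[1]))


# Invented sample: two neighbouring characters whose boxes overlap.
SAMPLE = (
    "<span class='ocrx_cinfo' title='x_bboxes 100 50 112 70; x_conf 98.5'>a</span>"
    "<span class='ocrx_cinfo' title='x_bboxes 104 50 118 70; x_conf 99.1'>b</span>"
)

boxes = parse_char_boxes(SAMPLE)
shared = horizontal_overlap(boxes[0], boxes[1])
print(boxes, shared)
```

Drawing the parsed boxes onto the source image, as was done manually here, is then a quick way to spot boxes that swallow a neighbouring character.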
Thanks @woodjohndavid for providing details. I can confirm this with the current master. Here are all the boxes of that word: That's clearly a bug. Looking at the debug log with
...(from
...(from
@stweil, do you think this could be related to your and Noah's fixes in #2576? |
Thanks Bertsky for confirming the issue. As a Tesseract newbie, could I impose upon you yet again to give me some idea of when and how bugs are prioritized and potentially worked on? Is there any Tesseract development activity actually underway at this point? I understand fully that Tesseract is open source, and hence I have no basis for any expectations whatsoever. But I would like to understand what the current state of development activity is. I doubt that I have the necessary technical skills to contribute to Tesseract development, but would be interested to know how one gets involved in that if one chooses. Thanks in advance for whatever light you can shed on this for me. |
@woodjohndavid I can only give you my personal impression on the questions you just raised. This is obviously a diverse and open community; the perspectives and circumstances of contributors/developers vary substantially. What gets done, and how soon, depends on many things, notably:
For current development efforts, cf. https://github.com/tesseract-ocr/tesseract/wiki/Planning. If you want to contribute yourself,
|
OK, thanks @bertsky, much appreciated. I realize that this is not the right forum for these kinds of learning questions, but I have had little luck in getting anyone else to respond to them. So just one more, if you would be so kind: is there a leader or manager of the code base who is responsible for vetting contributions before they enter the main code branch? If so, who? Thanks again. |
There are people here with write permissions, but the reviewing work itself is usually shared. You can find out more by looking at the closed PRs or the contributor list. |
Is there any likelihood that the issue of inaccurate character level bounding box dimensions will be addressed sometime soon? Of course, the real underlying issue is that the Tesseract LSTM engine is including multiple alternative characters in the output stream. However, it seems likely that the latter issue would be harder to correct. If the character level box dimensions could be made accurate, then the workaround that I proposed earlier in this thread for the duplicate character issue would in fact work. |
I second this. The wrong character-level boxes prevent us and our partner companies from using Tesseract, so we have to subscribe to the bad and expensive APIs of Abbyy and OmniPage. I would rather use Tesseract. |
@woodjohndavid, @RicketyRick, the development process is currently entirely community driven. Code changes are provided by volunteers who might have other priorities than you. So it is up to you to find and suggest a solution by providing a pull request - unless someone else does it. |
@stweil thank you, I will try, but the codebase is really big. Is there any help to find a short cut to the sources that might be of interest concerning the bounding box issue? |
There are numerous overlapping issues that have been raised related to this same subject. In perusing a few of them, the names that come up frequently include @Sintun @theraysmith @jbreiden @stweil @noahmetzger who seem to be knowledgeable in this area of functionality and code. Perhaps those gentlemen could give some direction on where to look in the code. This seems to be directly related to #2576 |
@clavelc, most of what you say is not related – please help keeping issues to the point!
You can do that easier with a parameter:
Yes, your columns are very close to each other, so the lines should help.
Yes you can: for this kind of table, you can easily use the |
Thanks for your answer,
Sorry for that, will do !
Thanks for the tip, I looked up --user-pattern the other day but couldn't figure out how to apply it to my table. I'll try again. Have a good day |
I have downloaded the latest master code branch version and am experimenting with the code under Ubuntu on two fronts:
|
I have just created pull request #4211, which I consider to be an improved solution for diplopia. I encourage everyone on this thread to try it out and test it with as broad a range of cases as possible. Note, by the way, that there are some new configuration values that can only be set in code as things stand. These configuration values are:
bool kRemoveDiplopia - if true, enables the diplopia removal functionality; if false, my changes have no effect.
Obviously, if my diplopia change is of value, then these configuration items should be made into settings. |
Please refer to the following link:
#2635
This concerns changes made to lstm_choices_mode.
Unless I misunderstand what these options are supposed to do, there appears to be a bug or oversight. Please refer to this user area thread:
https://groups.google.com/forum/#!topic/tesseract-ocr/5tC6appoUgE
There seems to be no way to prevent lstm from including duplicates in the generated text and/or HOCR output. The example in the thread above is a clear example of this.
Surely there must be some way to force Tesseract to include only the highest confidence level choice of character when there are multiple possibilities.
Also, apologies if this is posted in the wrong place, and apologies for possible duplicate postings. I am a Tesseract newbie so trying to learn the ropes.
Thanks.