improve line detection on skewed images #262
Comments
@eroux I am sorry for being late in resolving the issue. I have one strategy in mind. The post-processing box ordering gives us a list in which all the boxes belonging to a particular line are added to a nested list. Since line or segment information is present in both the Google Vision JSON and the HOCR HTML output, I thought we could keep the original information about which boxes belong to which line in a variable. After post-processing, we can compare the number of lines in the original line information with the number of post-processed lines. If the difference is huge, it is most likely that our post-processing was too strict, and we can directly use the original line information; otherwise we can use the post-processed line order.
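The fallback described above can be sketched roughly as follows. This is only an illustration, not the code that ended up in the PR: the function name `choose_line_grouping` and the `max_relative_diff` cutoff are hypothetical, and what counts as a "huge" difference is an assumption that would need tuning on real pages.

```python
from typing import List, Sequence, TypeVar

Box = TypeVar("Box")


def choose_line_grouping(
    original_lines: Sequence[List[Box]],
    postprocessed_lines: Sequence[List[Box]],
    max_relative_diff: float = 0.5,  # hypothetical cutoff, not a value from the project
) -> List[List[Box]]:
    """Pick between the OCR engine's own lines and the post-processed lines.

    Both arguments are nested lists: one inner list of boxes per line.
    """
    n_orig = len(original_lines)
    n_post = len(postprocessed_lines)
    if n_orig == 0:
        # No original line information to fall back on.
        return list(postprocessed_lines)
    relative_diff = abs(n_post - n_orig) / n_orig
    if relative_diff > max_relative_diff:
        # The line counts disagree a lot: assume post-processing was too strict.
        return list(original_lines)
    return list(postprocessed_lines)
```

A relative difference is used here instead of an absolute one so that the same cutoff behaves similarly on pages with few and with many lines; that is a design choice of this sketch only.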
well, that's an option yes. I believe the post-processing algorithm is pretty sophisticated and flexible, I really think it can be fixed easily by tweaking a few parameters, or maybe there's a bug that could be easy to fix. Perhaps we could look at that first?
Post-processing relies on a threshold which is hard-coded. That's why we are having the issue.
I think we need to find a way to calculate this threshold, or we can go with the above option.
yes, let's tweak the threshold a bit, but for some reason I think the value is more or less fine, it's probably a bug in the algorithm, let's first fix what we have before implementing a more complex algorithm
By tweaking the threshold, do you mean sending the threshold as a parameter?
oh I just meant hardcoding a different value
I think that will be an issue in the future with different kinds of pecha.
why? the threshold is proportionate to the average stack box
https://github.com/OpenPecha/Toolkit/blob/master/openpecha/formatters/ocr/ocr.py#L185C21-L185C21
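To make the threshold discussion concrete, here is a minimal sketch of line grouping with a threshold that scales with the average box height. It is not the code at the link above: the `Box` class, `group_into_lines`, and the `ratio` value are all assumptions made for illustration. It does show why a relative threshold can still split a skewed line: every box is compared with the first box of the line, so the vertical drift adds up across the page.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Box:
    x: float       # horizontal position of the box (hypothetical field)
    y: float       # vertical center of the box (hypothetical field)
    height: float


def group_into_lines(boxes: List[Box], ratio: float = 0.5) -> List[List[Box]]:
    """Group boxes into lines, top to bottom, using a relative threshold."""
    if not boxes:
        return []
    # The threshold is proportional to the average box height, not a fixed pixel value.
    threshold = ratio * mean(b.height for b in boxes)
    lines: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b.y):
        # Compare with the first (topmost) box of the current line.
        if lines and abs(box.y - lines[-1][0].y) <= threshold:
            lines[-1].append(box)
        else:
            lines.append([box])
    # Order the boxes inside each line from left to right.
    return [sorted(line, key=lambda b: b.x) for line in lines]
```

With boxes of height 20 the threshold is 10, so a single line whose boxes drift from y=10 to y=26 across the page ends up split in two, even though no two neighbouring boxes are more than 4 apart.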
well, I don't know the code by heart so I can't find the solution for you, sorry. If you feel this is too complicated just go for the other option, I just think it's a waste of time.
please make your initial solution optional though, the reason why we developed the post-processing part is because there are serious issues in the original line information, especially for older Google OCR and I don't want to use that for things that go on BUDA
no problem, thanks! I'll have a look
@kaldan007 can you add a test with the image and the expected result given in the initial comment of this issue? it will be helpful to demonstrate how your change fixes it
@eroux this is the output I am getting after the update.
ah sorry I meant can you add the example as a test in the repo: https://github.com/OpenPecha/Toolkit/tree/master/tests/formatters/google_vision it will make it much easier for me to look at the PR
sure, will do that
@eroux I have included the page in the HOCR test case.
so, I've merged @kaldan007's PR, which basically removes the post-processing when it goes wrong. Ideally we should have a better post-processing that doesn't have problems with skewed lines so that we can remove the duplication and other errors from Google OCR. Kurt's take on it is:
Something like that would be ideal to implement but I don't think we have the skills / time for that yet in the organization so closing this until then.
for (lightly) skewed images like
https://iiif.bdrc.io/bdr:I1KG10195::I1KG101950044.jpg/full/max/0/default.jpg
the current line detection of the HOCR import gives an output of
on the following file:
00000044.zip
This is not ideal... perhaps the line break detection could be a bit more lenient?
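One way to picture a "more lenient" line break detection is to compare each box with its nearest neighbour to the left instead of with a fixed reference for the whole line, so that a slow drift across the page is tolerated. The sketch below is only an illustration under that assumption: `Box`, `group_following_skew`, and `ratio` are hypothetical names, it is not the Toolkit's algorithm, and it assumes left-to-right text with line spacing larger than the threshold.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class Box:
    x: float       # horizontal position of the box (hypothetical field)
    y: float       # vertical center of the box (hypothetical field)
    height: float


def group_following_skew(boxes: List[Box], ratio: float = 0.5) -> List[List[Box]]:
    """Group boxes into lines by following each line from left to right."""
    if not boxes:
        return []
    threshold = ratio * mean(b.height for b in boxes)
    lines: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b.x):
        best: Optional[List[Box]] = None
        for line in lines:
            # Distance to the rightmost box already placed on this line.
            dist = abs(box.y - line[-1].y)
            if dist <= threshold and (best is None or dist < abs(box.y - best[-1].y)):
                best = line
        if best is None:
            lines.append([box])  # no line is close enough: start a new one
        else:
            best.append(box)
    # Return the lines from top to bottom, judged by their leftmost box.
    return sorted(lines, key=lambda line: line[0].y)
```

With boxes of height 20 (threshold 10), two lines that each drift by 16 vertically across the page stay intact, because neighbouring boxes on a line are only 4 apart while the two lines are 30 apart.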