
fix(ocr): checking the abnormal post correction feature added #264

Merged 2 commits into master from fix-ocr on Dec 21, 2023


kaldan007
Contributor

We have noticed that our post-correction of character order is a bit too strict in some cases, which results in very unexpected output. This PR adds a function that checks for abnormal post-correction, plus a flag on the OCR formatter object that enables the check. If the flag is true and the function finds an abnormality in the post-corrected output, the formatter uses the original line and character order given by the Google OCR output; otherwise it uses the post-corrected order. On the Google Vision formatter the flag is false by default, so the post-corrected order is used unless the check is explicitly enabled. The function checking the post-correction is here.
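The fallback described above can be sketched roughly as follows. This is only an illustration, not the code added in this PR: the names `choose_line_order`, `is_postcorrection_abnormal`, and `check_postcorrection` are hypothetical, and the abnormality criterion used here (the corrected text gains or loses characters compared with the original) is an assumption made for the example.

```python
from collections import Counter
from typing import List


def is_postcorrection_abnormal(original: List[str], corrected: List[str]) -> bool:
    """Hypothetical check: reordering should never add or drop characters.

    The real check in the PR may use a different criterion.
    """
    return Counter("".join(original)) != Counter("".join(corrected))


def choose_line_order(
    original: List[str],
    corrected: List[str],
    check_postcorrection: bool = False,
) -> List[str]:
    """Pick which line/character order the formatter should use."""
    # Flag off (the default): always use the post-corrected order.
    if not check_postcorrection:
        return corrected
    # Flag on: fall back to the raw OCR order when the check trips.
    if is_postcorrection_abnormal(original, corrected):
        return original
    return corrected
```

With the flag left at its default, the behaviour is unchanged from before the PR; only callers that opt in get the fallback to the raw OCR order.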

Contributor

eroux left a comment

thanks! Can you describe a little bit what the change does and add a test demonstrating the problem?

Contributor

eroux commented Dec 21, 2023

Detecting skewed lines the right way is more effort than I can put in right now. Let's just merge this and import from GB; ideally, in the future, we should implement a proper line detection algorithm.

Contributor

eroux left a comment

ok for me

eroux merged commit e94d65d into master on Dec 21, 2023
3 checks passed
eroux deleted the fix-ocr branch on December 21, 2023, 07:39
ngawangtrinley
Contributor

The main issue is curved and wobbly lines. They are unfortunately very common in woodblock-printed material, either because the page moved to one side under the roller (https://www.youtube.com/watch?v=vow3YY9FnxY) or because it was scanned on a page-feed scanner without extra support for very long pages (something longer than the white support here: https://m.media-amazon.com/images/I/81KGnw1cd7L._AC_SX466_.jpg). I couldn't find a tool/script that does this out of the box, so I think we will need to train a specialized CV model just for this. Once the curve/wobble is straightened, splitting lines is easy.
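To illustrate why line splitting becomes simple once the page is straight, here is a minimal sketch. It assumes glyph bounding boxes as `(x_min, y_min, x_max, y_max)` tuples and a hypothetical `max_gap` threshold; neither the function name nor the threshold comes from the toolkit, and this is not the project's line detection.

```python
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def split_into_lines(boxes: List[Box], max_gap: float) -> List[List[Box]]:
    """Group boxes into lines by the vertical center of each box.

    A new line starts whenever the center jumps by more than max_gap
    from the previous box. This only works on straightened text.
    """
    lines: List[List[Box]] = []
    last_center = None
    for box in sorted(boxes, key=lambda b: (b[1] + b[3]) / 2):
        center = (box[1] + box[3]) / 2
        if last_center is None or center - last_center > max_gap:
            lines.append([])
        lines[-1].append(box)
        last_center = center
    # Within each line, order the boxes left to right.
    return [sorted(line, key=lambda b: b[0]) for line in lines]
```

On a curved or wobbly line the vertical centers drift along the line, so a fixed gap threshold like this either merges neighbouring lines or splits one line in two, which is why the straightening step has to come first.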
