Issue with Japanese and numbers (digits) #973
Comments
Thanks for reporting. I'll look into it later. Are you using the Docker images?
@eikek Sorry for the late response. Yes, I'm using Docker.
No worries! I'm off doing something serious for a week or two anyway :-). Thanks for the research and the test paper (this is really helpful!). It seems we need to download a different training set for Japanese. I hope this fixes the problem; otherwise, I'm quite lost.
They seem to work better as suggested here: tesseract-ocr/tessdata#119 Refs: #973
Hi @wallace11 I changed the joex docker image by adding the other training data. It is shortly available via the
@eikek The only issue is that it tends to insert unnecessary spaces (in Japanese you rarely need spaces). I'm guessing it's the best we can achieve right now, which is 95% there in terms of accuracy and usability. For me it's perfect. Thank you!
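For anyone bothered by the extra spaces, here is a minimal post-processing sketch in Python. It is not part of docspell or Tesseract; the function name `strip_inner_spaces` and the character ranges are assumptions chosen for illustration, and the ranges cover only hiragana, katakana, and the common kanji block.

```python
import re

# Hiragana + katakana (U+3040-U+30FF) and common kanji (U+4E00-U+9FFF).
# This range is an assumption; rarer characters are not covered.
_JA = r"\u3040-\u30ff\u4e00-\u9fff"

# Whitespace (ASCII space, tab, ideographic space) that has a Japanese
# character on both sides.
_SPACE_BETWEEN_JA = re.compile(rf"(?<=[{_JA}])[ \t\u3000]+(?=[{_JA}])")


def strip_inner_spaces(text: str) -> str:
    """Remove spaces between Japanese characters, leaving other spacing alone."""
    return _SPACE_BETWEEN_JA.sub("", text)


if __name__ == "__main__":
    print(strip_inner_spaces("映画 の チケット"))
```

Spaces next to Latin letters or digits are left untouched, so English text mixed into the document keeps its word boundaries.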
Hi @wallace11 thank you, great to hear!
Hi there,
Following up #948 and #962, I tested a couple of Japanese documents and the whole process went flawlessly.
To my surprise, the only problem I had is with numbers.
For some reason, Arabic numerals (ordinary digits) are converted to circled numbers.
Besides being incorrect, this messes up the date recognition because 2016年10月25日 is being recognized as ⑳①⑥ 年 ①0 月 ②⑤ 日 (weird, right?)
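As a stopgap until the OCR output itself is fixed, the circled characters could be mapped back to plain digits before date recognition. The following is a minimal Python sketch, not docspell's actual code: the names `normalize_circled_digits` and `find_dates` are hypothetical, and it assumes every circled number in the OCR text stands for the digits it displays (so ⑳ becomes "20").

```python
import re

# ① (U+2460) through ⑳ (U+2473) map to "1".."20"; ⓪ (U+24EA) maps to "0".
_CIRCLED = {0x2460 + i: str(i + 1) for i in range(20)}
_CIRCLED[0x24EA] = "0"

# Japanese date of the form YYYY年MM月DD日, tolerating the stray spaces OCR inserts.
_DATE_RE = re.compile(r"(\d{4})\s*年\s*(\d{1,2})\s*月\s*(\d{1,2})\s*日")


def normalize_circled_digits(text: str) -> str:
    """Replace circled numbers with their plain-digit equivalents."""
    return text.translate(_CIRCLED)


def find_dates(text: str) -> list[tuple[int, int, int]]:
    """Return (year, month, day) tuples found after normalizing the digits."""
    normalized = normalize_circled_digits(text)
    return [tuple(int(g) for g in m.groups()) for m in _DATE_RE.finditer(normalized)]


if __name__ == "__main__":
    print(find_dates("⑳①⑥ 年 ①0 月 ②⑤ 日"))
```

A targeted translation table is used here rather than full Unicode compatibility normalization, which would also rewrite unrelated characters such as full-width letters.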
I tried to find a solution and came across this issue in the tessdata repo, which explains the problem in more detail and has a potential solution (I'm not sure what they were talking about, exactly).
tesseract-ocr/tessdata#119
I wanted to find a "good" paper for sharing here for people to test on, but all the "good" documents I've got contain personal information, etc. so I just used the back of a movie ticket that I had lying around. You'll notice all the numbers at the bottom become circled after OCR.
z-20210731-145217.pdf
Thanks!