Numbers in Japanese use the wrong Unicode characters in output #3646

Closed
kuro68k opened this issue Nov 16, 2021 · 1 comment

@kuro68k

kuro68k commented Nov 16, 2021

Environment

  • Tesseract Version: v5.0.0-alpha.20210811
      tesseract v5.0.0-alpha.20210811
       leptonica-1.78.0
        libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
       Found AVX
       Found SSE4.1
       Found libarchive 3.5.0 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5 libzstd/1.4.5
       Found libcurl/7.77.0-DEV Schannel zlib/1.2.11 zstd/1.4.5 libidn2/2.0.4 nghttp2/1.31.0
  • Platform: Windows 8.1 x64

Current Behavior:

Sample image:
image

Digits are output as Unicode circled-number characters (a digit inside a circle), e.g.
リーズの歴史は、①⑨⑧⑤年① ①月に

Expected Behavior:

The output should use ordinary digits like this
リーズの歴史は、1985年11月に

or wide (full-width) characters like this
リーズの歴史は、１９８５年１１月に

Suggested Fix:

Output ordinary or wide digits instead of the circled ones. A switch to select normal or wide might be useful, as the right choice depends on what you intend to do with the text.
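
Until the output itself changes, the text could be post-processed as a workaround. The following is only a minimal sketch in Python, not part of Tesseract: the function name `fix_circled_digits` and its `wide` parameter are hypothetical, and it assumes the only wrong characters are the circled digits ⓪ (U+24EA) and ① to ⑨ (U+2460 to U+2468).

```python
# Hypothetical post-processing helper; nothing here is a Tesseract API.
CIRCLED_ZERO = 0x24EA  # ⓪
CIRCLED_ONE = 0x2460   # ① .. ⑨ occupy U+2460 .. U+2468


def _digit_table(wide):
    """Build a str.translate table from circled digits to plain digits."""
    # Full-width digits start at U+FF10; ASCII digits start at "0".
    base = 0xFF10 if wide else ord("0")
    table = {CIRCLED_ZERO: chr(base)}
    for d in range(1, 10):
        table[CIRCLED_ONE + d - 1] = chr(base + d)
    return table


def fix_circled_digits(text, wide=False):
    """Replace circled digits with ASCII digits, or full-width if wide=True."""
    return text.translate(_digit_table(wide))


if __name__ == "__main__":
    sample = "リーズの歴史は、①⑨⑧⑤年① ①月に"
    print(fix_circled_digits(sample))
    print(fix_circled_digits(sample, wide=True))
```

The `wide` flag mirrors the switch suggested above. The sketch leaves everything else untouched, including the stray space between the two digits of the month, which would need separate handling.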

@amitdo
Collaborator

amitdo commented Nov 17, 2021

This issue has already been reported here:
tesseract-ocr/tessdata#119
