Numbers in Japanese use the wrong Unicode characters in output #3646

kuro68k · 2021-11-16T14:37:38Z

Environment

Tesseract Version: v5.0.0-alpha.20210811

tesseract v5.0.0-alpha.20210811
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX
 Found SSE4.1
 Found libarchive 3.5.0 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5 libzstd/1.4.5
 Found libcurl/7.77.0-DEV Schannel zlib/1.2.11 zstd/1.4.5 libidn2/2.0.4 nghttp2/1.31.0

Platform: Windows 8.1 x64

Current Behavior:

Sample image:

Digits are output as Unicode circled-number characters, e.g.
リーズの歴史は、①⑨⑧⑤年① ①月に

Expected Behavior:

The output should use ASCII digits, like this:
リーズの歴史は、1985年11月に

or fullwidth digits, like this:
リーズの歴史は、１９８５年１１月に

Suggested Fix:

Output ASCII or fullwidth digits instead of circled ones. A switch to select normal or wide digits might be useful, as the right choice depends on what you intend to do with the text.
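
Until the output itself changes, the circled digits can be mapped back in a post-processing step. The following is only a minimal sketch of such a workaround, not a Tesseract feature: the function name `normalize_circled_digits` and its `wide` parameter are hypothetical, and the mapping assumes the affected characters are the Unicode circled numbers ⓪ (U+24EA) and ① to ⑳ (U+2460 to U+2473).

```python
# Hypothetical post-processing helper; NOT part of Tesseract.
# Assumption: the OCR output contains circled numbers U+24EA (⓪) and
# U+2460..U+2473 (① to ⑳) where plain digits were expected.


def _to_fullwidth(digits: str) -> str:
    """Convert a string of ASCII digits to fullwidth digits (U+FF10..U+FF19)."""
    return "".join(chr(0xFF10 + int(d)) for d in digits)


# Translation table: circled number code point -> ASCII digit string.
CIRCLED_TO_ASCII = {0x24EA: "0"}
CIRCLED_TO_ASCII.update({0x2460 + i: str(i + 1) for i in range(20)})

# Translation table: circled number code point -> fullwidth digit string.
CIRCLED_TO_WIDE = {cp: _to_fullwidth(s) for cp, s in CIRCLED_TO_ASCII.items()}


def normalize_circled_digits(text: str, wide: bool = False) -> str:
    """Replace circled numbers with ASCII digits, or fullwidth digits if wide=True.

    All other characters, including any digits already present, are left unchanged.
    """
    table = CIRCLED_TO_WIDE if wide else CIRCLED_TO_ASCII
    return text.translate(table)


if __name__ == "__main__":
    sample = "リーズの歴史は、①⑨⑧⑤年"
    print(normalize_circled_digits(sample))
    print(normalize_circled_digits(sample, wide=True))
```

The `wide` flag mirrors the normal/wide switch suggested above. This sketch only substitutes characters; it does not remove stray spaces such as the one between the two circled ones in the sample output.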

amitdo · 2021-11-17T04:45:15Z

This issue has already been reported here:
tesseract-ocr/tessdata#119

stweil added the duplicate label Nov 17, 2021

kuro68k closed this as completed Nov 17, 2021
