Unicode-based split of words and graphemes #4

ichianr · 2022-06-09T08:33:42Z

What:

Why:

The original regex [\w\\']+ only matched Latin letters, so non-Latin input was not processed at all.
&str.len() returns the length in bytes, which is not always the same as the number of Unicode graphemes, so indices in style_substr were calculated incorrectly for multibyte characters.
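To illustrate the byte-length problem, here is a minimal std-only sketch. The helper names (byte_len, char_len, prefix_chars) are hypothetical and are not taken from fsrx. It counts Unicode scalar values with chars(), which is only an approximation of graphemes; true grapheme clusters need a segmentation library such as the one used in this PR.

```rust
/// Length in bytes, which is what `str::len()` reports.
fn byte_len(s: &str) -> usize {
    s.len()
}

/// Number of Unicode scalar values. This is a std-only approximation:
/// a single grapheme can still consist of several scalar values.
fn char_len(s: &str) -> usize {
    s.chars().count()
}

/// First `n` characters of `s`, cut on a character boundary so the
/// slice never lands in the middle of a multibyte character.
fn prefix_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        // `byte_idx` is the byte offset where character number `n` starts.
        Some((byte_idx, _)) => &s[..byte_idx],
        // The string has `n` or fewer characters: return all of it.
        None => s,
    }
}
```

For "cœur", byte_len is 5 while char_len is 4, because œ takes two bytes in UTF-8. Slicing with &s[..2] would panic, since byte 2 falls inside œ, whereas prefix_chars("cœur", 2) yields "cœ".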

How:

Changed the word splitting from the regex to unicode_word_indices.
Changed the character indexing from &str byte slicing to UnicodeSegmentation::graphemes.
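The idea of splitting into (byte offset, word) pairs can be sketched with the standard library alone. This is a simplified stand-in, not the PR's code: the PR relies on unicode_word_indices from the unicode-segmentation crate, while the hypothetical word_indices below just treats any run of alphanumeric characters or apostrophes as a word and does not implement the full Unicode word-boundary rules.

```rust
/// Return (byte_offset, word) pairs for `text`.
/// Assumption: a "word" is a run of alphanumeric characters or
/// apostrophes. This is a rough approximation, not UAX #29.
fn word_indices(text: &str) -> Vec<(usize, &str)> {
    let mut words = Vec::new();
    // Byte offset where the current word started, if we are inside one.
    let mut start: Option<usize> = None;

    for (i, c) in text.char_indices() {
        let is_word_char = c.is_alphanumeric() || c == '\'';
        match (is_word_char, start) {
            // Entering a word: remember where it begins.
            (true, None) => start = Some(i),
            // Leaving a word: emit the slice up to the current offset.
            (false, Some(s)) => {
                words.push((s, &text[s..i]));
                start = None;
            }
            // Still inside a word, or still between words.
            _ => {}
        }
    }

    // Emit a word that runs to the end of the input.
    if let Some(s) = start {
        words.push((s, &text[s..]));
    }
    words
}
```

Because the offsets come from char_indices, they always fall on character boundaries, so the slices are safe for accented Latin and Korean input alike. Each returned word can then be indexed per character (or per grapheme, with a segmentation library) rather than per byte.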

Tests:

English
echo 'The Quick Brown Fox Jumps Over The Lazy Dog' | fsrx
French
echo "Le cœur déçu mais l'âme plutôt naïve, Louÿs rêva de crapaüter en canoë au-delà des îles, près du mälström où brûlent les novæ" | fsrx
Korean
echo '키스의 고유 조건은 입술끼리 만나야 하고 특별한 기술은 필요치 않다.' | fsrx

Checklist:

Caveat:

I confirmed that languages that put spaces between words are processed much like English. However, fsrx's algorithm (and Bionic Reading) does not seem applicable to some Asian languages that do not use spaces, such as Japanese and Chinese.
Some terminal emulators (e.g., Alacritty, if I remember correctly) may not properly support Unicode input / output. I tested my code with xfce4-terminal.
For non-Latin alphabets, I tested my code with the D2Coding font, but other fonts such as Noto should work as well.

jrnxf · 2022-06-09T17:30:07Z

damn @ichianr this looks amazing. I don't have time rn to look it over but will tonight!

Unicode-based split of words and graphemes

8395a72
