Unicode-based split of words and graphemes #4
What:
Why:
`[\w\\']+` only matches Latin letters, so non-Latin input was not processed at all.
`&str.len()` returns the length in bytes, which is not always the same as the number of Unicode graphemes, so the indices in `style_substr` were calculated incorrectly for multibyte characters.
How:
Words are split with `unicode_word_indices`.
The `&str` slice is converted to `UnicodeSegmentation::graphemes`.
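To illustrate the index problem described above, here is a minimal sketch that uses only the Rust standard library. It is not the code from this pull request: `byte_offset_of_char` and `split_prefix` are hypothetical helpers, and `chars()` counts Unicode scalar values, which only approximates the grapheme clusters that `UnicodeSegmentation::graphemes` yields.

```rust
/// Hypothetical helper: byte offset at which the first `n` characters of `s` end.
/// Falls back to `s.len()` when `s` has fewer than `n` characters.
fn byte_offset_of_char(s: &str, n: usize) -> usize {
    s.char_indices().nth(n).map(|(i, _)| i).unwrap_or(s.len())
}

/// Hypothetical helper: split `word` after its first `n` characters,
/// always cutting on a valid UTF-8 boundary.
fn split_prefix(word: &str, n: usize) -> (&str, &str) {
    word.split_at(byte_offset_of_char(word, n))
}

fn main() {
    let word = "cœur";

    // `len()` counts bytes; `chars().count()` counts Unicode scalar values.
    println!("bytes = {}, chars = {}", word.len(), word.chars().count());

    // Byte index 2 falls inside the two-byte encoding of 'œ', so slicing
    // there with `&word[..2]` would panic.
    println!("is_char_boundary(2) = {}", word.is_char_boundary(2));

    // Converting a character count to a byte offset first keeps the cut valid.
    let (head, tail) = split_prefix(word, 2);
    println!("{}|{}", head, tail);
}
```

A character count and a byte offset are different units, so any index derived from `len()` has to be converted before it is used to slice a `&str`. Segmenting by grapheme goes one step further and also keeps combining sequences together.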
Tests:
echo 'The Quick Brown Fox Jumps Over The Lazy Dog' | fsrx
echo "Le cœur déçu mais l'âme plutôt naïve, Louÿs rêva de crapaüter en canoë au-delà des îles, près du mälström où brûlent les novæ" | fsrx
echo '키스의 고유 조건은 입술끼리 만나야 하고 특별한 기술은 필요치 않다.' | fsrx
Checklist:
`Allow edits from maintainers` option checked
`[your_username]/` (ex. `coloradocolby/featureX`)
Caveat:
`fsrx`'s algorithm (and Bionic Reading) is not applicable to some Asian languages that do not use spaces, such as Japanese and Chinese.
`xfce4-terminal`. `D2Coding` font, but other fonts such as Noto will work as well.