Unicode-based split of words and graphemes #4

Open: 1 commit to merge into base branch main

@ichianr ichianr commented Jun 9, 2022

What:

  • Input written in non-Latin scripts is now processed correctly.

Why:

  • The original regex [\w\\']+ only matched Latin letters, so non-Latin input was not processed at all.
  • str::len() returns the length in bytes, which is not always the number of Unicode graphemes, so the indices in style_substr were calculated incorrectly for multibyte characters.
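
The length mismatch described above can be reproduced with the standard library alone. This is a minimal sketch, not code from the PR: `byte_len` and `char_len` are hypothetical helper names, and `chars()` counts Unicode scalar values, which is only an approximation of the grapheme clusters the PR works with.

```rust
/// Hypothetical helper: length in bytes, as reported by `str::len`.
fn byte_len(s: &str) -> usize {
    s.len()
}

/// Hypothetical helper: number of Unicode scalar values (`char`s).
/// Note: this is not the grapheme count; a grapheme may span several chars.
fn char_len(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // Words taken from the test inputs in this PR description.
    for word in ["Quick", "cœur", "키스의"] {
        println!("{}: {} bytes, {} chars", word, byte_len(word), char_len(word));
    }

    // Using a character count as a byte offset can land inside a character.
    let word = "키스의";
    assert!(!word.is_char_boundary(1));
    // `get` returns None here; `&word[0..1]` would panic instead.
    assert_eq!(word.get(0..1), None);
}
```

For ASCII-only words the two lengths agree, which is presumably why the problem did not show up with English input.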

How:

  • Changed the word-split algorithm from the regex to unicode_word_indices.
  • Changed character indexing from &str byte slicing to UnicodeSegmentation::graphemes.
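
As a rough illustration of the idea behind these two changes, the sketch below splits text into words and emphasizes a prefix of each word by counting characters instead of bytes. It is not the code in this PR: to stay dependency-free it substitutes `split_whitespace` for `unicode_word_indices` and `char_indices` for `graphemes`, so punctuation stays attached to words and combining sequences are not kept together. The helper names and the "first half, rounded up" rule are assumptions made for the example, not fsrx's actual behaviour.

```rust
/// Hypothetical helper: byte offset just past the first `n` chars of `word`.
/// Returns `word.len()` when the word has `n` chars or fewer.
fn prefix_end(word: &str, n: usize) -> usize {
    word.char_indices().nth(n).map_or(word.len(), |(i, _)| i)
}

/// Hypothetical helper: wrap the first half (rounded up) of every
/// whitespace-separated word in ANSI bold escape codes.
/// Assumption: the half-word ratio is arbitrary, chosen only for illustration.
fn emphasize(text: &str) -> String {
    text.split_whitespace()
        .map(|word| {
            let n = (word.chars().count() + 1) / 2;
            // `prefix_end` always returns a char boundary, so this cannot panic.
            let (head, tail) = word.split_at(prefix_end(word, n));
            format!("\x1b[1m{}\x1b[0m{}", head, tail)
        })
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    println!("{}", emphasize("The Quick Brown Fox"));
    println!("{}", emphasize("키스의 고유 조건은"));
}
```

The key design point is that every slice offset comes from an iterator over the string itself, so it is always a valid boundary, whereas an offset computed from a character count may not be.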

Tests:

  • English
    echo 'The Quick Brown Fox Jumps Over The Lazy Dog' | fsrx
  • French
    echo "Le cœur déçu mais l'âme plutôt naïve, Louÿs rêva de crapaüter en canoë au-delà des îles, près du mälström où brûlent les novæ" | fsrx
  • Korean
    echo '키스의 고유 조건은 입술끼리 만나야 하고 특별한 기술은 필요치 않다.' | fsrx

Checklist:

  • Allow edits from maintainers option checked
  • Branch name is prefixed with [your_username]/ (ex. coloradocolby/featureX)
  • Documentation added
  • Tests added
  • No failing actions
  • Merge ready

Caveat:

  • I confirmed that languages that put spaces between words are processed much like English. However, fsrx's algorithm (and Bionic Reading) does not seem to be applicable to some Asian languages that do not use spaces, such as Japanese and Chinese.
  • Some terminal emulators (e.g., Alacritty, if I remember correctly) may not properly support Unicode input / output. I tested my code with xfce4-terminal.
  • For non-Latin scripts, I tested my code with the D2Coding font, but other fonts such as Noto should work as well.

@jrnxf jrnxf (Owner) commented Jun 9, 2022

damn @ichianr this looks amazing. I don't have time rn to look it over but will tonight!
