scspell splits word tokens with diacritics inside words #35

Open
robotdana opened this issue Nov 7, 2018 · 1 comment · May be fixed by #37

robotdana commented Nov 7, 2018

E.g. `händler` is tokenized as `ändler` in Python 3.7 and as `ndler` in Python 2.7.

The same problem affects words with other diacritics.

robotdana added a commit to robotdana/scspell that referenced this issue Nov 30, 2018
Previously: `Händler` would be tokenized as `ndler` or `ändler`, depending on the Python version,
rather than the expected `händler`.

Solution: use `regexp` rather than `re`.
This gives us the ability to use Unicode character classes such as `[[:upper:]]` and `[[:lower:]]`.

Fixes myint#35
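
To illustrate why Unicode-aware character classes matter for this bug, here is a minimal sketch. It uses only the standard-library `re` module on Python 3, not the `regexp` approach from the commit, and the pattern names `ASCII_WORD` and `UNICODE_WORD` are hypothetical. scspell's actual token patterns are not shown in this thread, so the exact fragments produced here differ from the ones reported above.

```python
import re

# Hypothetical patterns for illustration only; not scspell's real token regex.
ASCII_WORD = re.compile(r"[A-Za-z]+")     # ASCII letters only
UNICODE_WORD = re.compile(r"[^\W\d_]+")   # word characters minus digits and underscore

word = "H\u00e4ndler"  # "Händler" with a precomposed ä

# The ASCII-only class stops at ä, so the word falls apart into fragments.
ascii_tokens = ASCII_WORD.findall(word)

# The Unicode-aware class treats ä as part of the word.
unicode_tokens = UNICODE_WORD.findall(word)

print(ascii_tokens)
print(unicode_tokens)
```

The second pattern relies on `\w` being Unicode-aware for `str` patterns in Python 3; it is a rough stand-in for the POSIX-style classes mentioned in the commit, which the standard `re` module does not support.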
robotdana added a commit to robotdana/scspell that referenced this issue Nov 30, 2018
Previously: `Händler` would be tokenized as `ndler` or `ändler`, depending on the Python version,
rather than the expected `händler`.

Solution: use `regexp` rather than `re`.
This gives us the ability to use Unicode character classes such as `[[:upper:]]` and `[[:lower:]]`.

`unicodedata.normalize` is there because Travis was behaving differently than my Mac.

Fixes myint#35
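
The commit does not say which normalization form it uses or why the two environments differed. As an assumption, one way such a difference can arise is that the same visible word may be stored either with a precomposed `ä` or as `a` followed by a combining diaeresis. The sketch below, using the standard-library `re` and `unicodedata` modules with a hypothetical `UNICODE_WORD` pattern, shows how normalizing to NFC before matching makes both spellings tokenize the same way.

```python
import re
import unicodedata

# Hypothetical pattern for illustration only; not scspell's real token regex.
UNICODE_WORD = re.compile(r"[^\W\d_]+")

composed = "H\u00e4ndler"     # ä as a single code point (NFC form)
decomposed = "Ha\u0308ndler"  # a followed by a combining diaeresis (NFD form)

# Without normalization, the combining mark is not a word character,
# so the decomposed spelling is split around it.
raw_tokens = UNICODE_WORD.findall(decomposed)

# After NFC normalization, the decomposed spelling becomes the composed one.
normalized = unicodedata.normalize("NFC", decomposed)
normalized_tokens = UNICODE_WORD.findall(normalized)

print(raw_tokens)
print(normalized_tokens)
```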
@kkmuffme

+1
