Armenian letters should be lowercased #328
Hello @NarHakobyan, your PR fits the need, but it is not passing the CI. I know it's not completely related to your work, but could you fix the errors? clippy:
Rust FMT: Diff in /home/runner/work/charabia/charabia/charabia/src/normalizer/lowercase.rs:27:
fn should_normalize(&self, token: &Token) -> bool {
// https://en.wikipedia.org/wiki/Letter_case#Capitalisation
- matches!(token.script, Script::Latin | Script::Cyrillic | Script::Greek | Script::Georgian | Script::Armenian)
- && token.lemma.chars().any(char::is_uppercase)
+ matches!(
+ token.script,
+ Script::Latin | Script::Cyrillic | Script::Greek | Script::Georgian | Script::Armenian
+ ) && token.lemma.chars().any(char::is_uppercase)
}
}
Thank you!
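For readers without the crate at hand, the check in the diff above can be sketched in isolation. This is only a rough illustration: `Script`, `Token`, and `normalize` below are simplified, hypothetical stand-ins, not charabia's real types or API. Only the `matches!` condition and the `char::is_uppercase` test are taken from the diff.

```rust
// Hypothetical, simplified stand-ins for illustration; not charabia's real types.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Script {
    Latin,
    Cyrillic,
    Greek,
    Georgian,
    Armenian,
    Other,
}

struct Token {
    lemma: String,
    script: Script,
}

// Same condition as in the diff: a bicameral script AND at least one uppercase char.
fn should_normalize(token: &Token) -> bool {
    matches!(
        token.script,
        Script::Latin | Script::Cyrillic | Script::Greek | Script::Georgian | Script::Armenian
    ) && token.lemma.chars().any(char::is_uppercase)
}

// Assumption: normalization is plain Unicode lowercasing via std's `to_lowercase`.
fn normalize(mut token: Token) -> Token {
    if should_normalize(&token) {
        token.lemma = token.lemma.to_lowercase();
    }
    token
}
```

With this sketch, a token whose script is `Script::Armenian` and whose lemma starts with a capital letter is lowercased, while a token tagged `Script::Other` is returned unchanged.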
Hi @ManyTheFish, done! I don't know why, but RustRover didn't show any error on these lines.
Hey @NarHakobyan, charabia/charabia/src/normalizer/lowercase.rs Lines 45 to 98 in d929c01
You just have to add a source token to the tokens() list, then fill normalizer_result() and normalized_tokens() with the expected output.
@ManyTheFish to be honest, I don't know how to do that :D Here is an example text that can be used:
Add a token containing Armenian capital letters in the tokens() function:
fn tokens() -> Vec<Token<'static>> {
vec![Token {
lemma: Owned("PascalCase".to_string()),
char_end: 10,
byte_end: 10,
script: Script::Latin,
..Default::default()
- }]
+ },
+ Token {
+ lemma: Owned("ֆիզիկոսը".to_string()),
+ char_end: 8,
+ byte_end: 16,
+ script: Script::Armenian,
+ ..Default::default()
+ }]
}
Then run the tests: And fix the outputs in the
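The `char_end` and `byte_end` values in the diff above differ for the Armenian token because each Armenian letter takes two bytes in UTF-8. A minimal sketch of how such values can be derived, assuming they are the end offsets of a lemma that starts at position 0 (the helper name `offsets` is hypothetical):

```rust
/// Returns (char_end, byte_end) for a lemma assumed to start at offset 0.
/// `chars().count()` counts Unicode scalar values; `len()` counts UTF-8 bytes.
fn offsets(lemma: &str) -> (usize, usize) {
    (lemma.chars().count(), lemma.len())
}
```

For "PascalCase" both values are 10, since every character is ASCII. For the 8-letter Armenian word in the diff the byte offset is 16.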
@ManyTheFish could you please run the tests?
charabia/src/normalizer/lowercase.rs
..Default::default()
},
Token {
lemma: Owned("ֆիզիկոսը".to_string()),
Hey @NarHakobyan,
sorry if I wasn't clear enough. I don't know Armenian at all, so I can't tell whether this string contains a capital letter.
However, there is no difference between the original version and the normalized one. Is it a bug, or is it because the Armenian text doesn't contain any capital letters?
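One way to answer this without reading Armenian is to ask the standard library. The sketch below uses only `char::is_uppercase` and `str::to_lowercase` from Rust's std; the helper names are hypothetical. The capital letter Ֆ is written as the escape `\u{0556}` in the usage note so it cannot be confused with the small letter ֆ (U+0586).

```rust
/// True if any character of `text` is uppercase according to Unicode.
fn has_uppercase(text: &str) -> bool {
    text.chars().any(char::is_uppercase)
}

/// True if lowercasing `text` changes it, i.e. normalization has a visible effect.
fn changes_when_lowercased(text: &str) -> bool {
    text.to_lowercase() != text
}
```

Called on the string from the test, `has_uppercase("ֆիզիկոսը")` is false, so an unchanged normalized output is expected rather than a bug. Called on `"\u{0556}իզիկոսը"`, which begins with the capital Ֆ, both helpers return true, which makes it a more useful test input.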
@ManyTheFish is it possible to include these changes in the next version?
Fixes #325