-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhance Kanji Recognition in Japanese Language Detection #381
base: main
Are you sure you want to change the base?
Conversation
Updated the character set initialization to include "Japanese_Han" in constant.rs. Also added the corresponding JHAN constant in script.rs to support the new character set.
Updated the character set initialization to include "Japanese_Han" in constant.rs. Also added the corresponding JHAN constant in script.rs to support the new character set.
Increased CJK language uncertainty max ratio from 0.4 to 0.6. Added increment call for Chinese when both Chinese and Japanese are present. Removed premature return for Japanese and added a new decrement_counter method for future use.
Eliminated unnecessary `FromStr` import from `language.rs` and `Itertools` import from `model.rs`. These imports were not being used, thus removing them improves code cleanliness and reduces potential confusion.
The CJK_lang_uncertainty variables are renamed to cjk_lang_uncertainty for consistency in naming conventions. Additionally, adjusted cjk_lang_uncertainty_max_ratio for improved accuracy and removed the unused decrement_counter function from the code.
Revision updates show decreased accuracy figures for Chinese language detection across both high and low accuracy reports. The aggregated accuracy values were also modified accordingly to reflect the updated results.
Improved high and low accuracy statistics in Chinese accuracy reports. Corrected mappings for 'い' to 'あ', among other script character updates in `script.rs`.
Refactor code to correctly increment language counters and update logic for handling uncertainty between Chinese and Japanese languages. Adjust accuracy reports to reflect updated detection accuracy metrics.
It also fixes these issues in the python release |
Hi Michael, thank you very much for this PR. Finally there is someone who knows Chinese and Japanese well enough to help me distinguish them better. Awesome. :) As I'm planning to make a new release of my library in October, I will evaluate your changes soon and most likely merge them, eventually. Great work! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add unit tests for the test cases you have listed in your PR.
&& total_language_counts.contains_key(&Some(Language::Chinese)) | ||
&& total_language_counts.contains_key(&Some(Language::Japanese)) | ||
&& (cjk_lang_uncertainty as f32 / words.len() as f32) >= cjk_lang_uncertainty_max_ratio | ||
&& self.is_low_accuracy_mode_enabled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this rule applied in low accuracy mode only? The rule engine should operate independently of the selected accuracy mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's due to to the fact that on low accuracy mode, the lingua-rs doesn't use the n-gram model after running detect_language_with_rules (idk if that's right). Regardless, by adding this case in, a lot more Chinese words get recognized as Chinese in low accuracy. Otherwise, they would be misidentified as unknown. If you want I can move this logic to compute_language_confidence_values_for_languages .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do you want me to add the unit tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just add a new unit test method in file detector.rs
. I think it's best to use a parameterized test method. Just take a look at the other test methods and do it analogously.
src/detector.rs
Outdated
@@ -896,12 +905,28 @@ impl LanguageDetector { | |||
if total_language_counts.len() == 2 | |||
&& cfg!(feature = "chinese") | |||
&& cfg!(feature = "japanese") | |||
&& total_language_counts.contains_key(&Some(Language::from_str("Chinese").unwrap())) | |||
&& total_language_counts.contains_key(&Some(Language::from_str("Japanese").unwrap())) | |||
&& total_language_counts.contains_key(&Some(Language::Chinese)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace Language::Chinese
with Language::from_str("Chinese")
as it was before. The same goes for Japanese. Otherwise, the code won't compile if Chinese and / or Japanese are not among the selected language dependencies.
Replace direct enum references with `Language::from_str`. This change standardizes how language enums are handled, improving code readability and consistency, especially when additional languages are added.
It doesn't have Chinese-derived readings. (Similar as 辻, 畑, and 榊) |
@pemistahl hello. The JVM-based Lingua library can also benefit from the Japanese-language enhancements introduced in the current PR. I can help out with adding these changes in Kotlin. Please let me know your thoughts |
@michaelbennieUFL and @tats-u , You have the list as a CSV in the following repo that you can download and parse out: https://github.com/sph-mn/nihongo/blob/master/data/jouyou-kanji.csv |
Many of them are used also in China (including Taiwan and Hong Kong). https://github.com/orgs/meilisearch/discussions/532#discussioncomment-3705382 Examples of exceptions: 畑: created in Japan (not used in Chinese) |
Related Issues
Solution
This pull request improves the language detection between Japanese and Chinese, specifically when dealing with Kanji characters that are common in Japanese texts. By introducing a new script called
Japanese_Han
(abbreviated asJHAN
), which includes Kanji characters commonly used in Japan, the detection logic is adjusted to better distinguish Japanese from Chinese when only Kanji characters are present. While there is a slight decrease in accuracy for single and dual-character inputs in the high-accuracy model, the overall sentence-level accuracy remains at 100%. This enhancement allows for accurate detection of Japanese sentences that use only Kanji.Context
In previous versions, the language detector struggled to differentiate between Japanese and Chinese texts that contained only Kanji characters. This issue arose because both languages share many Han characters (Kanji in Japanese), leading to incorrect classification of Japanese Kanji-only texts as Chinese with high confidence. For example, words like "経済" (economy), "労働" (labor), and "勉強中" (studying) are uniquely Japanese but were being detected as Chinese.
Technical Approach
1. Introducing
Japanese_Han
ScriptJapanese_Han
(JHAN
) that contains Kanji characters commonly used in Japanese.JHAN
script is added to thesrc/script.rs
file with ranges of Kanji characters specific to Japanese usage.2. Adjusting Detection Logic
Updated the
JAPANESE_CHARACTER_SET
insrc/constant.rs
to use the newJapanese_Han
script:Modified the
detect_language_with_rules
method insrc/detector.rs
:JAPANESE_CHARACTER_SET
andAlphabet::Han
.JAPANESE_CHARACTER_SET
, it increments the count for Japanese.Alphabet::Han
, it increments the count for Chinese.3. Handling Uncertainty
cjk_lang_uncertainty
to track cases where it's challenging to distinguish between Chinese and Japanese.is_low_accuracy_mode_enabled
is true.Limits
4. Test Cases and Accuracy Reports