Add traditional vs simplified chinese support #16

Open
asg0451 opened this issue Jun 30, 2021 · 5 comments
Labels
new language A new language the LanguageDetector can recognize

Comments

@asg0451

asg0451 commented Jun 30, 2021

Hi! I'm considering using Lingua in a project. It compares favourably to CLD2 and Whatlang on our dataset (social media posts), but one of our requirements is to distinguish between Traditional and Simplified Chinese, which Lingua does not support.

Are there any plans to support this? Our requirement for Chinese support probably won't be crucial until later in the year, so if support is in development that would go a long way.

Thanks!

@pemistahl
Owner

Hello Miles, thanks for your request.

When I first added support for Chinese, I could not find suitable training corpora consisting only of Traditional or only of Simplified Chinese. That's why Lingua cannot differentiate between them yet. Do you perhaps know of a good source of training material? I can start searching again myself as well. If that succeeds, adding support for Traditional and Simplified Chinese won't be difficult.

@pemistahl pemistahl added the new language A new language the LanguageDetector can recognize label Jul 1, 2021
@asg0451
Author

asg0451 commented Jul 1, 2021

Thanks for the reply!

No, I don't know of any specific training corpora, but my understanding is that Traditional and Simplified Chinese typically use different character sets, so it may be possible to distinguish them without an ML model. In fact, perhaps I could use such an approach myself...

Possible building blocks include https://github.com/magiclen/opencc-rust and the Unicode variant properties (https://www.unicode.org/reports/tr38/#kTraditionalVariant).

@pemistahl
Owner

OpenCC is supported on Linux only, which makes it infeasible for my library. Also, keep in mind that Chinese texts can contain foreign-language material. So I don't think we can avoid creating ML models. I will try again to find good training data; perhaps I will be luckier this time.

@asg0451
Author

asg0451 commented Jul 7, 2021

Awesome! In the meantime, I'm using some heuristics based on lookup tables of variant-specific characters.
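
A lookup-table heuristic of this kind might look roughly like the sketch below. It is not the code used in this thread: the function name `classify_chinese_script` and the two tiny character tables are hypothetical and hand-picked for illustration, and a usable table would need to be much larger (for example, derived from the Unihan variant data linked above).

```python
# Hypothetical, hand-picked tables for illustration only.
# Each simplified character below pairs with the traditional one at the same index.
SIMPLIFIED_ONLY = set("们这说时会国对学发经")
TRADITIONAL_ONLY = set("們這說時會國對學發經")


def classify_chinese_script(text: str) -> str:
    """Guess the script by counting variant-specific characters.

    Returns "simplified", "traditional", or "unknown" when the counts tie
    (including the case where no table character occurs at all).
    """
    simplified = sum(ch in SIMPLIFIED_ONLY for ch in text)
    traditional = sum(ch in TRADITIONAL_ONLY for ch in text)
    if simplified == traditional:
        return "unknown"
    return "simplified" if simplified > traditional else "traditional"


print(classify_chinese_script("我们这里说中文"))
print(classify_chinese_script("我們這裡說中文"))
```

Characters outside the tables, including foreign-language material, are simply ignored, so short or mixed texts can easily come back as "unknown".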

@ok3721

ok3721 commented Jul 12, 2024

Sorry to reply to an old issue, but I was wondering whether the OpenCC dictionary would be usable as training data. Its STPhrases.txt contains ~50K Chinese phrases as Simplified-Traditional pairs.
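
As a rough sketch of how such pairs could be split into two labelled phrase lists, the snippet below assumes each dictionary line has the form `simplified<TAB>traditional [more traditional variants]`. That layout is an assumption to check against the actual file, the helper name `split_pairs` is hypothetical, and the sample lines are made up for illustration rather than copied from STPhrases.txt. Whether isolated phrases are good training data for running text is a separate question this does not answer.

```python
def split_pairs(lines):
    """Split OpenCC-style dictionary lines into parallel phrase lists.

    Assumed line format: "<simplified>\t<traditional> [<traditional> ...]".
    Pairs whose two forms are identical are skipped, since they carry no
    signal for telling the scripts apart.
    """
    simplified, traditional = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, values = line.partition("\t")
        if not sep:
            continue  # skip lines without a tab separator
        for variant in values.split():
            if variant != key:
                simplified.append(key)
                traditional.append(variant)
    return simplified, traditional


# Made-up sample input in the assumed format.
sample = ["一丝不挂\t一絲不掛", "发\t發 髮", "", "中文\t中文"]
simp, trad = split_pairs(sample)
print(list(zip(simp, trad)))
```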
