add lexical model for chechen latin #296
add lexical model for chechen latin
move to release & some fixes
Thank you for your pull request. You'll see a "build failed" message until the Keyman team has reviewed the pull request and manually initiated the build process. Every change committed to this branch will become part of this pull request. When you have finished submitting files and are ready for the Keyman team to review this pull request, please post a "Ready for review" comment.
Hi @DavidLRowe I hope you’re doing well! I just wanted to check if there are any updates regarding this pull request?
@gushmazuko Thanks for the reminder. PR is approved and merged. Should be available shortly. There are some entries in the word list with
Hello @DavidLRowe Thank you for merging the PR! Regarding the entries with This limitation arises because [REPLACE] is used in words with a nasal "Н" (Cyrillic Letter En) at the end (transliterated as ŋ (Latin Letter Eng) in the Latin alphabet) when it’s unclear whether to include it. In the Chechen language, there are words that are identical in Cyrillic but have different meanings, which are distinguished in Latin. For example:
We would like your opinion or advice on the best way to handle such cases in the lexical model, which is built from our Cyrillic corpus. Personally, I see a potential solution: split the frequency of the word 'шун' in the Keyman lexical model’s frequency dictionary between its two transliterations, 'şun' and 'şuŋ'. Would you have any recommendations on how best to implement this? Looking forward to your feedback! Best regards
I'm not an expert, but I think your approach of creating two Latin script entries for each ambiguous Cyrillic script entry makes sense. I wouldn't worry too much about getting the frequency split exactly right. For example, suppose шун had a frequency count of 100. I'm guessing 'your' is more common than 'table', so you could use 70 for şuŋ and 30 for şun. But I expect that using 80 and 20 wouldn't make much difference (depending on how many other words start with "şu"). You may want to think about how to replicate the process (or at least document your choices), so that if you revise the lexical model at a later date with new files in your corpus, you'll have a record of the split you previously used for şuŋ / şun (for example, 80% / 20%). It's exciting to see this in use! You can also ask questions on the SIL Language Software Community: https://community.software.sil.org/c/keyman/ or raise an issue on this repository: https://github.com/keymanapp/lexical-models/issues
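The split described above could be sketched roughly as follows. This is only an illustration, not the script used for this model: the names `AMBIGUOUS_SPLITS` and `split_count` are hypothetical, the 70/30 weights are the guess from the example rather than measured data, and the tab-separated output assumes a word list with a word column followed by a count column.

```python
from typing import Dict, List, Tuple

# Hypothetical record of the chosen splits (integer weights per Latin form).
# Keeping this table under version control documents the choice so it can be
# reapplied when the model is rebuilt from a larger corpus.
AMBIGUOUS_SPLITS: Dict[str, List[Tuple[str, int]]] = {
    "шун": [("şuŋ", 70), ("şun", 30)],  # illustrative guess, not measured
}


def split_count(count: int, weights: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Distribute `count` across the Latin forms in proportion to `weights`.

    The last form receives the remainder, so the parts always sum to `count`.
    """
    total_weight = sum(weight for _, weight in weights)
    parts: List[Tuple[str, int]] = []
    assigned = 0
    for form, weight in weights[:-1]:
        part = round(count * weight / total_weight)
        parts.append((form, part))
        assigned += part
    parts.append((weights[-1][0], count - assigned))
    return parts


def to_tsv_lines(cyrillic_word: str, count: int) -> List[str]:
    """Return tab-separated "word<TAB>count" lines for one ambiguous word."""
    return [f"{form}\t{part}"
            for form, part in split_count(count, AMBIGUOUS_SPLITS[cyrillic_word])]


if __name__ == "__main__":
    for line in to_tsv_lines("шун", 100):
        print(line)
```

Giving the remainder to the last form keeps the total count of the original Cyrillic entry unchanged, whatever weights are chosen.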
@DavidLRowe thank you for your feedback! I’ll try to implement this approach.
This pull request adds a lexical model for the Chechen language using the 1992 Latin script.
Word forms are sourced from the corpus at corpora.dosham.info, created by khashasin, a member of our team.
Let me know if any adjustments are needed. Thanks!