add lexical model for chechen latin #296

Merged: 6 commits, Jan 29, 2025

gushmazuko
Contributor

This pull request adds a lexical model for the Chechen language using the 1992 Latin script.

Word forms are sourced from the corpus at corpora.dosham.info, created by khashasin, a member of our team.

  • The model aims to improve text input accuracy and support the development of the Chechen language.
  • This was previously discussed in #294.

Let me know if any adjustments are needed. Thanks!

@keyman-server

Thank you for your pull request. You'll see a "build failed" message until the Keyman team has reviewed the pull request and manually initiated the build process.

Every change committed to this branch will become part of this pull request. When you have finished submitting files and are ready for the Keyman team to review this pull request, please post a "Ready for review" comment.

@gushmazuko
Contributor Author

Hi @DavidLRowe, I hope you’re doing well! I just wanted to check whether there are any updates on this pull request.

@DavidLRowe DavidLRowe merged commit 18422a6 into keymanapp:master Jan 29, 2025
2 of 3 checks passed
@DavidLRowe
Collaborator

@gushmazuko Thanks for the reminder. PR is approved and merged. Should be available shortly.

There are some entries in the word list with [REPLACE]. Perhaps something to address in a future update?

@gushmazuko
Contributor Author

Hello @DavidLRowe Thank you for merging the PR!

Regarding the entries with [REPLACE] in the word list—they originated from our transliteration library, which automatically adds [REPLACE] to flag cases that require manual review during the transliteration from Cyrillic to Latin. These entries inadvertently made their way into the Keyman Lexical Model dictionary during its creation.

The library adds [REPLACE] to words ending in "Н" (Cyrillic Letter En) when it cannot tell whether that letter is nasal and should therefore be transliterated as ŋ (Latin Letter Eng). Chechen has words that are spelled identically in Cyrillic but have different meanings, and the Latin script distinguishes them. For example:

  • шун - şun: table
  • шун - şuŋ: your
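As an illustration of how the flagged entries described above could be kept out of a build, here is a minimal sketch. It assumes the word list is a tab-separated file with the word form in the first column and the count in the second, and that the flag appears literally inside the word form. The function name `partition_flagged` and the sample rows are hypothetical and not taken from the real word list.

```python
MARKER = "[REPLACE]"  # flag added by the transliteration library, per the comment above


def partition_flagged(lines, marker=MARKER):
    """Split word-list lines into clean entries and entries needing manual review.

    Assumes each data line is "word<TAB>count", optionally followed by more columns.
    """
    clean, flagged = [], []
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        entry = (fields[0], int(fields[1]))
        (flagged if marker in fields[0] else clean).append(entry)
    return clean, flagged


# Hypothetical sample rows, for illustration only.
sample = [
    "# word<TAB>count",
    "şun\t30",
    "şu[REPLACE]\t100",
]
clean, flagged = partition_flagged(sample)
print("clean:", clean)
print("needs review:", flagged)
```

The flagged list can then be reviewed by hand while only the clean entries go into the model.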

We would like to get your opinion or advice on the best way to handle such cases in the lexical model based on our Cyrillic corpus. Personally, I see a potential solution where we split the frequency of the word 'шун' in the Keyman Lexical Model’s frequency dictionary and transliterate them separately as 'şun' and 'şuŋ'.

Would you have any recommendations or suggestions on how to best implement this?

Looking forward to your feedback!

Best regards

@DavidLRowe
Collaborator

I'm not an expert, but I think your approach of creating two Latin script entries for each ambiguous Cyrillic script entry makes sense. I wouldn't worry too much about getting the frequency split exactly right. For example, suppose шун had a frequency count of 100. I'm guessing 'your' is more common than 'table', so you could use 70 for şuŋ and 30 for şun. But I expect that using 80 and 20 wouldn't make much difference (depending on how many other words start with "şu").
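The arithmetic in this suggestion can be sketched as follows. This is only an illustration: the function name `split_count` is hypothetical, and the 70/30 percentages are the guess from the comment above, not measured values. Integer percentages are used so that the split counts always add up to the original count.

```python
def split_count(total, percentages):
    """Divide an integer count among variants by integer percentages.

    The last variant receives the remainder, so the parts always sum to `total`.
    """
    if sum(percentages.values()) != 100:
        raise ValueError("percentages must sum to 100")
    items = list(percentages.items())
    result = {}
    assigned = 0
    for word, pct in items[:-1]:
        part = total * pct // 100  # floor division keeps counts as integers
        result[word] = part
        assigned += part
    last_word = items[-1][0]
    result[last_word] = total - assigned
    return result


# The example from the comment: шун with a count of 100, split 70 / 30.
print(split_count(100, {"şuŋ": 70, "şun": 30}))
```

Giving the remainder to the last variant avoids losing a count to rounding when the total is not evenly divisible.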

You may want to think about how to replicate the process (or at least document your choices), so that if you revise the lexical model at a later date with new files in your corpus, you'll have a record that şuŋ / şun previously had an 80% / 20% split.
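One way to keep such a record is a small JSON file stored next to the word list sources. The sketch below is an assumption about how that might look, not an existing Keyman convention; the function names `dump_splits` and `load_splits` and the file layout are hypothetical.

```python
import json


def dump_splits(splits):
    """Serialize split decisions as readable, stable JSON (non-ASCII kept as-is)."""
    return json.dumps(splits, ensure_ascii=False, indent=2, sort_keys=True)


def load_splits(text):
    """Parse split decisions and check that each set of percentages sums to 100."""
    splits = json.loads(text)
    for source, variants in splits.items():
        if sum(variants.values()) != 100:
            raise ValueError(f"percentages for {source!r} do not sum to 100")
    return splits


# Cyrillic source form -> Latin variants with their share of the count (percent).
decisions = {"шун": {"şuŋ": 80, "şun": 20}}

text = dump_splits(decisions)
print(text)
assert load_splits(text) == decisions
```

Sorting the keys keeps the file stable between runs, which makes later changes to a split easy to see in a diff.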

It's exciting to see this in use! You can also ask questions on the SIL Language Software Community: https://community.software.sil.org/c/keyman/ or raise an issue on this repository: https://github.com/keymanapp/lexical-models/issues

@gushmazuko
Contributor Author

@DavidLRowe thank you for your feedback! I’ll try to implement this approach.
