add lexical model for chechen latin #296

gushmazuko · 2025-01-25T14:45:02Z

This pull request adds a lexical model for the Chechen language using the 1992 Latin script.

Word forms are sourced from the corpus at corpora.dosham.info, created by khashasin, a member of our team.

The model aims to improve text input accuracy and support the development of the Chechen language.
This was previously discussed in #294.

Let me know if any adjustments are needed. Thanks!


keyman-server · 2025-01-25T14:46:23Z

Thank you for your pull request. You'll see a "build failed" message until the Keyman team has reviewed the pull request and manually initiated the build process.

Every change committed to this branch will become part of this pull request. When you have finished submitting files and are ready for the Keyman team to review this pull request, please post a "Ready for review" comment.

gushmazuko · 2025-01-29T09:51:27Z

Hi @DavidLRowe, I hope you’re doing well! I just wanted to check whether there are any updates on this pull request.

DavidLRowe · 2025-01-29T19:27:08Z

@gushmazuko Thanks for the reminder. PR is approved and merged. Should be available shortly.

There are some entries in the word list with [REPLACE]. Perhaps something to address in a future update?

gushmazuko · 2025-01-30T17:09:46Z

Hello @DavidLRowe Thank you for merging the PR!

The entries with [REPLACE] in the word list originated from our transliteration library, which automatically adds [REPLACE] to flag cases that require manual review during transliteration from Cyrillic to Latin. These entries inadvertently made their way into the Keyman Lexical Model dictionary during its creation.

The flag is added to words that end in the Cyrillic letter "н" (En). When that letter is nasal it is transliterated as ŋ (Latin letter Eng), but the library cannot always tell whether this applies. Chechen has words that are spelled identically in Cyrillic but have different meanings, and the Latin script distinguishes them. For example:

шун - şun: table
шун - şuŋ: your
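As a minimal sketch of how flagged entries could be kept out of the dictionary until they are reviewed, the Python below separates rows whose word still carries the marker. It assumes the word list is tab-separated with the word in the first column and the count in the second; the function name `split_flagged`, the sample rows, and the position of the marker inside the word are all hypothetical, not taken from the actual transliteration library.

```python
FLAG = "[REPLACE]"  # marker named in this discussion

def split_flagged(tsv_text):
    """Return (clean_rows, flagged_rows) from tab-separated word list text."""
    clean, flagged = [], []
    for line in tsv_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if FLAG in fields[0]:
            flagged.append(fields)  # needs manual review
        else:
            clean.append(fields)
    return clean, flagged

# Hypothetical sample: where the marker sits inside a word is an assumption.
sample = "şun\t30\nşu[REPLACE]\t100\nnana\t12\n"
clean, flagged = split_flagged(sample)
print(flagged)  # [['şu[REPLACE]', '100']]
```

Only the `clean` rows would go into the model; the `flagged` rows form the review queue.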

We would like to get your opinion or advice on the best way to handle such cases in the lexical model based on our Cyrillic corpus. Personally, I see a potential solution where we split the frequency of the word 'шун' in the Keyman Lexical Model’s frequency dictionary and transliterate them separately as 'şun' and 'şuŋ'.

Would you have any recommendations or suggestions on how to best implement this?

Looking forward to your feedback!

Best regards

DavidLRowe · 2025-01-30T19:44:56Z

I'm not an expert, but I think your approach of creating two Latin script entries for each ambiguous Cyrillic script entry makes sense. I wouldn't worry too much about getting the frequency split exactly right. For example, suppose шун had a frequency count of 100. I'm guessing 'your' is more common than 'table', so you could use 70 for şuŋ and 30 for şun. But I expect that using 80 and 20 wouldn't make much difference (depending on how many other words start with "şu").
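The arithmetic above can be sketched in a few lines of Python. This is only an illustration under the assumption that counts are integers and shares are given as whole-number weights; `split_count` is a hypothetical helper, not part of any Keyman tooling.

```python
def split_count(total, shares):
    """Divide an integer count among variants in proportion to their shares.

    Floor division can leave a small remainder; it is given to the variant
    with the largest share so the parts always add up to the original total.
    """
    weight_sum = sum(shares.values())
    counts = {form: total * share // weight_sum for form, share in shares.items()}
    remainder = total - sum(counts.values())
    counts[max(shares, key=shares.get)] += remainder
    return counts

print(split_count(100, {"şuŋ": 70, "şun": 30}))  # {'şuŋ': 70, 'şun': 30}
```

Keeping the total unchanged means the split does not shift the word's weight relative to the rest of the list.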

You may want to think about how to replicate the process (or at least document your choices), so that if you revise the lexical model at a later date with new files in your corpus, you'll have a record that şuŋ / şun previously had an 80% / 20% split.
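One way to keep such a record, sketched here in Python, is to store the manual decisions as data and apply them whenever the model is rebuilt. The `SPLITS` table, the `expand` function, and the idea of serializing the table as JSON are all assumptions made for illustration; the percentages are the example figures from this comment, not measured values.

```python
import json

# Hypothetical record of manual decisions:
# Cyrillic form -> Latin variants and their shares in percent.
SPLITS = {"шун": {"şuŋ": 80, "şun": 20}}

def expand(cyrillic_counts, splits):
    """Turn counts of ambiguous Cyrillic words into counts of Latin variants."""
    latin = {}
    for word, count in cyrillic_counts.items():
        variants = splits.get(word)
        if variants is None:
            continue  # unambiguous words follow the normal transliteration (not shown)
        share_sum = sum(variants.values())
        for form, share in variants.items():
            # Floor division: a unit or so of the count may be lost to rounding.
            latin[form] = latin.get(form, 0) + count * share // share_sum
    return latin

# Serialized form that could be committed next to the model sources.
record = json.dumps(SPLITS, ensure_ascii=False, indent=2, sort_keys=True)
result = expand({"шун": 100}, SPLITS)
print(result)  # {'şuŋ': 80, 'şun': 20}
```

With the table under version control, a later rebuild from a larger corpus applies the same splits and any change to them shows up in the history.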

It's exciting to see this in use! You can also ask questions on the SIL Language Software Community (https://community.software.sil.org/c/keyman/) or raise an issue on this repository: https://github.com/keymanapp/lexical-models/issues

gushmazuko · 2025-02-01T13:47:54Z

@DavidLRowe thank you for your feedback! I’ll try to implement this approach.

gushmazuko and others added 6 commits January 21, 2025 17:42

add lexical model for chechen latin

f75852f

fix: remove extra comma

cc3eb2d

Merge pull request #1 from chechen-language/chechen_latin

104d7d8

add lexical model for chechen latin

Clarify data permission in README.md

9166464

move to release & rename chechen_latin to chechen_language.ce-latn.ch…

72d1456

…echen

Merge pull request #2 from chechen-language/chechen_latin

1cc51fd

move to release & some fixes

DavidLRowe approved these changes Jan 29, 2025

DavidLRowe merged commit 18422a6 into keymanapp:master Jan 29, 2025
2 of 3 checks passed