Update Greek dictionary for linguistic correctness and completeness #416

agoatboi · 2025-03-18T10:05:01Z

The goal of this PR is to manually check the Greek dictionary and replace all occurrences of:

Incorrect spellings
Incorrect stress signs
English names transliterated in Greek
Words with double stress symbol (valid, but only appears in particular sentence structuring).

With alternatives featuring:

Common words (and a few scientific or vernacular/idiomatic ones)
Correctly stressed word variations
Greek names (historical or current)
Occasional words stressed with (¨) "διαλυτικά" and (΅) "διαλυτικά με τόνο" which infrequently come up in writing.

In all replacements an effort was made to keep the letter distribution largely the same, but I am fairly sure that the average word length has increased. The file was checked for duplicates throughout, and none were created.

semanticdiff-com · 2025-03-18T10:05:04Z

Review changes with

Changed Files

File	Status
packages/keybr-keyboard/lib/language.ts	88% smaller
packages/keybr-keyboard-io/lib/parser/diacritics.ts	0% smaller
packages/keybr-keyboard/lib/layout/el_gr.ts	0% smaller
packages/keybr-phonetic-model/assets/model-el.data	Unsupported file format

aradzie · 2025-03-20T17:16:39Z

George, thank you for your contribution!
I know from experience that creating a quality word list is a tedious and time consuming task.
(I can't count how many hours I spent on the English list.)
That being said, the file that you are modifying is automatically generated, so your changes would be lost. We use a different approach.
It's a bit complicated, so let me explain.

First, we have a separate repository where we develop the word frequency dictionaries.
I compared your changes with the existing file. The list sizes are the same, so I assume you only fixed existing words and did not insert or delete any.
Based on this assumption I replaced the words in the original corpus repository.
Next, I passed the words through an automatic spell checker to filter out any invalid words.
Then I created a spreadsheet in Google Docs.
This is the usual approach: I create a spreadsheet for a native speaker to review. It seems to be easier this way because it does not require any coding from the reviewer.

The spreadsheet has three columns. If the third column is not empty, then it contains the original word before you fixed it, and such a row is highlighted in a separate color.
There is also a separate sheet named "rejected" with the words rejected by the spell checker. You may want to take a look at it.
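The positional comparison described above can be sketched roughly as follows. This is only an illustration, not the script actually used: the function name `changed_rows` and the sample words are hypothetical, and the spreadsheet's middle column is left out because the thread does not say what it holds.

```python
def changed_rows(original, fixed):
    """Pair each fixed word with the original it replaced ("" if unchanged).

    Assumes both lists have the same length and order, i.e. words were
    only edited in place, never inserted or deleted.
    """
    if len(original) != len(fixed):
        raise ValueError("lists differ in length; positional comparison is not valid")
    return [(new, old if old != new else "") for old, new in zip(original, fixed)]


# Hypothetical sample data, not taken from the real word list.
original = ["και", "ειναι", "John"]
fixed = ["και", "είναι", "Γιάννης"]

rows = changed_rows(original, fixed)
# Rows with a non-empty second field are the ones to highlight for review.
changed = [row for row in rows if row[1]]
```

Counting `changed` gives the number of edited words, which is the kind of figure a reviewer can use to decide how much of the list needs a second look.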

So, let me ask you to review the spreadsheet.
If you find a spelling error, just fix the word in the first column.
If you find an invalid word, then delete the row or simply blank out the cell. (Put the cursor on the invalid word then press Backspace.)

If you feel enthusiastic, then please also remove any potentially triggering words. As an example, take a look at the blacklist files for the English language: profanity.txt and sensitive.txt.

Thanks again!

aradzie · 2025-03-20T18:09:10Z

Hey, I had an idea. I took the original English blacklist of sensitive words and asked an AI bot to translate it to Greek. The translation may not be perfect, but it's a start. Here's the result.

The sensitive words are any words about race, ideology, religion or sex. They can create random combinations that can trigger some people.

Again, if you don't feel too enthusiastic, just skip this part.

So, please take a look at the updated spreadsheet. Let me know when you are done, and I'll update the website right away.

aradzie · 2025-03-20T18:22:46Z

I just found out that you can use the built-in Google Docs spell checker to review words quickly. It's a good start to automatically check the words before doing any manual work. The spell checker is available in the menu "Tools" -> "Spelling" -> "Spell check".

agoatboi · 2025-03-20T18:29:15Z

Hi aradzie,

Thanks a lot for your enthusiastic response and wealth of information!

I actually realized I had modified the wrong file shortly after finishing my (20-plus-hour 💀) run, when I tried to run the development server locally and, obviously, the changes weren't integrated. It was at this point that I marked the PR as a draft, hoping to work on it over the weekend. I now realize I should have also communicated this in the comments of the PR, but honestly it seemed like it was almost ready.

Unfortunately, what you have in the spreadsheet with the 1-1 mapping won't quite work, because I replaced some seemingly frequent (but wrong) words with rarer ones. When I realized that word frequencies had to be involved, I postponed this until I could find a suitable dataset to estimate them, or else sort the words by hand based on a heuristic/intuitive sense. This will still take me some time, but if you have any good ideas I'm happy to hear them.

I am now at the point where I've read the documentation further and decided to update both the layout and language to add the διαλυτικά με τόνο symbol appearing in a few words, but I can't quite get it to work. It seems as though the alphabet variable is cached somewhere, because I can't get the profile statistics page to update even by deleting symbols from it. I got lost between the various nested components trying to understand how it works, and regrettably, I'm not proficient in JS.
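For reference, a quick way to see which letters in a word list are not covered by an alphabet is sketched below. It is independent of the keybr code: `ALPHABET` here is a hypothetical stand-in, not the project's actual alphabet definition, and `uncovered_letters` is an illustrative helper. Text is normalized to NFC first, so a decomposed ι + U+0308 + U+0301 is treated the same as the precomposed ΐ (U+0390).

```python
import unicodedata


def uncovered_letters(words, alphabet):
    """Return the sorted characters used in `words` that `alphabet` lacks."""
    allowed = set(unicodedata.normalize("NFC", alphabet))
    found = set()
    for word in words:
        # Lowercase and compose, so base letter + combining marks become one code point.
        for ch in unicodedata.normalize("NFC", word.lower()):
            if ch not in allowed:
                found.add(ch)
    return sorted(found)


# Hypothetical alphabet: plain and stressed letters plus ϊ and ϋ,
# but without the "dialytika with tonos" letters ΐ and ΰ.
ALPHABET = "αβγδεζηθικλμνξοπρσςτυφχψωάέήίόύώϊϋ"
words = ["Μαΐου", "προϋπόθεση", "πραΰνω"]

missing = uncovered_letters(words, ALPHABET)
```

With this sample, `missing` holds ΐ and ΰ, which are the two letters that would need to be added to both the alphabet and the layout.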

Could I perhaps push the commits I have here so that you can guide me on how to do it? I think I'm very close.

As for the sensitive words, I found a few while going through the list, and in fact added a very small number of them myself, since they (rather unsurprisingly) show up often in Greek texts and occasionally have unique trigrams that the algorithm could help you get used to. I am willing to go through the list again and mark them for you. Perhaps their inclusion could be an optional setting to be toggled as desired? I am not sure if that is already implemented.

agoatboi · 2025-03-20T18:39:18Z

Force-pushed the commit, since I'm removing the original one that changed the JSON word list. I'll instead make a PR to the corpus repository with updated words and frequencies when I can muster the courage to estimate the latter.

aradzie · 2025-03-20T19:27:39Z

There must be a way to salvage your work. 20 hours is a lot of time. You changed around 1076 words in the original list. I think we can assume that the remaining ~9000 words are valid.

I added a new sheet with only changed words. Maybe you can review this smaller list? If a word is misspelled, fix it. If it is invalid, make it blank. I will find a way to reconcile your changes in the spreadsheet to the corpus repository.

Speaking of the corpus repository: the original word frequency dictionary is el_50k.csv. (I think it comes from parsing a movie subtitles database, but I don't remember exactly.) It has around 50000 words, so it's not a problem to cull it aggressively; there is plenty of room.

If you want to add a pull request, then it can be as simple as a text file with invalid words named lang-el/blacklist-xyz.txt. The corpus repository is a bunch of one-time throwaway scripts. For this reason I do not pay too much attention to the quality of the scripts, and you probably shouldn't either ;)
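Culling a frequency list with such a blacklist file could look roughly like the sketch below. It assumes a two-column `word,count` CSV layout and a blacklist with one word per line. Neither format is confirmed in this thread, and the helper names and sample data are hypothetical.

```python
import csv
import io


def read_frequencies(csv_text):
    """Parse 'word,count' lines into (word, count) pairs (assumed format)."""
    reader = csv.reader(io.StringIO(csv_text))
    return [(row[0], int(row[1])) for row in reader if len(row) >= 2]


def load_blacklist(text):
    """One word per line; blank lines are ignored (assumed format)."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def cull(rows, blacklist):
    """Drop every row whose word appears in the blacklist, keeping the order."""
    return [(word, count) for word, count in rows if word.lower() not in blacklist]


# Hypothetical sample data standing in for el_50k.csv and a blacklist file.
frequencies = read_frequencies("και,100\nκακό,7\nναι,50\n")
blacklist = load_blacklist("κακό\n\n")
kept = cull(frequencies, blacklist)
```

Because the source list is several times larger than the final word list, dropping rows this way should still leave enough candidates to fill it.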

agoatboi marked this pull request as draft March 18, 2025 12:07

aradzie force-pushed the master branch from dc93f80 to 23a157c Compare March 20, 2025 15:21

agoatboi force-pushed the master branch from 2eca528 to d2b1605 Compare March 20, 2025 18:38

add dialytika tonos symbol in Greek

06525a6

agoatboi force-pushed the master branch from d2b1605 to 06525a6 Compare March 20, 2025 18:42
