Update Greek dictionary for linguistic correctness and completeness #416

Draft
wants to merge 1 commit into
base: master
from

Conversation

agoatboi

The goal of this PR is to manually check the Greek dictionary and replace all occurrences of:

  • Incorrect spellings
  • Incorrect stress signs
  • English names transliterated in Greek
  • Words with a double stress mark (valid, but appearing only in particular sentence structures).

With alternatives featuring:

  • Common words (and a few scientific or vernacular/idiomatic ones)
  • Correctly stressed word variations
  • Greek names (historical or current)
  • Occasional words stressed with (¨) "διαλυτικά" and (΅) "διαλυτικά με τόνο" which infrequently come up in writing.

In all replacements an effort was made to keep the letter distribution largely the same, but I am relatively sure that the average length of the words has been increased. The file was consistently checked for duplicates, and none were created.
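The duplicate check and the word-length claim can both be verified mechanically. A minimal sketch, assuming the word list is available as an array of strings; the function names are illustrative and not part of keybr:

```typescript
// Hypothetical helpers for illustration; not part of the keybr code base.

/** Returns every word that occurs more than once, in first-seen order. */
function findDuplicates(words: readonly string[]): string[] {
  const counts = new Map<string, number>();
  words.forEach((word) => {
    // Normalize to NFC so that precomposed and decomposed spellings of the
    // same accented word are counted as the same entry.
    const key = word.normalize("NFC");
    counts.set(key, (counts.get(key) ?? 0) + 1);
  });
  const duplicates: string[] = [];
  counts.forEach((count, key) => {
    if (count > 1) duplicates.push(key);
  });
  return duplicates;
}

/** Returns the mean word length in code points, or 0 for an empty list. */
function averageLength(words: readonly string[]): number {
  if (words.length === 0) return 0;
  let total = 0;
  words.forEach((word) => {
    total += Array.from(word.normalize("NFC")).length;
  });
  return total / words.length;
}
```

Running `averageLength` on the list before and after the replacements would turn "relatively sure" into a concrete number.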


semanticdiff-com bot commented Mar 18, 2025

Review changes with  SemanticDiff

Changed Files
File Status
  packages/keybr-keyboard/lib/language.ts  88% smaller
  packages/keybr-keyboard-io/lib/parser/diacritics.ts  0% smaller
  packages/keybr-keyboard/lib/layout/el_gr.ts  0% smaller
  packages/keybr-phonetic-model/assets/model-el.data Unsupported file format

@agoatboi agoatboi marked this pull request as draft March 18, 2025 12:07
@aradzie
Owner

aradzie commented Mar 20, 2025

George, thank you for your contribution!
I know from experience that creating a quality word list is a tedious and time consuming task.
(I can't count how many hours I spent on the English list.)
That being said, the file that you are modifying is automatically generated, so your changes will be lost. We use a different approach.
It's a bit complicated, so let me explain.

First, we have a separate repository where we develop the word frequency dictionaries.
I compared your changes with the existing file. The list sizes are the same, so I assume you only fixed the existing words and did not insert or delete any.
Based on this assumption I replaced the words in the original corpus repository.
Next, I passed the words through an automatic spell checker to filter out any invalid words.
Then I created a spreadsheet in Google Docs.
This is the usual approach: I create a spreadsheet for a native speaker to review. It seems to be easier this way because it does not require any coding from the reviewer.

The spreadsheet has three columns. If the third column is not empty, then it contains the original word before you fixed it, and such a row is highlighted in a separate color.
There is also a separate sheet, rejected, with the words rejected by the spell checker. You may want to take a look at it.
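The comparison behind such a spreadsheet can be sketched as a position-wise pairing of the two lists. This is only an illustration under the assumption stated above (same size, same order). The type and function names are made up for this example, and since the thread does not say what the second column holds, it is left out here.

```typescript
// Illustrative only; names are hypothetical and not taken from the corpus repository.

interface ReviewRow {
  /** The word as it appears after the fix (first spreadsheet column). */
  word: string;
  /** The original word if the fix changed it, otherwise empty (third column). */
  original: string;
}

/**
 * Pairs two word lists position by position. Throws if the sizes differ,
 * because then the "only fixes, no insertions or deletions" assumption fails.
 */
function buildReviewRows(
  originalWords: readonly string[],
  fixedWords: readonly string[],
): ReviewRow[] {
  if (originalWords.length !== fixedWords.length) {
    throw new Error("Lists differ in size; a position-wise comparison is not valid");
  }
  return fixedWords.map((word, index) => {
    const original = originalWords[index] ?? "";
    return { word, original: original === word ? "" : original };
  });
}

/** Keeps only the rows where a word was actually changed. */
function changedRows(rows: readonly ReviewRow[]): ReviewRow[] {
  return rows.filter((row) => row.original !== "");
}
```

Equal list sizes are a necessary condition only: the check cannot tell a corrected spelling from a word that was swapped for an unrelated one.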

So, let me ask you to review the spreadsheet.
If you find a spelling error, just fix the word in the first column.
If you find an invalid word, then delete the row or simply blank out the cell. (Put the cursor on the invalid word then press Backspace.)

If you feel enthusiastic, then please also remove any potentially triggering words. As an example take a look at the blacklist files for the English language -- profanity.txt and sensitive.txt

Thanks again!

@aradzie
Owner

aradzie commented Mar 20, 2025

Hey, I had an idea. I took the original English blacklist of sensitive words and asked an AI bot to translate it to Greek. The translation may not be perfect, but it's a start. Here's the result

The sensitive words are any words about race, ideology, religion or sex. They can create random combinations that can trigger some people.

Again, if you don't feel too enthusiastic, just skip this part.

So, please take a look at the updated spreadsheet. Let me know when you are done, and I'll update the web site right away.

@aradzie
Owner

aradzie commented Mar 20, 2025

I just found out that you can use the built-in Google Docs spell checker to review words quickly. It's a good start to automatically check the words before doing any manual work. The spell checker is available in the menu "Tools" -> "Spelling" -> "Spell check".

@agoatboi
Author

agoatboi commented Mar 20, 2025

Hi aradzie,

Thanks a lot for your enthusiastic response and wealth of information!

I actually realized I had modified the wrong file shortly after finishing my (20-plus-hour 💀) run, when I tried to run the development server locally and, obviously, the changes weren't picked up. It was at this point that I marked the PR as a draft, hoping to work on it over the weekend. I now realize I should have also communicated this in the comments of the PR, but honestly it seemed like it was almost ready.

Unfortunately, what you have in the spreadsheet with the 1-1 mapping won't quite work, because I replaced some seemingly frequent -but wrong- words with rather rarer ones. When I realized that word frequencies had to be involved, I postponed this until I could find a suitable dataset to estimate them, or to instead sort them by hand based on a heuristic/intuitive sense. This will take me some time to do still, but if you have any good ideas I'm happy to hear them.

I am now at the point where I've read the documentation further and decided to update both the layout and the language to add the διαλυτικά με τόνο symbol that appears in a few words, but I can't quite get it to work. It seems as though the alphabet variable is cached somewhere, because not even by deleting symbols from it can I get the profile statistics page to update. I got lost between the various nested components trying to understand how it works, and regretfully, I'm not proficient in JS.
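For reference, the letters with "διαλυτικά με τόνο" are single precomposed code points in Unicode (ΐ is U+0390 and ΰ is U+03B0), and under NFD each decomposes into a base letter followed by a combining diaeresis (U+0308) and a combining acute (U+0301). The sketch below counts these marks in a word. The helper name is hypothetical, and this is not a description of how keybr models its alphabet, which this thread does not show.

```typescript
// Hypothetical helper for illustration; not part of keybr.

const COMBINING_ACUTE = 0x0301; // the tonos after NFD decomposition
const COMBINING_DIAERESIS = 0x0308; // the dialytika after NFD decomposition

interface DiacriticCounts {
  tonos: number;
  dialytika: number;
}

/** Counts stress marks and dialytika in a word by inspecting its NFD form. */
function countDiacritics(word: string): DiacriticCounts {
  const counts: DiacriticCounts = { tonos: 0, dialytika: 0 };
  Array.from(word.normalize("NFD")).forEach((ch) => {
    const codePoint = ch.codePointAt(0);
    if (codePoint === COMBINING_ACUTE) counts.tonos += 1;
    if (codePoint === COMBINING_DIAERESIS) counts.dialytika += 1;
  });
  return counts;
}
```

A word such as "Μαΐου" yields one tonos and one dialytika, while a result with two tonos marks flags the double-stress forms that the PR description says were replaced.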

Could I perhaps push the commits I have here so that you can guide me on how to do it? I think I'm very close.

As for the sensitive words, I found a few while going through the list, and in fact added a very small number of them myself, since they (rather unsurprisingly) show up often in Greek texts and occasionally have unique trigrams that the algorithm could help you get used to. I am willing to go through the list again and mark them for you. Perhaps their inclusion could be an optional setting to be toggled as desired? I am not sure if that is already implemented.

@agoatboi
Author

Force-pushed the commit, since I'm removing the original one that changed the JSON word list. I'll instead make a PR to the corpus repository with updated words and frequencies when I can muster the courage to estimate the latter.

@aradzie
Owner

aradzie commented Mar 20, 2025

There must be a way to salvage your work. 20 hours is a lot of time. You changed around 1076 words in the original list. I think we can assume that the remaining ~9000 words are valid.

I added a new sheet with only changed words. Maybe you can review this smaller list? If a word is misspelled, fix it. If it is invalid, make it blank. I will find a way to reconcile your changes in the spreadsheet to the corpus repository.

Speaking of the corpus repository: the original word frequency dictionary is el_50k.csv. (I think it comes from parsing a movie subtitles database, but I don't remember exactly.) It has around 50000 words, so it's not a problem to cull it aggressively; there is plenty of room.

If you want to add a pull request, then it can be as simple as a text file with invalid words named lang-el/blacklist-xyz.txt. The corpus repository is a bunch of one-time throwaway scripts. For this reason I do not pay too much attention to the quality of the scripts, and you probably shouldn't either ;)
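As a rough illustration of how such a blacklist file could be applied, the sketch below drops blacklisted words from a frequency dictionary. It assumes each dictionary line has the form `word,count`, which this thread does not confirm, and it works on in-memory strings so that no file layout is presumed. All names are hypothetical; the actual corpus scripts may work differently.

```typescript
// Hypothetical sketch; the real corpus scripts are not shown in this thread.

/** Parses blacklist text: one word per line, blank lines ignored. */
function parseBlacklist(text: string): Set<string> {
  const words = new Set<string>();
  text.split(/\r?\n/).forEach((line) => {
    const word = line.trim().normalize("NFC");
    if (word !== "") words.add(word);
  });
  return words;
}

/**
 * Returns the dictionary lines whose word is not blacklisted.
 * ASSUMPTION: each line looks like "word,count".
 */
function filterDictionary(csvText: string, blacklist: ReadonlySet<string>): string[] {
  return csvText
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "")
    .filter((line) => {
      const word = (line.split(",")[0] ?? "").trim().normalize("NFC");
      return !blacklist.has(word);
    });
}
```

Both sides are normalized to NFC before comparison, so a blacklist entry typed with combining accents still matches a precomposed word in the dictionary.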
