Achilles is a project that uses character n-grams to guess the language of text supplied by the user. It mostly uses bigrams or 4-grams to build a probability distribution model: for each language, the model gives the probability of seeing each combination of characters. The training data is a set of books freely available from https://www.gutenberg.org/. The learning algorithm first computes the probability of each n-gram in each language and stores the result in a file. When asked to identify the language of a text, the algorithm compares the n-grams of the input with the stored n-grams and chooses the language with the highest probability. The more training data you give the algorithm, the better it gets. With 20 books per language, it reaches up to 94% accuracy with 4-grams.
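The train-then-score flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `ngrams`, `train`, and `detect`, the `floor` probability for unseen n-grams, and the use of summed log-probabilities are all assumptions, and saving the model to a file is left out for brevity.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Return the overlapping character n-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(corpora, n=4):
    """Build a per-language n-gram probability table.

    corpora maps a language name to its training text (hypothetical input
    format; the real project reads Project Gutenberg books).
    """
    model = {}
    for lang, text in corpora.items():
        counts = Counter(ngrams(text, n))
        total = sum(counts.values())
        model[lang] = {gram: count / total for gram, count in counts.items()}
    return model


def detect(text, model, n=4, floor=1e-6):
    """Return the language whose table gives the input the highest score.

    Scores are summed log-probabilities; n-grams missing from a table get
    the small assumed probability `floor` so the logarithm stays defined.
    """
    scores = {}
    for lang, probs in model.items():
        scores[lang] = sum(math.log(probs.get(g, floor)) for g in ngrams(text, n))
    return max(scores, key=scores.get)
```

For example, `train({"english": english_text, "spanish": spanish_text}, n=2)` builds a bigram table per language, and `detect("the dog", model, n=2)` then returns the best-scoring language name. The value of `n` passed to `detect` should match the one used for training.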
allarassemjonathan/Achilles
About
Language detection using a probabilistic model over the n-grams of different languages. It supports 4 languages: Spanish, French, German, and English.