A collection of Thai Natural Language Processing (NLP) software libraries, dictionaries, and corpora. Pull requests are always welcome.
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
JTCC | Thai Character Cluster | Java | | GPL-3.0 | Wittawat |
TCC | Thai Character Cluster | Python | | Apache 2.0 | Wannaphong |
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
sentiment_analysis_thai | | | | | JagerV3 |
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
LK82 + Udom83 | Thai Soundex | Python | | | Korakot |
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
Swath | SWATH (Smart Word Analysis for THai) is a word segmentation tool for Thai | C | Longest Matching, Maximal Matching, and Part-of-Speech Bigram. | GPL | CMU |
Lexto | Lexto: Thai Lexeme Tokenizer | Java | | LGPL | NECTEC |
Lexto (Python 2 wrapper) | | Python 2 | | LGPL | Python2 Wrapper |
Lexto (Python 3 wrapper) | | Python 3 | | LGPL | Python3 Wrapper |
Wordcut | Thai word breaker for Node.js | JavaScript, Node.js | | LGPL-3.0 | veer66, github |
wordcutpy | A simple Thai word tokenizer written in a single Python file | Python 3 | | LGPL-3.0 | veer66, github |
CutKum | Thai word segmentation with deep learning in TensorFlow. RNN. | Python | 93% F-measure. | MIT | Pucktada, github |
Thai Language Toolkit (tltk) | Based on a paper by Wirote Aroonmanakun in 2002. Word segmentation is based on a maximum collocation approach. Syllable segmentation is based on trigram statistics. (Dataset is included) | Python | 97.86% F-measure. (Measured on a different test set, so it is not directly comparable with the other models.) | GPLv3 | awirote, the Python Package Index |
DeepCut | A Thai word tokenization library using Deep Neural Network. CNN. | Python | 98.8% F-measure. | MIT | rkcosmos, github |
SynThai | Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM. | Python | 99.2% F-measure | MIT | KenjiroAI, github |
CutThai | Thai word segmentation written in CoffeeScript | CoffeeScript | | MIT | Pureexe/cutthai, GitHub |
Multi-Candidate-Word-Segmentation | Multi Candidate Word Segmentation for Thai language | Python, RNN, LSTM | 97.0% F-measure (Word Level), 98.95% F-measure (Boundary Level) | MIT | Paper, earthy123/Multi-Candidate-Word-Segmentation |
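
Several of the tools above list longest matching among their features. As a rough illustration only (not the implementation used by Swath or any other library in this table), the idea can be sketched in a few lines of Python. The function name `longest_matching` and the toy dictionary `TOY_DICT` are assumptions made up for this example.

```python
def longest_matching(text, dictionary):
    """Greedy left-to-right segmentation.

    At each position, take the longest dictionary word that starts there.
    Characters not covered by any dictionary word are emitted one by one.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # unknown character: fall back to a single-character token
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy dictionary, for illustration only.
TOY_DICT = {"ตา", "ตาก", "กลม", "ลม"}

print(longest_matching("ตากลม", TOY_DICT))
```

With this toy dictionary the greedy strategy picks `ตาก` + `ลม` even though `ตา` + `กลม` is also a valid reading, which is the kind of ambiguity that maximal matching and the statistical or neural segmenters in the table try to resolve.
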
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
Jitar+NAiST | A simple trigram HMM part-of-speech tagger | Java | | | Ver66, Jitar + NAiST, 1 + NAiST, 2 |
SynThai | Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM. | Python | 0.9163 F-measure. RNN. LSTM | MIT | KenjiroAI, github |
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
Named Entity Tagging (Thai NEST) | Thai Named Entity Tagging Specification and Tools | | | GPL | KINDML, SIIT, AIAT |
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
News Structure Tagging Program | Thai News Structure Tagging Program | | Metadata tagging, Structure tagging, Automatic News Title Generation | GPL | AIAT |
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
Chart-parser | Extract syntactic structure from a POS-tagged sentence. | C | | All rights reserved | Thanaruk T. |
Grammar Processing | Labelled Brackets -> Context Free Grammars (CFGs) | Python | Transform and compute probability | | Thodsaporn C. |
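
To make the "Labelled Brackets -> CFGs" entry concrete, here is a minimal sketch of how production rules and their relative-frequency probabilities could be read off bracketed parse trees. It is not the Grammar Processing tool itself; the function names (`parse_brackets`, `count_rules`, `estimate_pcfg`) and the two-sentence toy treebank are assumptions for illustration.

```python
from collections import Counter


def parse_brackets(s):
    """Parse a string like '(S (NP (N แมว)) (VP (V วิ่ง)))' into (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # terminal word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def count_rules(tree, counts):
    """Count one production (lhs, rhs) per tree node, recursively."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if isinstance(child, tuple):
            count_rules(child, counts)


def estimate_pcfg(bracketed_sentences):
    """Relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    counts = Counter()
    for sentence in bracketed_sentences:
        count_rules(parse_brackets(sentence), counts)
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


# Hypothetical two-sentence treebank, for illustration only.
treebank = [
    "(S (NP (N แมว)) (VP (V วิ่ง)))",
    "(S (NP (N หมา)) (VP (V วิ่ง)))",
]
for (lhs, rhs), prob in sorted(estimate_pcfg(treebank).items()):
    print(lhs, "->", " ".join(rhs), prob)
```

The sketch does not distinguish terminals from non-terminals on the right-hand side and applies no smoothing; a real treebank would need both.
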
Library | Description | Programming Languages | Features | License | Author & Link |
---|---|---|---|---|---|
kobkrit-word-embedding | TensorFlow implementation of Thai word embedding | Python | Source code, Example, Word distance graph | LGPL | Kobkrit V. |
Service | Description | License | Author & Link |
---|---|---|---|
Thai Machine Comprehension (ThaiMC) | Bidirectional Attention Flow | Copyright (As the service) | iApp-AI |
Library | Description | Size | Features | License | Link |
---|---|---|---|---|---|
Transliteration Corpus | | 31K pairs | Thai-Eng Translation Pair | CC BY-NC-SA 3.0 TH | NECTEC |
LEXiTRON | Thai<->English Dictionary | | TH->EN, EN->TH | LEXiTRON License | NECTEC |
Yaitron | LEXiTRON in machine-readable format (XML) | | TH->EN, EN->TH | LEXiTRON License | Veer66 Schema, Data & Conversion Code |
Library | Description | Size | Features | License | Link |
---|---|---|---|---|---|
ORCHID | | 30K sent. | Word Seg., POS Tagged. | CC BY-NC-SA 3.0 TH | NECTEC |
InterBEST 2009/2010 | | 5M words | Word Seg. | CC BY-NC-SA 3.0 TH | NECTEC |
Thai Wikipedia | Formal Articles | 1.49GB (~213.1 MB compressed) | XML | GFDL | WIKIPEDIA |
TNC Top-5000 Words | Word frequency | 5,000 words | Frequency of Thai words in various genres, EXCEL | All rights reserved | CHULA |
Click Bait Sentences | Thai Click Bait Sentence | 330 sent. (90.7KB) | | MIT | Wannaphongcom |
Thai Sentimental Word List | Thai Sentimental Words List | 52KB | Words separated into Adj, V | MIT | Wannaphongcom |
Prime Minister 29 | Prime Minister 29's Speech Sentences | 338KB | Word segmented, named entity tagged | MIT | Wannaphongcom |
Thai named entity corpora | Named entity corpora by Wirote Aroonmanakun's students | 266KB-1.5MB | Syllable seg., word seg., named entity tagged | GPLv3 (unconfirmed; tltk uses this license) | นัชชา ถิระสาโรช Data ศศิวิมล กาลันสีมา Data ณัฐดาพร เลิศชีวะ Data |
Thai WordNet | "The Construction of Thai WordNet of 1st Order Entity Common Base Concepts Using a Bi-directional Translation Method and with Dictionaries of Different Compilational Approaches" (ธนนท์ หลีน้อย); "The Construction of Thai WordNet of 2nd Order Entity Common Base Concepts Using a Bi-directional Translation Method: A Study of the Diversity of Meanings Affecting Translational Accuracy" (ปริศนา อัครพุทธิพร) | | WordNet | N/A | ธนนท์ หลีน้อย 2008 ปริศนา อัครพุทธิพร Data 2008 |
Toxicity in Thai Tweet Corpus | Tokyo Metropolitan University Natural Language Processing Group | | Each tweet is labeled as toxic or non-toxic | CC BY-NC 4.0 | tmu-nlp |
Library | Description | Size | Features | License | Link |
---|---|---|---|---|---|
Thai National Corpus 2 | | 32M words | Query text by genre, domain | All rights reserved | CHULA |
Thai Medical Document | | 3,594 docs | Document and dynamic keyword map | All rights reserved | KINDML, SIIT |
Southeast Asian Languages Library | Thai News, Web Text, Pop Music, Literature, Toponyms | 20M chars | Phrase around a search text | | SEALang |
HSE Thai Corpus | Modern texts written in Thai (mostly from news websites) | 50M tokens | Query by word form, lexeme, translation, grammatical attributes, lexical attributes | | HSE School of Linguistics |
Pre-trained Model | Description | Size | Dimensions | License | Link |
---|---|---|---|---|---|
fastText | Skip-Gram model trained on Wikipedia using fastText | | 300 | CC BY-SA 3.0 | Facebook + Bin & Text + Text Only |
thai2vec v0.2 | ULMFit on Wikipedia. Perplexity of 34.9 with 60,002 embeddings. | 70MB | 300 | MIT | thai2vec / pyThaiNLP |
Model | Description | Dataset | Accuracy | License | Link |
---|---|---|---|---|---|
thai2vec v0.1 | ULMFit | BEST | 94.4% | MIT | thai2vec / pyThaiNLP |
thai2vec v0.2 | ULMFit | Wongnai Challenge | 62.7% | MIT | thai2vec / pyThaiNLP |
- Arthit (https://www.facebook.com/arthit), for suggestions on license wording.
- C4N (https://github.com/kobkrit/nlp_thai_resources/commits/master/README.md?author=c4n)
- Veer66 (https://github.com/kobkrit/nlp_thai_resources/commits/master/README.md?author=veer66)
- Bi89 (https://github.com/kobkrit/nlp_thai_resources/commits/master/README.md?author=bi89)
- Tchayintr (https://github.com/kobkrit/nlp_thai_resources/commits/master/README.md?author=tchayintr)
- PureEXE (https://github.com/kobkrit/nlp_thai_resources/commits/master/README.md?author=pureexe)
- Cstorm125 (https://github.com/kobkrit/nlp_thai_resources/commits/master/README.md?author=cstorm125)