Thai Natural Language Processing (Thai NLP) Resource

A collection of Thai Natural Language Processing (NLP) software libraries, dictionaries, and corpora. Pull requests are always welcome.

Thai NLP Libraries/Services

Thai Character Cluster

Library	Description	Programming Languages	Features	License	Author & Link
JTCC	Thai Character Cluster	Java		GPL-3.0	Wittawat
TCC	Thai Character Cluster	Python		Apache 2.0	Wannaphong

Thai Sentiment Analysis

Library	Description	Programming Languages	Features	License	Author & Link
sentiment_analysis_thai					JagerV3

Thai Soundex

Library	Description	Programming Languages	Features	License	Author & Link
LK82 + Udom83	Thai Soundex	Python			Korakot

Word Segmentation

Library	Description	Programming Languages	Features	License	Author & Link
Swath	SWATH (Smart Word Analysis for THai) is a word segmentation tool for Thai	C	Longest Matching, Maximal Matching and Part-of-Speech Bigram.	GPL	CMU
Lexto	Lexto: Thai Lexeme Tokenizer	Java		LGPL	NECTEC
		Python 2		LGPL	Python2 Wrapper
		Python 3		LGPL	Python3 Wrapper
Wordcut	Thai word breaker for Node.js	JavaScript, Node.js		LGPL-3.0	veer66, github
wordcutpy	A simple Thai word tokenizer written in 1 Python file	Python 3		LGPL-3.0	veer66, github
CutKum	Thai Word-Segmentation with Deep Learning in TensorFlow. RNN.	Python	93% F-measure.	MIT	Pucktada, github
Thai Language Toolkit (tltk)	Based on a paper by Wirote Aroonmanakun in 2002. Word segmentation is based on a maximum collocation approach. Syllable segmentation is based on trigram statistics. (Dataset is included)	Python	97.86% F-measure. (Tested on a different test set, so the score is not directly comparable with other models.)	GPLv3	awirote, the Python Package Index
DeepCut	A Thai word tokenization library using Deep Neural Network. CNN.	Python	98.8% F-measure.	MIT	rkcosmos, github
SynThai	Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM.	Python	99.2% F-measure	MIT	KenjiroAI, github
CutThai	Thai word segmentation written in CoffeeScript	CoffeeScript		MIT	Pureexe/cutthai Github
Multi-Candidate-Word-Segmentation	Multi Candidate Word Segmentation for Thai language. RNN. LSTM.	Python	97.0% F-measure (Word Level), 98.95% F-measure (Boundary Level)	MIT	Paper, earthy123/Multi-Candidate-Word-Segmentation
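
Several entries above mention dictionary-based longest matching. The following is a minimal sketch of that idea only, not the implementation used by Swath or any other listed library. The function name `longest_match_segment` and the three-word `toy_dictionary` are hypothetical and exist purely for illustration; a real segmenter needs a full lexicon and a better strategy for unknown words.

```python
def longest_match_segment(text, dictionary):
    """Greedy left-to-right segmentation.

    At each position, take the longest dictionary word that starts there.
    """
    # Longest word length bounds how far ahead we need to look.
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, shrinking until a word is found.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            # Unknown character: emit it alone so the scan always advances.
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy dictionary ("I", "eat", "rice"); not taken from any listed resource.
toy_dictionary = {"ฉัน", "กิน", "ข้าว"}
print(longest_match_segment("ฉันกินข้าว", toy_dictionary))
```

Greedy longest matching is simple and fast, but it commits to the first long word it finds, which is why the libraries above also list maximal matching and statistical or neural models.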

Part of Speech Tagging (POS Tagging)

Library	Description	Programming Languages	Features	License	Author & Link
Jitar+NAiST	A simple Trigram HMM part-of-speech tagger	Java			Veer66, Jitar + NAiST, 1 + NAiST, 2
SynThai	Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM.	Python	91.63% F-measure	MIT	KenjiroAI, github

Named Entity Recognition

Library	Description	Programming Languages	Features	License	Author & Link
Named Entity Tagging (Thai NEST)	Thai Named Entity tagging Specification and Tools			GPL	KINDML, SIIT, AIAT

News Structure Tagging

Library	Description	Programming Languages	Features	License	Author & Link
News Structure Tagging Program	Thai News Structure Tagging Program		Metadata tagging, Structure tagging, Automatic News Title Generation	GPL	AIAT

Syntactic Parsing & Tools

Library	Description	Programming Languages	Features	License	Author & Link
Chart-parser	Extract Syntactic Structure from POS Tagged Sentence.	C		All rights reserved	Thanaruk T.
Grammar Processing	Labelled Brackets -> Context Free Grammars (CFGs)	Python	Transform and compute probability		Thodsaporn C.

Thai Word Embedding

Library	Description	Programming Languages	Features	License	Author & Link
kobkrit-word-embedding	TensorFlow implementation of Thai word embedding	Python	Source code, Example, Word distance graph	LGPL	Kobkrit V.

Thai Question Answering (Machine Comprehension)

Service	Description	License	Author & Link
Thai Machine Comprehension (ThaiMC)	Bidirectional Attention Flow	Copyright (As the service)	iApp-AI

Dictionaries / Translation Pairs

Library	Description	Size	Features	License	Link
Transliteration Corpus		31K pairs	Thai-Eng Translation Pair	CC BY-NC-SA 3.0 TH	NECTEC
LEXiTRON	Thai<->English Dictionary		TH->EN, EN->TH	LEXiTRON License	NECTEC
Yaitron	LEXiTRON in machine readable format (XML)		TH->EN, EN->TH	LEXiTRON License	Veer66 Schema, Data & Conversion Code

Downloadable Text Corpus

Library	Description	Size	Features	License	Link
ORCHID		30K sent.	Word Seg., POS Tagged.	CC BY-NC-SA 3.0 TH	NECTEC
InterBEST 2009/2010		5M words	Word Seg.	CC BY-NC-SA 3.0 TH	NECTEC
Thai Wikipedia	Formal Articles	1.49GB (~213.1 MB compressed)	XML	GFDL	WIKIPEDIA
TNC Top-5000 Words	Word frequency	5,000 words	Frequency of Thai words in various genres, EXCEL	All rights reserved	CHULA
Click Bait Sentences	Thai Click Bait Sentence	330 sent. (90.7KB)		MIT	Wannaphongcom
Thai Sentimental Word List	Thai Sentimental Words List	52KB	Words separated into Adj, V	MIT	Wannaphongcom
Prime Minister 29	Prime Minister 29's Speech Sentences	338KB	Word segmented, Named Entity tagged	MIT	Wannaphongcom
Thai named entity corpora	Named entity corpora by Wirote Aroonmanakun's students	266KB-1.5MB	Syllable seg., word seg., Named Entity tagged	GPLv3 (unconfirmed; tltk uses this license)	นัชชา ถิระสาโรช Data, ศศิวิมล กาลันสีมา Data, ณัฐดาพร เลิศชีวะ Data
Thai WordNet	THE CONSTRUCTION OF THAI WORDNET OF 1ST ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD AND WITH DICTIONARIES OF DIFFERENT COMPILATIONAL APPROACHES (ธนนท์ หลีน้อย); THE CONSTRUCTION OF THAI WORDNET OF 2ND ORDER ENTITY COMMON BASE CONCEPTS USING A BI-DIRECTIONAL TRANSLATION METHOD: A STUDY OF THE DIVERSITY OF MEANINGS AFFECTING TRANSLATIONAL ACCURACY (ปริศนา อัครพุทธิพร)		WordNet	N/A	ธนนท์ หลีน้อย 2008, ปริศนา อัครพุทธิพร Data 2008
Toxicity in Thai Tweet Corpus	Tokyo Metropolitan University Natural Language Processing Group		Each tweet is labeled as toxic or non-toxic	CC BY-NC 4.0	tmu-nlp

Web Query Text Corpus

Library	Description	Size	Features	License	Link
Thai National Corpus 2		32M words	Query text by genre, domain	All rights reserved	CHULA
Thai Medical Document		3,594 docs	Document and dynamic keyword map	All rights reserved	KINDML, SIIT
Southeast Asian Languages Library	Thai News, Web Text, Pop Music, Literature, Toponyms	20M chars	Phrase around a search text		SEALang
HSE Thai Corpus	Modern texts written in Thai (mostly news websites)	50M tokens	Query by word form, lexeme, translation, grammatical attributes, lexical attributes		HSE School of Linguistics

Pre-trained Word Vectors

Pre-trained Model	Description	Size	Dimensions	License	Link
fastText	Skip-Gram model trained on Wikipedia using fastText		300	CC BY-SA 3.0	Facebook + Bin & Text + Text Only
thai2vec v0.2	ULMFiT on Wikipedia. Perplexity of 34.9 with 60,002 embeddings.	70MB	300	MIT	thai2vec / pyThaiNLP
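
As a rough illustration of how text-format word vectors like the fastText "Text Only" download can be read, here is a minimal sketch. It assumes the plain-text `.vec` layout (a header line with the vocabulary size and dimension, then one word and its values per line). The helper names `load_vec` and `cosine` and the three 2-dimensional toy vectors are hypothetical; real models have 300 dimensions and are usually loaded with a dedicated library.

```python
import io
import math


def load_vec(stream):
    """Read vectors from a text stream in the assumed .vec layout."""
    header = stream.readline().split()
    dim = int(header[1])  # header is "<vocabulary size> <dimension>"
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed or empty lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical toy data ("cat", "dog", "car"), standing in for a real .vec file.
toy_file = io.StringIO("3 2\nแมว 1.0 0.0\nหมา 0.8 0.6\nรถ 0.0 1.0\n")
vectors = load_vec(toy_file)
print(cosine(vectors["แมว"], vectors["หมา"]))
```

With a real file, `load_vec(open(path, encoding="utf-8"))` would follow the same pattern, where `path` is wherever the download was saved.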

Text Classification Benchmarks

Model	Description	Dataset	Accuracy	License	Link
thai2vec v0.1	ULMFiT	BEST	94.4%	MIT	thai2vec / pyThaiNLP
thai2vec v0.2	ULMFiT	Wongnai Challenge	62.7%	MIT	thai2vec / pyThaiNLP

Not found? Try another Thai NLP awesome list/resource (like this one)

http://aiat.in.th/resources/

Acknowledgements

Arthit (https://www.facebook.com/arthit) - for suggestions on license wording.
C4N (https://github.com/kobkrit/nlp_thai_resources/commits/master/README.md?author=c4n)
Veer66 (https://github.com/kobkrit/nlp_thai_resources/commits/master/README.md?author=veer66)
Bi89 (https://github.com/kobkrit/nlp_thai_resources/commits/master/README.md?author=bi89)
Tchayintr (https://github.com/kobkrit/nlp_thai_resources/commits/master/README.md?author=tchayintr)
PureEXE (https://github.com/kobkrit/nlp_thai_resources/commits/master/README.md?author=pureexe)
Cstorm125 (https://github.com/kobkrit/nlp_thai_resources/commits/master/README.md?author=cstorm125)
