mmCRFseg: Myanmar Word Segmentation using Conditional Random Fields

Word Segmentation for Myanmar Language using Conditional Random Fields with character tagging - 1 for beginning of sentences and 0 for others.

ဪ ၊ အခန်း နံပါတ် ၃၀၅ မှာ တည်း လို့ ရ ပါ တယ် ။ အေးအေးဆေးဆေး ရှိ ပါ တယ် ။

[('ဪ', 0), ('၊', 1), ('အ', 1), ('ခ', 0), ('န', 0), ('်', 0), ('း', 0), ('န', 1), ('ံ', 0), ('ပ', 0), ('ါ', 0), ('တ', 0), ('်', 0), ('၃', 1), ('၀', 0), ('၅', 0), ('မ', 1), ('ှ', 0), ('ာ', 0), ('တ', 1), ('ည', 0), ('်', 0), ('း', 0), ('လ', 1), ('ိ', 0), ('ု', 0), ('့', 0), ('ရ', 1), ('ပ', 1), ('ါ', 0), ('တ', 1), ('ယ', 0), ('်', 0), ('။', 1), ('အ', 1), ('ေ', 0), ('း', 0), ('အ', 0), ('ေ', 0), ('း', 0), ('ဆ', 0), ('ေ', 0), ('း', 0), ('ဆ', 0), ('ေ', 0), ('း', 0), ('ရ', 1), ('ှ', 0), ('ိ', 0), ('ပ', 1), ('ါ', 0), ('တ', 1), ('ယ', 0), ('်', 0), ('။', 1)]

	Scores
Precision	0.959
Recall	0.953
Accuracy	0.980

It was trained and tested on myPOS corpus version 3.0[1] and additional data collected from the internet and other Open resources. Win Pa Pa et.al [2] also studied the CRF approach but in this tool, we used different feature sets for better segmentation results.

Other useful text segmentation tools for Myanmar Langauge

myWord[3]: syllable, word, sub_word and phrase segmenter for Burmese
- Regex-based for Syllable
- Dictionary-based for Word
- Unsupervised learning based for Phrase
Myan-word-breaker[4]: Statistical approach based word segmentation
pyidaungsu[5]: Python text processing library for Myanmar language

Download repo

git clone https://github.com/ThuraAung1601/mm-Word-Seg

Install requirement

pip install -r python-crfsuite

How to run

တစ်ဖိုင်လုံး Segmentation လုပ်ချင်တဲ့ ဖိုင် ရှိရင် ဒီလို အသုံးပြုပါ။

python ./seg.py -i test.txt -o out.txt

$ python seg.py --help
usage: seg.py [-h] [-i INPUT] [-o OUTPUT]

Word Segmentation for Burmese language using CRF model

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        input file
  -o OUTPUT, --output OUTPUT
                        output file

တခြား Program တွေထဲ ထည့်သုံးချင်ရင် ဒီလို ခေါ် သုံးလို့ ရပါတယ်။

from seg import segment_word
sentence = "ဒီနေ့ကမ္ဘာ့စံချိန်သစ်တင်ခဲ့တယ်။"
segment_word(sentence)

Output: "ဒီ နေ့ ကမ္ဘာ့ စံချိန် သစ်တင် ခဲ့ တယ် ။"

from seg import segment_word
sentence = "မိုးရွာထားလို့အခန်းထဲမှာစိုထိုင်းထိုင်းဖြစ်နေတယ်"
segment_word(sentence)

Output: "မိုးရွာ ထား လို့ အခန်း ထဲ မှာ စိုထိုင်း ထိုင်း ဖြစ် နေ တယ်"

from seg import segment_word
sentences = "မောက်ချိုက်အရသာလေးက ပေါ့တဲ့ဘက်သွားတော့ အငန်မစားတဲ့သူဆို အတော်ပဲ ဆိုင်ကတမင် အငန်မစားတဲ့သူတွေအဆင်ပြေအောင်လျော့ထားပုံရတယ်"
segment_word(sentences)

Output: 'မောက်ချိုက် အရသာလေး က ပေါ့ တဲ့ ဘက် သွား တော့ အငန် မ စား တဲ့ သူ ဆို အတော် ပဲ ဆိုင် ကတမင် အငန် မ စား တဲ့ သူ တွေ အဆင်ပြေ အောင် လျော့ ထား ပုံ ရ တယ်'

from seg import segment_word
sentences = "ကိုရိုနာဗိုင်းရပ်စ် ဆိုတာ လူနဲ့ တခြားတိရစ္ဆာန်တွေကို ကူးစက်တတ်တဲ့ ဗိုင်းရပ်စ် တစ်မျိုး ဖြစ်ပါတယ် အကြမ်းအားဖြင့် လူသားတွေအကြားမှာ ကူးစက်ပြန့်ပွားပြီး ချောင်းဆိုးတာ ကိုယ်ပူတာနဲ့ နှာရည်ယိုတာတွေလို သာမန် အဖျားရောဂါနဲ့ ဆင်တူတဲ့ လက္ခဏာတွေကို ဖြစ်စေပါတယ်"
segment_word(sentences)

Output: ကိုရိုနာ ဗိုင်းရပ်စ် ဆို တာ လူ နဲ့ တခြား တိရစ္ဆာန် တွေ ကို ကူးစက် တတ် တဲ့ ဗိုင်းရပ်စ် တစ် မျိုး ဖြစ် ပါတယ် အကြမ်း အားဖြင့် လူသား တွေ အကြား မှာ ကူးစက် ပြန့်ပွား ပြီး ချောင်းဆိုးတာ ကိုယ်ပူတာ နဲ့ နှာရည် ယို တာ တွေ လို သာမန် အဖျား ရောဂါ နဲ့ ဆင်တူ တဲ့ လက္ခဏာ တွေ ကို ဖြစ် စေ ပါတယ်

References

[1] Ye Kyaw Thu, "myPOS (Myanmar Part-of-Speech) Corpus for Myanmar NLP Research and Developments", GitHub Link: https://github.com/ye-kyaw-thu/myPOS

[2] Win Pa Pa, Ye Kyaw Thu, Andrew M. Finch and Eiichiro Sumita. “Word Boundary Identification for Myanmar Text Using Conditional Random Fields.” ICGEC (2015).

[3] myWord: Syllable, Word, and Phrase Segmenter for Burmese, Ye Kyaw Thu, Sept 2021, GitHub Link: https://github.com/ye-kyaw-thu/myWord

[4] Myan-word-breaker: Nay Lin Aung, Github, Link: https://github.com/stevenay/myan-word-breaker

[5] pyidaungsu: Kaung Htet San, Github, Link: https://github.com/kaunghtetsan275/pyidaungsu

Citation

myCRFSeg: Word Segmentation Tool for Myanmar Language using Binary Tagging based Conditional Random Fields, Thura Aung, July 2022, GitHub Link: https://github.com/ThuraAung1601/mmCRFseg

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
.github/workflows		.github/workflows
model		model
notebook		notebook
LICENSE		LICENSE
README.md		README.md
seg.py		seg.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

mmCRFseg: Myanmar Word Segmentation using Conditional Random Fields

Other useful text segmentation tools for Myanmar Langauge

Download repo

Install requirement

How to run

References

Citation

About

Releases

Packages

Languages

License

ThuraAung1601/mmCRFseg

Folders and files

Latest commit

History

Repository files navigation

mmCRFseg: Myanmar Word Segmentation using Conditional Random Fields

Other useful text segmentation tools for Myanmar Langauge

Download repo

Install requirement

How to run

References

Citation

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages