This repository is a collection of Myanmar language datasets, covering both speech and text resources. Because such datasets are scarce and hard to find, our goal is to provide a centralized reference point for researchers, developers, and language enthusiasts. We encourage contributions from the community.
If you know of or have access to additional Myanmar language datasets not listed here, please consider contributing by submitting a pull request or opening an issue. Let's collaborate to build a comprehensive inventory of Myanmar language resources.
- Crowdsourced high-quality Burmese speech dataset (SLR80)
  - Download Page
  - Download Link
  - HuggingFace Original Dataset
  - HuggingFace Myanmar Language Only Dataset
  - Notebook (train/test splitting and uploading to HuggingFace)
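As a rough illustration of the train/test splitting step mentioned for the notebook, here is a minimal sketch in plain Python. It is not the notebook's actual code: the function name `train_test_split`, the 10% test fraction, and the seed are all assumptions chosen for the example.

```python
import random


def train_test_split(items, test_fraction=0.1, seed=42):
    """Shuffle a copy of `items` deterministically and split it into (train, test).

    `test_fraction` and `seed` are illustrative defaults, not values taken
    from the SLR80 notebook.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(items)
    # A private Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(shuffled)
    # Keep at least one example in the test set.
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical utterance IDs standing in for real audio/transcript records.
utterance_ids = [f"utt_{i:04d}" for i in range(100)]
train_ids, test_ids = train_test_split(utterance_ids)
print(len(train_ids), len(test_ids))
```

When working with the Hugging Face `datasets` library, `Dataset.train_test_split` and `push_to_hub` cover the same split-and-upload workflow; the linked notebook may use different settings.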
- BloomSpeech
  - HuggingFace Dataset
  - Notebook (Loading Myanmar Language)
  - Note: Although the dataset labels it as Burmese, the data under `language='mya'` is actually in the Palaung (De'ang / Ta'ang / Riang) language.
- Asian Language Treebank (ALT)
  - Download Page
  - HuggingFace Dataset
  - It supports translation between the following language pairs:
    - Myanmar (Burmese) to Bengali
    - Myanmar (Burmese) to English
    - Myanmar (Burmese) to Filipino
    - Myanmar (Burmese) to Hindi
    - Myanmar (Burmese) to Bahasa Indonesia
    - Myanmar (Burmese) to Japanese
    - Myanmar (Burmese) to Khmer
    - Myanmar (Burmese) to Lao
    - Myanmar (Burmese) to Malay
    - Myanmar (Burmese) to Thai
    - Myanmar (Burmese) to Vietnamese
    - Myanmar (Burmese) to Chinese (Simplified)
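The language pairs listed for ALT can be turned into per-pair training examples along the following lines. This is only a sketch: the record layout (one dict per sentence, keyed by language code), the ISO 639 codes used as keys, and the helper name `myanmar_pairs` are assumptions for illustration, and the actual dataset may use different field names or codes.

```python
# Target languages from the ALT list, keyed by ISO 639 codes.
# Assumption: these keys are illustrative; the dataset's own keys may differ.
TARGET_LANGUAGES = {
    "bn": "Bengali",
    "en": "English",
    "fil": "Filipino",
    "hi": "Hindi",
    "id": "Bahasa Indonesia",
    "ja": "Japanese",
    "km": "Khmer",
    "lo": "Lao",
    "ms": "Malay",
    "th": "Thai",
    "vi": "Vietnamese",
    "zh": "Chinese (Simplified)",
}
SOURCE_LANGUAGE = "my"  # Myanmar (Burmese)


def myanmar_pairs(record):
    """Return {target_code: (myanmar_text, target_text)} for one parallel record.

    Target languages that are missing or empty in the record are skipped,
    and a record without Myanmar text yields no pairs.
    """
    source_text = record.get(SOURCE_LANGUAGE)
    if not source_text:
        return {}
    return {
        code: (source_text, record[code])
        for code in TARGET_LANGUAGES
        if record.get(code)
    }


# Hypothetical record with placeholder strings instead of real sentences.
example = {
    "my": "<Burmese sentence>",
    "en": "<English sentence>",
    "ja": "<Japanese sentence>",
    "th": "",
}
pairs = myanmar_pairs(example)
print(sorted(pairs))
```

Keeping the source text fixed and iterating over the targets mirrors the Myanmar-to-X direction of the list; swapping the tuple order would give the reverse direction.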
- A Corpus of Modern Burmese
  - Download Page
  - You can also download it directly from this repository.