Curated list of all AI related resources in Odia Language.

odisha-ml/Awesome-Odia-AI

Awesome-Odia-AI

Table of Contents

NLP

Translation

Transliteration

Language Understanding

Datasets

  • IndicCorp: Large sentence-level monolingual corpora for 11 Indian languages and Indian English containing 8.5 billion words (250 million sentences) from multiple news domain sources. [paper][code][web]
  • Naamapadam: Training and evaluation datasets for named entity recognition in multiple Indian languages. [paper][huggingface][web]
  • IndicCorp v2: The largest collection of texts for Indic languages, consisting of 20.9 billion tokens, of which 14.4B tokens correspond to 23 Indic languages and 6.5B tokens to Indian English content curated from Indian websites. [paper][code]

Language Model

  • Language Model : Pretrained Odia Language Model.
  • BertOdia : Bert-based Odia Language Model.
  • IndicBERT: Multilingual, compact ALBERT language model trained on IndicCorp covering 11 major Indian languages and English. Small model (18 million parameters) that is competitive with large LMs for Indian language tasks. [paper][code][web]
  • IndicNER: Named Entity Recognizer models for multiple Indian languages. The models are trained on the Naamapadam NER dataset mined from Samanantar parallel corpora. [paper][huggingface][web]
  • IndicBERTv2: Language model trained on IndicCorp v2 with competitive performance on IndicXTREME [paper][code][web]

Word Embedding

  • FastText (CommonCrawl + Wikipedia) : Pretrained Word vector (CommonCrawl + Wikipedia). Trained on Common Crawl and Wikipedia using fastText. Select the language "oriya" from the model list.
  • FastText (Wikipedia) : Pretrained Word vector (Wikipedia). Trained on Wikipedia using fastText. Select the language "oriya" from the model list.
  • IndicFT: Word embeddings for 11 Indian languages trained on IndicCorp. The embeddings are based on the fastText model and are well suited for the morphologically rich nature of Indic languages. [paper][code][web]
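The pretrained fastText vectors listed above are commonly distributed in a plain-text `.vec` format: a header line with the word count and dimension, then one line per word followed by its vector values. The following is a minimal sketch of reading that format with the standard library only and comparing two words by cosine similarity. The helper names `load_vec` and `cosine` are hypothetical, and the three toy vectors below are made-up values standing in for a real downloaded file.

```python
import io
import math


def load_vec(handle, limit=None):
    """Parse fastText text-format vectors from an open text handle.

    Expected layout: a header line "<count> <dim>", then one line per
    word of the form "<word> <v1> ... <vdim>".
    """
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for i, line in enumerate(handle):
        if limit is not None and i >= limit:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip lines that do not match the declared dimension
        vectors[word] = [float(v) for v in values]
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy data with invented values; a real file would be opened with
# open(path, encoding="utf-8") instead of io.StringIO.
sample = "3 2\nଓଡ଼ିଆ 1.0 0.0\nଭାଷା 1.0 0.0\nଖେଳ 0.0 1.0\n"
vectors = load_vec(io.StringIO(sample))
similar = cosine(vectors["ଓଡ଼ିଆ"], vectors["ଭାଷା"])
different = cosine(vectors["ଓଡ଼ିଆ"], vectors["ଖେଳ"])
```

The `limit` argument is there because full pretrained files can hold a very large vocabulary; reading only the first rows is often enough for a quick check.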

Morphanalyzers

  • IndicNLP Morphanalyzers : Unsupervised morphanalyzers for 10 Indian languages including Odia learnt using morfessor.

Language Generation

Instruction Set

  • Odia master data llama2: This dataset contains 180k Odia instruction sets translated from open-source instruction sets and Odia domain knowledge instruction sets.
  • Odia context 10k llama2 set: This dataset contains 10K instructions that span various facets of Odisha's unique identity. The instructions cover a wide array of subjects, ranging from the culinary delights in 'RECIPES,' the historical significance of 'HISTORICAL PLACES,' and 'TEMPLES OF ODISHA,' to the intellectual pursuits in 'ARITHMETIC,' 'HEALTH,' and 'GEOGRAPHY.' It also explores the artistic tapestry of Odisha through 'ART AND CULTURE,' which celebrates renowned figures in 'FAMOUS ODIA POETS/WRITERS', and 'FAMOUS ODIA POLITICAL LEADERS'. Furthermore, it encapsulates 'SPORTS' and the 'GENERAL KNOWLEDGE OF ODISHA,' providing an all-encompassing representation of the state.
  • Roleplay Odia: This dataset contains 1k Odia role-play instruction sets in conversation format.
  • OdiEnCorp translation instructions 25k: This dataset contains 25k English-to-Odia translation instruction sets.

Pre-train Dataset

  • CulturaX: A multilingual dataset that contains monolingual data for several Indic languages (Hindi, Bangla, Tamil, Malayalam, Marathi, Telugu, Kannada, Gujarati, Punjabi, Odia, Assamese, etc.). Paper
  • Varta: The dataset contains 41.8 million news articles in 14 Indic languages and English, crawled from DailyHunt, a popular news aggregator in India that pulls high-quality articles from multiple trusted and reputed news publishers.

Foundation LLM

  • Qwen 1.5 Odia 7B: A pre-trained Odia large language model with 7 billion parameters, based on Qwen 1.5-7B. The model is pre-trained on the Culturex-Odia dataset, a filtered version of the original CulturaX dataset for Odia text. According to the authors, it is a base model and not meant to be used as is; they recommend first fine-tuning it on downstream tasks. Blog

Fine-Tuned LLM

Benchmarking Set

  • Airavata Evaluation Suite: A collection of benchmarks used for evaluation of Airavata, a Hindi instruction-tuned model on top of Sarvam's OpenHathi base model.
  • Indic LLM Benchmark: A collection of LLM benchmark data in Gujarati, Nepali, Malayalam, Hindi, Telugu, Marathi, Kannada, Bengali.

Text Dataset

Parallel Translation Corpus

  • OdiEnCorp 2.0 : This dataset contains 97K English-Odia parallel sentences and was used for the Odia-English machine translation task at WAT2020. Paper
  • OPUS Corpus : It contains parallel sentences pairing Odia with other languages. The data is domain-specific and noisy.
  • OdiEnCorp 1.0 : This dataset contains 30K English-Odia parallel sentences. Paper
  • IndoWordnet Parallel Corpus : Parallel corpora mined from IndoWordNet gloss and/or examples for Indian-Indian language corpora (6.3 million segments, 18 languages including Odia). Paper
  • PMIndia : Parallel corpus for En-Indian languages mined from Mann ki Baat speeches of the PM of India. It contains 38K English-Odia parallel sentences. Paper
  • CVIT PIB : Parallel corpus for En-Indian languages mined from the Press Information Bureau website of India. It contains 60K English-Odia parallel sentences.
  • Samanantar : The largest publicly available parallel corpora collection for Indic languages. The corpus has 49.6M sentence pairs between English and Indian languages.
  • BPCC : A comprehensive, publicly available parallel corpus containing a mix of human-labelled data and automatically mined data, totaling approximately 230 million bitext pairs. [Paper]

Monolingual Corpus

Lexical Resources

  • IndoWordNet : Wordnet for Indian languages including Odia.

POS Tagged corpus

Dialect Detection corpus

Text Classification

  • Odia News Article Classification : This dataset contains approximately 19,000 news article headlines collected from Odia news websites. The labeled dataset is split into training and test sets suitable for supervised text classification.
  • AI4Bharat IndicNLP News Articles : This dataset comprises news articles and their categories for 9 languages including Odia. For Odia, it has 4 classes (business, crime, entertainment, sports) and each class contains 7.5K news articles. The dataset is balanced across classes. Paper
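Since the entry above describes the Odia portion as balanced across four classes, a quick sanity check after loading any labeled split is to measure each class's share of the examples. The sketch below is a minimal, standard-library illustration. The helper names `class_balance` and `is_balanced`, the `tolerance` parameter, and the toy label list are assumptions made for this example and are not part of either dataset's tooling.

```python
from collections import Counter


def class_balance(labels):
    """Return each label's share of the total as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def is_balanced(labels, tolerance=0.05):
    """True if every class share is within `tolerance` of a uniform split."""
    shares = class_balance(labels)
    if not shares:
        return True
    expected = 1 / len(shares)
    return all(abs(share - expected) <= tolerance for share in shares.values())


# Toy labels using the four class names from the list entry above.
labels = ["business", "crime", "entertainment", "sports"] * 5
shares = class_balance(labels)
balanced = is_balanced(labels)
```

A skewed split, for example nine "business" labels and one "crime" label, would fail the same check, which is a cue to stratify the train/test split or reweight classes.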

NLP Libraries / Tools

Other NLP Resources

  • TDIL : Hosts language applications, resources, and tools for Indian languages including Odia, such as an Odia terminology application, an Odia language search engine, a wordnet, an English-Odia parallel text corpus, English-Odia machine-assisted translation, text-to-speech software, and many more.

Audio

Speech Recognition

Text-to-Speech

Speech Dataset

  • IIT Madras IndicTTS : The Indic TTS project develops text-to-speech (TTS) synthesis systems for Indian languages including Odia. The database contains spoken sentences/utterances recorded by both male and female native speakers.
  • LDC-IL : Includes annotated Odia speech corpora with the voices of 450 different native speakers.
  • Mozilla Common Voice : The Mozilla Common Voice project is a community-led project to build a large multilingual dataset for speech recognition.

Computer Vision

OCR

  • Indic-OCR : OCR tools for Indic scripts including Odia. Also supports Ol Chiki (Santali).

Events

Community