Datasets

Tadesse Destaw edited this page Nov 5, 2022 · 13 revisions

Introduction

This page lists and discusses the datasets and corpora used to build the semantic and NLP models for Amharic.


Corpus

Details about the corpus used to build the semantic models are available here. You can download the corpus from the Mendeley Dataset Repository.


NLP Datasets


Named Entity Recognition

We used the Amharic named entity dataset annotated within the SAY project at New Mexico State University’s Computing Research Laboratory. The data is annotated with six classes, namely person, location, organization, time, title, and others.

There are a total of 4,237 sentences, in which 5,480 of the 109,676 tokens are annotated as named entities. The original dataset is represented in XML format (for the different named entity classes) and is openly available in this GitHub repository. We have converted the data into the CoNLL data format.
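For readers who want to inspect the converted data, here is a minimal sketch of reading a CoNLL-style file in Python. It assumes one token per line with the tag in the last whitespace-separated column and a blank line between sentences. The sample sentences, the tag names (`B-PER`, `B-LOC`, `I-LOC`, `O`), and the helper names `read_conll` and `count_entity_tokens` are illustrative assumptions, not taken from the released files.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group (token, tag) pairs into sentences; blank lines end a sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # Assumed layout: first column is the token, last column is the tag.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def count_entity_tokens(sentences: List[Sentence], outside_tag: str = "O") -> int:
    """Count tokens whose tag is anything other than the outside tag."""
    return sum(1 for sent in sentences for _, tag in sent if tag != outside_tag)


# Made-up sample for illustration only (not from the dataset).
sample = """\
አበበ B-PER
ወደ O
አዲስ B-LOC
አበባ I-LOC
ሄደ O

እሱ O
መጣ O
"""

sentences = read_conll(sample.splitlines())
print(len(sentences), "sentences,", count_entity_tokens(sentences), "entity tokens")
```

To run this on a real file, pass an open file handle instead of `sample.splitlines()`. On the full dataset, the figures above (5,480 of 109,676 tokens) correspond to roughly 5% of tokens being part of a named entity.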

This is the first benchmark dataset for Amharic NER (split into train, development, and test sets), with state-of-the-art results reported in our paper.

The dataset can be downloaded from here.


POS Tagging

The POS-tagged benchmark dataset is prepared from the work of Gashaw and Shashirekha (2018). The training, development, and test splits are as follows:

| Split | Number of sentences |
| --- | --- |
| Training set | 29,521 |
| Development set | 1,678 |
| Test set | 1,687 |
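
As a quick check on how the data is divided, the split sizes listed above can be turned into proportions. This is only a small arithmetic sketch; the dictionary keys are names chosen here for illustration.

```python
# Sentence counts per split, as listed on this page.
splits = {"train": 29521, "dev": 1678, "test": 1687}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name}: {count} sentences ({shares[name]:.1%} of {total})")
```

The three splits add up to 32,886 sentences, so the division is approximately 90/5/5.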

Amharic Sentiment Classification

ASAB is the first of its kind to conduct surveys based on a specific reward scheme, namely mobile card vouchers. The datasets, code, and annotation tools for Amharic sentiment analysis are described on this GitHub page.

You can read our paper entitled Exploring Amharic Sentiment Analysis from Social Media Texts: Building Annotation Tools and Classification Models for more details.


Amharic Question Classification

We present the first dataset for the Amharic question classification task. The data is collected from the @AskAnythingEthiopia public Telegram channel. There are two variants of the dataset: questions written in Amharic script, and Amharic questions transliterated from Latin (English) script.

Both the Amharic and the transliterated datasets are split into train and test sets. In addition, this file contains both the Latin-script text and its transliteration. If you want to reproduce our results, use the transliterated train and test splits.

| Dataset type | Number of questions |
| --- | --- |
| Amharic | 7,967 |
| Transliterated | 23,541 |

Amharic⇔English Machine Translation

This is the first large-scale MT dataset for the Amharic-English language pair.

You can read our paper entitled The Effect of Normalization for Bi-directional Amharic-English Neural Machine Translation for more details.