
Datasets

Seid Muhie Yimam edited this page Nov 19, 2021 · 13 revisions

Introduction

This page lists and discusses the datasets and corpora used to build the semantic and NLP models for Amharic.


Corpus

Details about the corpus used to build the semantic models are available here. You can download the corpus from the Mendeley Dataset Repository.


NLP Datasets


Named Entity Recognition

We used the Amharic named entity dataset annotated within the SAY project at New Mexico State University’s Computing Research Laboratory. The data is annotated with six classes: person, location, organization, time, title, and others.

The dataset contains 4,237 sentences, in which 5,480 of the 109,676 tokens (about 5%) are annotated as named entities. It is represented in XML format (for the different named entity classes) and is openly available in this GitHub repository. We have converted the data into the CoNLL data format.
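For readers who want to inspect the converted data, the following is a minimal sketch of a CoNLL-style reader that also counts how many tokens carry an entity tag. It assumes one token per line with the tag in the last whitespace-separated column, blank lines between sentences, and `O` as the non-entity tag. The tag names in the sample (`B-PER`, `B-LOC`, and so on) and the sample sentences are made up for illustration; the actual column layout and tag set of the released files may differ.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences of (token, tag) pairs.

    Assumption: the token is the first column and the tag is the last one.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("-DOCSTART-"):
            # Skip document markers used by some CoNLL files.
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def count_entity_tokens(sentences: List[Sentence], outside_tag: str = "O") -> Tuple[int, int]:
    """Return (tokens tagged as part of an entity, total tokens)."""
    total = sum(len(sentence) for sentence in sentences)
    entities = sum(1 for sentence in sentences for _, tag in sentence if tag != outside_tag)
    return entities, total


# Hypothetical sample, not taken from the dataset.
sample = [
    "አበበ B-PER",
    "ወደ O",
    "አዲስ B-LOC",
    "አበባ I-LOC",
    "ሄደ O",
    "",
    "ዛሬ B-TIME",
    "መጣ O",
]

sentences = read_conll(sample)
entity_tokens, total_tokens = count_entity_tokens(sentences)
print(len(sentences), entity_tokens, total_tokens)
```

To run it on a real file, pass an open file handle instead of the sample list, for example `read_conll(open(path, encoding="utf-8"))`, where `path` points to wherever you saved the downloaded split.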

This is the first benchmark dataset (split into training, development, and test sets), with state-of-the-art results reported in our paper.

The dataset can be downloaded from here.


POS tagging

The POS-tagged benchmark dataset is prepared from the work of Gashaw and Shashirekha (2018). The sizes of the training, development, and test splits are listed below.

Split             Number of sentences
Training set      29,521
Development set   1,678
Test set          1,687
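The split sizes above correspond to roughly a 90/5/5 division. As a quick check, the short sketch below computes the total number of sentences and the share of each split. The dictionary keys and the function name are illustrative choices, not names used by the dataset.

```python
from typing import Dict

# Sentence counts per split, copied from the table above.
POS_SPLITS: Dict[str, int] = {
    "training": 29521,
    "development": 1678,
    "test": 1687,
}


def split_shares(splits: Dict[str, int]) -> Dict[str, float]:
    """Return each split's fraction of the total number of sentences."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


total_sentences = sum(POS_SPLITS.values())
shares = split_shares(POS_SPLITS)

print(f"total sentences: {total_sentences}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```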

Amharic Sentiment classification

ASAB is the first of its kind to conduct surveys based on a specific reward scheme, namely mobile card vouchers. The datasets, code, and annotation tools for Amharic sentiment analysis are described on this GitHub page.

You can read our paper entitled Exploring Amharic Sentiment Analysis from Social Media Texts: Building Annotation Tools and Classification Models for more details.
