Datasets
This page lists and discusses the datasets and corpora used to build the semantic and NLP models for Amharic.
Details about the corpus used to build the semantic models are available here. You can download the corpus from the Mendeley Dataset Repository.
We used the Amharic named entity dataset annotated within the SAY project at New Mexico State University’s Computing Research Laboratory. The data is annotated with six classes, namely person, location, organization, time, title, and others.
There are a total of 4,237 sentences, in which 5,480 of the 109,676 tokens are annotated as named entities. The dataset is represented in XML format (for the different named entity classes) and is openly available in this GitHub repository. We have converted the data into the CoNLL data format.
This is the first benchmark dataset (split into train/development/test sets), with state-of-the-art (SOTA) results reported in our paper.
The dataset can be downloaded from here.
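For readers who want to work with the CoNLL-formatted named entity data, the following is a minimal sketch of a reader, not the authors' own loader. It assumes one token per line with the tag in the last whitespace-separated column, a blank line between sentences, and `O` as the tag for tokens outside any entity. The tag names in the sample (`B-PER`, `B-LOC`, `I-LOC`) are hypothetical and may not match the labels used in the released files.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse CoNLL-style lines into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence (assumed convention).
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # Assumption: first column is the token, last column is the tag.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def entity_token_share(sentences: List[Sentence], outside_tag: str = "O") -> float:
    """Return the fraction of tokens whose tag differs from outside_tag."""
    total = sum(len(sentence) for sentence in sentences)
    tagged = sum(1 for sentence in sentences for _, tag in sentence if tag != outside_tag)
    return tagged / total if total else 0.0


# Tiny in-memory sample with hypothetical tags, used only for illustration.
sample = """\
አበበ B-PER
ወደ O
አዲስ B-LOC
አበባ I-LOC
ሄደ O

እሱ O
መጣ O
""".splitlines()

sentences = read_conll(sample)
share = entity_token_share(sentences)
print(len(sentences), f"{share:.3f}")
```

Applied to the full dataset, the same share would correspond to 5,480 of 109,676 tokens, roughly 5% of all tokens, provided the converted files follow the layout assumed here.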
The POS-tagged benchmark dataset is prepared from the work of Gashaw and Shashirekha (2018). The training, development, and test set splits are as follows:
Split | Number of sentences |
---|---|
Training set | 29521 |
Development set | 1678 |
Test set | 1687 |
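As a quick check on how the POS data is partitioned, the short sketch below computes each split's share of the total from the sentence counts in the table. The dictionary keys are illustrative names, not file names from the repository.

```python
# Sentence counts copied from the table above.
splits = {"train": 29521, "development": 1678, "test": 1687}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    # Print each split's size and its percentage of all sentences.
    print(f"{name:<12} {count:>6} {shares[name]:.1%}")
print(f"{'total':<12} {total:>6}")
```

The counts sum to 32,886 sentences, so the split is close to 90/5/5.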
ASAB is the first of its kind to conduct surveys based on a specific reward scheme, namely mobile card vouchers. The datasets, code, and annotation tools for Amharic sentiment analysis are described on this GitHub page.
You can read our paper entitled Exploring Amharic Sentiment Analysis from Social Media Texts: Building Annotation Tools and Classification Models for more details.
We present the first dataset for the Amharic question classification task. The data is collected from the @AskAnythingEthiopia public Telegram channel. The question classification data comes in two types: questions written in Amharic, and Amharic questions that were transliterated from English (Latin) script.
Both the Amharic and the transliterated datasets are split into train and test sets. In addition, the file linked here contains both the Latin-script text and the transliterated text. To reproduce our results, use the transliterated train and test splits.
Question Type | Number of Questions |
---|---|
Amharic | 7,967 |
Transliterated | 23,541 |
This is the first large-scale machine translation (MT) dataset for Amharic.
You can read our paper entitled The Effect of Normalization for Bi-directional Amharic-English Neural Machine Translation for more details.