Datasets
On this page, we will list and discuss the different datasets and corpora used to build the different semantic and NLP models for Amharic.
The Amharic named entity dataset used here was annotated within the SAY project at New Mexico State University’s Computing Research Laboratory. The data is annotated with six classes, namely person, location, organization, time, title, and others.
There are a total of 4237 sentences, in which 5480 tokens out of 109,676 tokens are annotated as named entities. The dataset is represented in XML format (for the different named entity classes) and is openly available in this GitHub repository. We have converted the data into the CoNLL data format.
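As a rough illustration of how data in a CoNLL-style layout can be read and summarized, here is a minimal sketch. It assumes one token and its tag per line, whitespace-separated, with a blank line between sentences and non-entity tokens tagged `O`. The tag names (`B-PER`, `B-LOC`, `I-LOC`) and the sample sentences are hypothetical and are not taken from the dataset itself, so check the converted files for the actual column layout and label set.

```python
from typing import Iterable, List, Tuple

# A sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences (blank line = sentence boundary)."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # Assumption: first column is the token, last column is the tag.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def entity_stats(sentences: List[Sentence], outside_tag: str = "O") -> Tuple[int, int, int]:
    """Return (number of sentences, number of tokens, number of entity tokens)."""
    total_tokens = sum(len(sentence) for sentence in sentences)
    entity_tokens = sum(
        1 for sentence in sentences for _, tag in sentence if tag != outside_tag
    )
    return len(sentences), total_tokens, entity_tokens


# Hypothetical sample, for illustration only.
SAMPLE = """\
አበበ B-PER
ወደ O
አዲስ B-LOC
አበባ I-LOC
ሄደ O

እሱ O
መጣ O
"""

sentences = read_conll(SAMPLE.splitlines())
n_sentences, n_tokens, n_entity_tokens = entity_stats(sentences)
print(n_sentences, n_tokens, n_entity_tokens)
```

Applied to the full dataset, the same counts should correspond to the figures reported above, where 5480 of 109,676 tokens (roughly 5%) are annotated as named entities.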
This is the first benchmark dataset (split into train/development/test sets) with state-of-the-art (SOTA) results based on our paper.
The dataset can be downloaded from here.
The POS-tagged benchmark dataset is prepared from the work of Gashaw and Shashirekha (2018). Below are the training, development, and test set splits:
Split | Number of sentences |
---|---|
Training set | 29521 |
Development set | 1678 |
Test set | 1687 |
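To make the relative sizes of the splits easier to see, the short sketch below computes each split's share of the total from the sentence counts in the table. The variable names are illustrative only and are not part of any released code.

```python
# Sentence counts copied from the table above.
SPLITS = {
    "Training set": 29521,
    "Development set": 1678,
    "Test set": 1687,
}

total = sum(SPLITS.values())

# Fraction of all sentences that falls into each split.
shares = {name: count / total for name, count in SPLITS.items()}

for name, count in SPLITS.items():
    print(f"{name}: {count} sentences ({shares[name]:.1%})")
print(f"Total: {total} sentences")
```

The three splits add up to 32,886 sentences, with the training set accounting for about 90% of them.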
ASAB is the first of its kind to conduct surveys based on a specific reward scheme, namely mobile card vouchers. The datasets, code, and annotation tools for Amharic sentiment analysis are described on this GitHub page.
You can read our paper, "Exploring Amharic Sentiment Analysis from Social Media Texts: Building Annotation Tools and Classification Models", for more details.