Datasets

Seid Muhie Yimam edited this page Nov 5, 2021 · 13 revisions

Introduction

This page lists and discusses the datasets and corpora used to build semantic and NLP models for Amharic.

Corpus


NLP Datasets


POS tagging

The POS-tagged benchmark dataset is derived from the work of Gashaw and Shashirekha (2018). The sizes of the training, development, and test splits are listed below.

| Split | Number of sentences |
| --- | --- |
| Training set | 29521 |
| Development set | 1678 |
| Test set | 1687 |
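
To see how the sentences are distributed across the three splits, the counts listed above can be turned into proportions. The snippet below is only an illustrative sketch: the names `SPLIT_SIZES` and `split_proportions` are hypothetical and are not part of the dataset or of any released code.

```python
# Sentence counts copied from the POS tagging split sizes listed above.
# The dictionary and function names are illustrative assumptions.
SPLIT_SIZES = {
    "train": 29521,
    "dev": 1678,
    "test": 1687,
}


def split_proportions(sizes):
    """Return each split's share of the total sentence count."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


if __name__ == "__main__":
    print(f"total sentences: {sum(SPLIT_SIZES.values())}")
    for name, share in split_proportions(SPLIT_SIZES).items():
        print(f"{name}: {share:.1%}")
```

The three splits add up to 32886 sentences, which works out to roughly a 90/5/5 division between training, development, and test data.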