The Semantic Pleonasm Corpus (SPC) is a collection of three thousand sentences. Each sentence features a pair of potentially semantically related words (chosen by a heuristic); human annotators determine whether either (or both) of the words is redundant. The corpus offers two improvements over current resources:
- First, the corpus filters for grammatical sentences so that the question of redundancy is separated from grammaticality.
- Second, the corpus is filtered for a balanced set of positive examples (a redundant word is present) and negative examples (no redundancy).
The negative examples may make useful benchmark data: because each of them contains a pair of words deemed semantically related, a successful system cannot rely on simple heuristics, such as semantic distance, to tell redundant pairs from non-redundant ones.
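To illustrate why a distance heuristic alone falls short on such data, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Example` record layout, the two sentences, the two-dimensional word vectors, and the 0.9 threshold are invented and are not taken from the corpus or from its actual file format.

```python
from dataclasses import dataclass
from math import sqrt
from typing import FrozenSet, Tuple


@dataclass
class Example:
    """Hypothetical record layout; the real corpus format may differ."""
    sentence: str
    pair: Tuple[str, str]       # the two semantically related words
    redundant: FrozenSet[str]   # words judged redundant; empty = negative example


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


# Invented toy vectors: both word pairs are close in this space.
TOY_VECTORS = {
    "free": (0.9, 0.4),
    "gift": (0.8, 0.6),
    "doctor": (0.3, 0.9),
    "nurse": (0.4, 0.9),
}

# Invented sentences, one positive and one negative.
EXAMPLES = [
    Example("She received a free gift.", ("free", "gift"), frozenset({"free"})),
    Example("The doctor thanked the nurse.", ("doctor", "nurse"), frozenset()),
]


def distance_baseline(example, threshold=0.9):
    """Predict 'redundant' whenever the pair is closer than the threshold."""
    u, v = (TOY_VECTORS[word] for word in example.pair)
    return cosine(u, v) >= threshold


predictions = [distance_baseline(ex) for ex in EXAMPLES]
gold = [bool(ex.redundant) for ex in EXAMPLES]
print(predictions, gold)
```

Because both pairs are semantically close, the baseline flags both sentences as redundant and gets the negative example wrong. A system evaluated on a balanced set of this kind needs evidence beyond the similarity of the two words, such as the surrounding sentence context.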
Omid Kashefi, Andrew T. Lucas, Rebecca Hwa, Semantic Pleonasm Detection. Proceedings of the NAACL-HLT, pp. 225--230, New Orleans, LA, 2018.