mi-erasmusmc/Mantra-Gold-Standard-Corpus

Mantra-Gold-Standard-Corpus

To create a multilingual gold-standard corpus for biomedical concept recognition, we selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified Medical Language System and covering a wide range of semantic groups. To reduce the annotation workload, automatically generated pre-annotations were provided. Individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final annotations.

The corpus contains 5,530 final annotations. Inter-annotator agreement is good (median F-score 0.79) and is similar to the agreement between individual annotators and the gold standard. For each language, the automatically generated harmonized annotation set performed as well as the best annotator for that language.
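As a rough illustration of how an F-score between two annotation sets can be computed, here is a minimal sketch. It assumes exact matching on character offsets and concept identifier; the matching criteria actually used for the reported scores are not specified here, and the function name, tuple layout, and sample identifiers are hypothetical placeholders.

```python
from typing import Set, Tuple

# Assumed layout: (start offset, end offset, concept identifier)
Annotation = Tuple[int, int, str]


def f_score(reference: Set[Annotation], candidate: Set[Annotation]) -> float:
    """Harmonic mean of precision and recall under exact matching."""
    if not reference and not candidate:
        return 1.0  # two empty sets agree trivially
    true_positives = len(reference & candidate)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(candidate)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Made-up annotations; the identifiers are placeholders, not real UMLS codes.
annotator_a = {(0, 7, "CUI_A"), (12, 20, "CUI_B"), (25, 30, "CUI_C")}
annotator_b = {(0, 7, "CUI_A"), (12, 20, "CUI_B"), (40, 45, "CUI_D")}

print(f"F-score: {f_score(annotator_a, annotator_b):.2f}")
```

With exact matching the score is symmetric, so it does not matter which annotator is treated as the reference when comparing two annotators.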

The use of automatic pre-annotations, harmonized annotations, and parallel corpora helped to keep the manual annotation efforts manageable. The inter-annotator agreement scores provide a reference standard for gauging the performance of automatic annotation techniques.

To our knowledge, this is the first gold-standard corpus for biomedical concept recognition in languages other than English. Other distinguishing features are the wide variety of semantic groups covered and the diversity of text genres annotated.

Availability

The Mantra GSC is available in brat format and also in XML format.
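For readers unfamiliar with brat, its standoff format stores annotations in `.ann` files, where text-bound annotations are tab-separated lines of the form `ID<TAB>TYPE START END<TAB>TEXT`. The sketch below reads such lines from a string. The sample content and type labels are invented for illustration, and how this corpus encodes concept identifiers in its files is not covered; check the actual files before relying on it.

```python
from typing import List, NamedTuple, Tuple


class TextBound(NamedTuple):
    ann_id: str
    label: str
    spans: List[Tuple[int, ...]]  # one (start, end) pair per fragment
    text: str


def parse_brat_text_bounds(ann_content: str) -> List[TextBound]:
    """Collect text-bound ("T") annotations from brat standoff content."""
    results = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue  # skip notes, relations, and other annotation kinds
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # malformed line
        ann_id, type_and_offsets, text = parts[0], parts[1], parts[2]
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous annotations separate their fragments with ';'
        spans = [tuple(int(x) for x in frag.split()) for frag in offsets.split(";")]
        results.append(TextBound(ann_id, label, spans, text))
    return results


# Invented sample content, not taken from the corpus.
sample = (
    "T1\tDISO 0 8\tdiabetes\n"
    "#1\tAnnotatorNotes T1\tplaceholder note\n"
    "T2\tCHEM 23 30;35 40\tinsulin human\n"
)

for ann in parse_brat_text_bounds(sample):
    print(ann.ann_id, ann.label, ann.spans, ann.text)
```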

Citing the Mantra Corpus

If you have used the corpus in your study, please cite:

Kors JA, Clematide S, Akhondi SA, van Mulligen EM, Rebholz-Schuhmann D. (2015) A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC. J Am Med Inform Assoc 2015;0:1–11. doi:10.1093/jamia/ocv037
