DE-NERmed (pronounced: de:e: ner:med:) is an object-oriented named entity recognizer (NER) for German texts with a focus on the (bio)medical domain.
It is based on Apache OpenNLP and provides pre-trained, binary maximum-entropy models for the `models` directory. These were trained in June 2024 on roughly 88,000 health-related German Wikipedia articles.
- Apache Maven in version 3.6+
- Java / OpenJDK in version 17+
- Apache OpenNLP in version 2.3.0+
- OpenNLP releases < 2.1.0 cannot reliably load the NER model files of this project!
- Check your classpath to make sure no older OpenNLP version is picked up!
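To verify which OpenNLP version is actually resolved at runtime, a tiny check like the following can help (a minimal sketch; the class name and setup are illustrative only):

```java
import opennlp.tools.util.Version;

public class ClasspathCheck {
    public static void main(String[] args) {
        // Prints the OpenNLP version resolved on the current classpath;
        // for this project it should report 2.3.0 or newer.
        System.out.println("OpenNLP on classpath: " + Version.currentVersion());
    }
}
```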
Build the project via Apache Maven. The command for the relevant parts is `mvn clean package`.
This should download all required dependencies which are:
- Apache OpenNLP,
- Apache Commons Lang3, and
- slf4j + log4j2 bindings.
If you want to re-use the current, experimental version of DE-NERmed in your own projects, execute `mvn clean install` to install the bundled JAR file into your local `.m2` repository.
Note:
You have to select one or more model files and copy them over to the execution environment. Those models must reside in the `models` directory, as the current code expects this directory name.
For a first impression, just execute `DENerDemo.java`, which will, by default, load the `DE-NERmed-Wiki_2023_medium-maxent.bin` model resource. The `NamedEntityRecognizer` instance will then find the NEs for German nouns from the (bio)medical domain.
Important
Due to limited LFS storage, no model files are included in the `models` directory of this Git repository when you clone it. You will have to download the model files separately. Once retrieved, place them in the `models` directory to start experimenting.
In the demo example, the German sentence `Der Urin des Patienten ist rot verfärbt.` ("The patient's urine is discolored red.") will be processed. The results are logged to stdout / the console and should look similar to:
```
INFO [main] OpenNLPModelServiceImpl(50) - Importing NLP model file 'DE-NERmed-Wiki_2023-maxent.bin' ...
INFO [main] DENerMedDemo(50) - Detecting NEs for: 'Der Urin des Patienten ist rot verfärbt.'
INFO [main] DENerMedDemo(80) - Found NE 'Urin' - [pos: 2, prob: 0,94]
```
Once the default examples are processed, you can enter your own text fragments for testing via an interactive mode. Hit `q` to quit the demo program and free up RAM on your local machine.
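Under the hood, detection boils down to the standard OpenNLP name finder API. The following is a minimal, self-contained sketch (not the actual demo code; the model path is an assumption and has to match the model you downloaded):

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.Span;

public class MinimalNerSketch {
    public static void main(String[] args) throws Exception {
        // Load a DE-NERmed model from the 'models' directory (path is an assumption).
        try (InputStream in = new FileInputStream("models/DE-NERmed-Wiki_2023_medium-maxent.bin")) {
            TokenNameFinderModel model = new TokenNameFinderModel(in);
            NameFinderME finder = new NameFinderME(model);

            // Tokenize the demo sentence and detect named entities.
            String[] tokens = SimpleTokenizer.INSTANCE.tokenize(
                    "Der Urin des Patienten ist rot verfärbt.");
            Span[] names = finder.find(tokens);

            for (Span name : names) {
                String text = String.join(" ",
                        Arrays.copyOfRange(tokens, name.getStart(), name.getEnd()));
                System.out.printf("Found NE '%s' - [pos: %d, prob: %.2f]%n",
                        text, name.getStart(), name.getProb());
            }
            // Reset adaptive data before processing an unrelated document.
            finder.clearAdaptiveData();
        }
    }
}
```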
The complete set of files consists of two models:
| Model name | F1 | Acc | Binary size | RAM required | External download required |
|---|---|---|---|---|---|
| DE-NERmed-Wiki_2023-maxent.bin | 0.8761 | 0.8922 | 342.2 MB | ~4096 MB | Yes |
| DE-NERmed-Wiki_2023-medium-maxent.bin | 0.8543 | 0.8754 | 57.6 MB | ~672 MB | Yes |
Table 1: Relevant properties of DE-NERmed models: performance (F1, Accuracy), binary size and memory required.
During the preparation phase, a synthetic text corpus was compiled, comprising close to 88,000 health-related Wikipedia articles and their German full texts, dated July 2023. The corpus contained roughly 2.4 million sentences. Next, these raw texts were automatically pre-annotated based on a full UMLS concept list, which included 56,600 medical NEs translated into German.
The NER models were trained with the open-source NLP toolkit Apache OpenNLP, using the following training parameters:
```properties
training.algorithm=maxent
training.iterations=300
training.cutoff=3
training.threads=8
use.token.end=false
language=de
```
The resulting binary model files were persisted for evaluation and later re-use in NLP applications with a NER component.
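In plain OpenNLP terms, such a training run corresponds roughly to the sketch below (the annotated-corpus path and output file name are assumptions, not part of this repository):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainingSketch {
    public static void main(String[] args) throws Exception {
        // Pre-annotated corpus in OpenNLP's <START:...> ... <END> format (hypothetical path).
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("corpus/pre-annotated-de.txt")),
                StandardCharsets.UTF_8);

        try (ObjectStream<NameSample> samples = new NameSampleDataStream(lines)) {
            // Mirror the parameters listed above: maxent, 300 iterations, cutoff 3, 8 threads.
            TrainingParameters params = new TrainingParameters();
            params.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
            params.put(TrainingParameters.ITERATIONS_PARAM, "300");
            params.put(TrainingParameters.CUTOFF_PARAM, "3");
            params.put(TrainingParameters.THREADS_PARAM, "8");

            TokenNameFinderModel model = NameFinderME.train(
                    "de", null, samples, params, new TokenNameFinderFactory());

            // Persist the binary model file for later re-use.
            try (OutputStream out = new FileOutputStream("models/my-custom-maxent.bin")) {
                model.serialize(out);
            }
        }
    }
}
```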
For the performance evaluation of the DE-NERmed models, n=101 text fragments were randomly selected from discharge letters originally created in the Chest Pain Unit at Heidelberg University Hospital. For inclusion, a text fragment had to consist of at least 20 tokens. For each sample, the relevant medical concepts (i.e., true positives and true negatives) were manually annotated by the author of this work.
Note
For legal (data protection) reasons, the evaluation corpus cannot be made public or passed on to individuals or third parties.
The (large) model (`DE-NERmed-Wiki_2023-maxent.bin`) achieved an F1 score of 0.8761 and an Accuracy of 0.8922 (TP=905; TN=1214; FP=65; FN=191); see Table 1 above.
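For reference, these scores follow directly from the standard definitions applied to the confusion counts above:

$$
F_1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} = \frac{1810}{2066} \approx 0.8761, \qquad
\mathit{Acc} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{2119}{2375} \approx 0.8922
$$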
It detected most of the relevant medical NEs associated with cardiology and the general medical domain. Misclassifications occurred primarily for NEs that are common to both general and medical language.
If you use one of the DE-NERmed models of this project or the code of this repository in your scientific work, please cite the GMDS 2024 paper as follows:
📝
Wiesner M. DE-NERmed: A Named Entity Recognition Model for the Detection of German Medical Entities. Gesundheit – gemeinsam. Kooperationstagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (GMDS), Deutschen Gesellschaft für Sozialmedizin und Prävention (DGSMP), Deutschen Gesellschaft für Epidemiologie (DGEpi), Deutschen Gesellschaft für Medizinische Soziologie (DGMS) und der Deutschen Gesellschaft für Public Health (DGPH). Dresden, 08.-13.09.2024. Düsseldorf: German Medical Science GMS Publishing House; 2024. DocAbstr. 979. DOI: 10.3205/24gmds182