Skip to content

Commit

Permalink
GeMTeX Use-Case
Browse files Browse the repository at this point in the history
  • Loading branch information
reckart committed Nov 15, 2024
1 parent d22c298 commit 43ef43d
Show file tree
Hide file tree
Showing 3 changed files with 53 additions and 2 deletions.
4 changes: 2 additions & 2 deletions _use-cases/digital-athenaeus/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,7 @@ in separate lines. <a href="#figure1">Figure 1</a> shows an example of a TSV fil
annotated named entities and corresponding lemmata (Ath., *Deipn*. 1.7):

<figure id="figure1">
<img src="image1.jpeg" alt="my alt text"/>
<img src="image1.jpeg" alt="Data example"/>
<figcaption>Figure 1. TSV 3.2 file format (NEs and lemmata in Ath., Deipn. 1.7)</figcaption>
</figure>

Expand All @@ -64,7 +64,7 @@ one for the named entity tag, and the other for the lemma (Ath.,
*Deipn*. 1.7):

<figure id="figure2">
<img src="image2.jpg" alt="my alt text"/>
<img src="image2.jpg" alt="Screenshot"/>
<figcaption>Figure 2. INCEpTION: pre-annotated data (Ath., Deipn. 1.7)</figcaption>
</figure>

Expand Down
1 change: 1 addition & 0 deletions _use-cases/gemtex/image1.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
50 changes: 50 additions & 0 deletions _use-cases/gemtex/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
---
title: GeMTeX
subheadline: German Medical Text Corpus
permalink: /use-cases/gemtex/
#screenshot: screenshot.png
#thumbnail: screenshot-thumb.png
hidden: false
---

**Source**: <i>This example was kindly contributed by
<a href="https://www.informatik.tu-darmstadt.de/ukp/ukp_home/staff_ukp/detailseite_mitarbeiter_1_41920.en.jsp">Christina Lohr</a>,
Institute for Medical Informatics, Statistics and Epidemiology of University Leipzig, Germany</i>


In everyday clinical practice, numerous texts are generated, such as doctors' letters and reports, which contain valuable information about the progression, development, and treatment of diseases. These texts could be leveraged by natural language processing (NLP) tools to support healthcare professionals and researchers. However, the full potential of these clinical documents remains untapped due to a lack of standardization. The [GeMTeX (German Medical Text Corpus)][1] platform aims to bridge this gap by making medical texts from patient care available for research projects. The goal is to create ob of the largest corpus of medical texts in the German language.

Through the GeMTeX project, six university medical centers in Munich, Leipzig, Essen, Berlin, Dresden, and Erlangen are collecting documents from electronic patient records (ePA) with patient consent. These documents are annotated using the INCEpTION annotation platform. Using NLP techniques, the documents are processed in compliance with data protection regulations and made available in anonymized form for shared research use. This effort creates a valuable text corpus for research and development.

The initial phase of the project focuses on de-identification—a process that obscures information that could reveal personal identities. To facilitate this, medical students from the universities of Leipzig and Erlangen, along with a team of experts in linguistics, medicine, and informatics, conducted a pilot study. They created 1,438 annotations on simulated doctors' letters from the Graz Synthetic Text Corpus (GRASCCO).

The annotated documents were published on the international research data platform [Zenodo][3], serving as a model for future projects. Alongside the annotated corpus, a publication was released that describes the process for de-identifying medical documents. This "de-identification pipeline" includes the following steps:

* Exporting clinical texts as raw data from the local hospital information system
* Importing the data into the INCEpTION annotation platform
* Automated preliminary annotation of text segments with personally identifiable information using the Averbis Health Discovery Pipeline
* Manual review and correction of annotations through a two-person verification process
* Automated replacement of pre-annotated and corrected data with appropriate pseudonyms

<figure id="figure2">
<img src="image1.svg" alt="Processing and annotation workflow"/>
<figcaption>Figure 1. Processing and annotation workflow</figcaption>
</figure>


The project is currently work in progress. During the further course of the project, INCEpTION will be used to create, correct and curate additional annotations conforming to the [SNOMED-CT][2] terminology.

##### References

* Lohr C, Matthies F, Faller J, et al.
De-Identifying GRASCCO - A Pilot Study for the De-Identification of the German Medical Text Project (GeMTeX) Corpus.
In: Stud Health Technol Inform. 2024; 317:171-179.
[[doi:10.3233/SHTI240853]](https://doi.org/10.3233/shti240853)

* Lohr, C., Matthies, F., Jakob, F., Modersohn, L., Riedel, A., Hahn, U., Kiser, R., Boeker, M., & Meineke, F. (2024).
GraSCCo_PHI - Graz Synthetic Clinical text Corpus with Protected Health Information Annotations [Data set]. Zenodo.
[https://doi.org/10.5281/zenodo.11502329][3]

[1]: https://www.smith.care/en/gemtex_mii/about-gemtex/
[2]: https://www.snomed.org
[3]: https://zenodo.org/records/11502329

0 comments on commit 43ef43d

Please sign in to comment.