diff --git a/_use-cases/digital-athenaeus/index.md b/_use-cases/digital-athenaeus/index.md
index 89da5b1..0bf70c8 100644
--- a/_use-cases/digital-athenaeus/index.md
+++ b/_use-cases/digital-athenaeus/index.md
@@ -54,7 +54,7 @@ in separate lines. Figure 1 shows an example of a TSV fil
 annotated named entities and corresponding lemmata (Ath., *Deipn*. 1.7):
- my alt text
+ Data example
Figure 1. TSV 3.2 file format (NEs and lemmata in Ath., Deipn. 1.7)
@@ -64,7 +64,7 @@ one for the named entity tag, and the other for the lemma (Ath., *Deipn*. 1.7):
- my alt text
+ Screenshot
Figure 2. INCEpTION: pre-annotated data (Ath., Deipn. 1.7)
diff --git a/_use-cases/gemtex/image1.svg b/_use-cases/gemtex/image1.svg
new file mode 100644
index 0000000..52a7b42
--- /dev/null
+++ b/_use-cases/gemtex/image1.svg
@@ -0,0 +1 @@
+HospitalInformationSystemrawtexttextcorpusde-identifikationSemanticAnnotationResearchAverbis HealthDisvoveryde-idpipelinelocalautomat.detektionIDAT/ PHIINCEpTIONManualcorrectionIDAT / PHI4-EyesprinciplelocalPseudonymizationreplacementviasurrogatesofIDAT/PHI
\ No newline at end of file
diff --git a/_use-cases/gemtex/index.md b/_use-cases/gemtex/index.md
new file mode 100644
index 0000000..46b0953
--- /dev/null
+++ b/_use-cases/gemtex/index.md
@@ -0,0 +1,50 @@
+---
+title: GeMTeX
+subheadline: German Medical Text Corpus
+permalink: /use-cases/gemtex/
+#screenshot: screenshot.png
+#thumbnail: screenshot-thumb.png
+hidden: false
+---
+
+**Source**: This example was kindly contributed by
+Christina Lohr,
+ Institute for Medical Informatics, Statistics and Epidemiology of University Leipzig, Germany
+
+
+In everyday clinical practice, numerous texts are generated, such as doctors' letters and reports, which contain valuable information about the progression, development, and treatment of diseases. These texts could be leveraged by natural language processing (NLP) tools to support healthcare professionals and researchers. However, the full potential of these clinical documents remains untapped due to a lack of standardization. The [GeMTeX (German Medical Text Corpus)][1] platform aims to bridge this gap by making medical texts from patient care available for research projects. The goal is to create one of the largest corpora of medical texts in the German language.
+
+Through the GeMTeX project, six university medical centers in Munich, Leipzig, Essen, Berlin, Dresden, and Erlangen are collecting documents from electronic patient records (ePA) with patient consent. These documents are annotated using the INCEpTION annotation platform.
Using NLP techniques, the documents are processed in compliance with data protection regulations and made available in anonymized form for shared research use. This effort creates a valuable text corpus for research and development.
+
+The initial phase of the project focuses on de-identification, a process that obscures information that could reveal personal identities. To facilitate this, medical students from the universities of Leipzig and Erlangen, along with a team of experts in linguistics, medicine, and informatics, conducted a pilot study. They created 1,438 annotations on simulated doctors' letters from the Graz Synthetic Clinical text Corpus (GRASCCO).
+
+The annotated documents were published on the international research data platform [Zenodo][3], serving as a model for future projects. Alongside the annotated corpus, a publication was released that describes the process for de-identifying medical documents. This "de-identification pipeline" includes the following steps:
+
+* Exporting clinical texts as raw data from the local hospital information system
+* Importing the data into the INCEpTION annotation platform
+* Automated preliminary annotation of text segments with personally identifiable information using the Averbis Health Discovery Pipeline
+* Manual review and correction of annotations through a two-person verification process
+* Automated replacement of pre-annotated and corrected data with appropriate pseudonyms
+
+
+ Processing and annotation workflow +
Figure 1. Processing and annotation workflow
+
+
+
+The project is currently a work in progress. As the project continues, INCEpTION will be used to create, correct, and curate additional annotations conforming to the [SNOMED-CT][2] terminology.
+
+##### References
+
+* Lohr C, Matthies F, Faller J, et al.
+ De-Identifying GRASCCO - A Pilot Study for the De-Identification of the German Medical Text Project (GeMTeX) Corpus.
+ In: Stud Health Technol Inform. 2024; 317:171-179.
+ [[doi:10.3233/SHTI240853]](https://doi.org/10.3233/shti240853)
+
+* Lohr, C., Matthies, F., Jakob, F., Modersohn, L., Riedel, A., Hahn, U., Kiser, R., Boeker, M., & Meineke, F. (2024).
+ GraSCCo_PHI - Graz Synthetic Clinical text Corpus with Protected Health Information Annotations [Data set]. Zenodo.
+ [https://doi.org/10.5281/zenodo.11502329][3]
+
+[1]: https://www.smith.care/en/gemtex_mii/about-gemtex/
+[2]: https://www.snomed.org
+[3]: https://zenodo.org/records/11502329