This repository contains the code to reproduce the results in: Florian Borchert, Christina Lohr, Luise Modersohn, Thomas Langer, Markus Follmann, Jan Philipp Sachs, Udo Hahn, Matthieu-P. Schapranow. GGPONC: A Corpus of German Medical Text with Rich Metadata Based on Clinical Practice Guidelines. ArXiv:2007.06400 [Cs]. [arXiv] [Code on GitHub] [data-request@DKG], accepted at [LOUHI@EMNLP'20]
- GGPONC source files:
- Follow the instructions of the GGPONC website (Access & Download)
- Copy
- PubMed Abstracts from German Case Reports and Case Descriptions
- Install the Entrez API from NCBI, or EDirect, the command-line tools for querying the PubMed infrastructure
- Open a terminal and type
esearch -db pubmed -query "Case Reports[Publication Type] AND GER[LA]" | efetch -format xml > allGermanPubMedCaseAbstracts.xml
(This step could take an hour.)
- Export the extracted file
- JSYNCC v1.1: follow the instructions of or contact Christina Lohr
- 3000PA: no public access
- KRAUTS Corpus (Strötgen et al):
- WikiWarsDe Corpus (Strötgen et al)
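The XML file written by the esearch / efetch command above can be inspected with a few lines of Python. The following is a minimal sketch, not the code used in this repository: the function name `iter_abstracts` is hypothetical, and it assumes the export is a single well-formed `PubmedArticleSet` document in which each `PubmedArticle` carries its identifier in `MedlineCitation/PMID` and its text in `MedlineCitation/Article/Abstract/AbstractText`.

```python
import xml.etree.ElementTree as ET


def iter_abstracts(source):
    """Yield (pmid, abstract_text) for every PubmedArticle that has an abstract.

    `source` is a file name or a binary file object (hypothetical helper,
    not part of this repository).
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        pmid = elem.findtext("MedlineCitation/PMID")
        # An abstract may consist of several labelled AbstractText sections.
        parts = [
            "".join(node.itertext()).strip()
            for node in elem.iterfind("MedlineCitation/Article/Abstract/AbstractText")
        ]
        text = " ".join(part for part in parts if part)
        if pmid and text:
            yield pmid, text
        # Drop the processed subtree; the export can be large.
        elem.clear()
```

For example, `sum(1 for _ in iter_abstracts("allGermanPubMedCaseAbstracts.xml"))` would count the articles with an abstract in the downloaded file.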
You need files from the UMLS.
To obtain them, register at UTS; you can then download the UMLS files from the U.S. National Library of Medicine (NIH).
For our current work, we used the UMLS release 2019AB. You need the following files:
- 2019AB MRSTY.RRF (only accessible from the full release zip file)
- Unzip the files.
More information on the UMLS can be found in the UMLS® Reference Manual.
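To illustrate what MRSTY.RRF contains, here is a minimal Python sketch that maps each concept identifier (CUI) to its semantic type identifiers (TUI). It is not part of this repository; the function name `load_semantic_types` is hypothetical. It relies on the pipe-delimited column order CUI, TUI, STN, STY, ATUI, CVF described in the UMLS Reference Manual, where every row ends with a trailing `|`.

```python
from collections import defaultdict

# Column order of MRSTY.RRF as documented in the UMLS Reference Manual.
MRSTY_COLUMNS = ("CUI", "TUI", "STN", "STY", "ATUI", "CVF")


def load_semantic_types(lines):
    """Return a dict mapping each CUI to the set of its TUIs.

    `lines` is any iterable of MRSTY.RRF rows, e.g. an open text file
    (hypothetical helper for illustration).
    """
    cui_to_tuis = defaultdict(set)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # The trailing "|" produces one extra empty field, which zip() ignores.
        row = dict(zip(MRSTY_COLUMNS, line.split("|")))
        cui_to_tuis[row["CUI"]].add(row["TUI"])
    return dict(cui_to_tuis)
```

A typical call would be `load_semantic_types(open("MRSTY.RRF", encoding="utf-8"))`, with the path adapted to wherever the release was unzipped.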
- Java 11 - We prefer Open JDK
- Apache Maven (mvn)
- Python 3
We prefer to use Eclipse IDE or IntelliJ IDEA.
- Configure the project as a Maven project
- In Eclipse: right click on project => Configure => Convert to Maven Project
- Command line:
mvn compile
- Run
mvn compile
before executing
mvn exec:java -Dexec.mainClass="de.hpi.guidelines.reader.GGPOncXMLReader" -Dexec.args="<Path to cpg-corpus-cms.xml>"
or run GGPOncXMLReader (in package de.hpi.guidelines.reader) in Eclipse (Run As => Java Application)
- Wait a minute
- Look into the directory
- We downloaded the PubMed data on February 21, 2020. If you download PubMed data with the esearch command now, you will receive a larger text corpus than our export. The file
contains a list of the PubMed identifiers we used, as of February 21, 2020.
- If you want to recreate the described data set from PubMed, import your extracted XML file and run the
. This code selects the PubMed texts we used from your newly created download.
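The idea behind this filtering step can be sketched in Python as follows. This is an illustration only, not the filtering code shipped with this repository: the function names are hypothetical, and the sketch assumes that the identifier list holds one PubMed ID per line and that the download is a single `PubmedArticleSet` document.

```python
import xml.etree.ElementTree as ET


def read_id_list(lines):
    """Collect PubMed IDs from an iterable of lines (assumed: one ID per line)."""
    return {line.strip() for line in lines if line.strip()}


def filter_articles(xml_source, wanted_pmids):
    """Parse a PubMed export and keep only the articles whose PMID is wanted.

    Returns the pruned ElementTree (hypothetical helper for illustration).
    """
    tree = ET.parse(xml_source)
    root = tree.getroot()
    # Copy the child list first, because we remove elements while looping.
    for article in list(root.findall("PubmedArticle")):
        pmid = article.findtext("MedlineCitation/PMID")
        if pmid not in wanted_pmids:
            root.remove(article)
    return tree
```

The pruned tree can then be saved with `tree.write("filtered.xml", encoding="utf-8", xml_declaration=True)`, where `filtered.xml` is a placeholder file name.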
- We worked with JuFiT v1.1; you can find the matching jar file in this repository.
- If you want to work with the original JuFiT, follow the steps below:
- Download JuFiT from
- Create the jar file with Apache Maven by running
mvn clean package
- run
java -jar JuFiT.jar MRCONSO.RRF MRSTY.RRF GER --grounded > UMLS_dict.txt
- Run the Java Code
) or the Python script extended_script_dictionaries/
- We used a list of gene names compiled from Entrez Gene and UniProt with the approach originating from Wermter et al.
- Code of JULIELab/gene-name-mapping
- The integration of this code into the GGPOnc repository is coming soon.
- To use the JCoRe pipelines, you will need one large dictionary file
- Run the script
to create one large dictionary (before running, adapt the path names in the script file)
- Or run the Java code
) (before running, adapt the path names in the script file)
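Conceptually, building the single large dictionary means concatenating the individual dictionaries and dropping repeated entries. The sketch below shows that idea in Python; it is not the script or Java code from this repository. The function name `merge_dictionaries` is hypothetical, and the sketch assumes that every dictionary stores one entry per line, without interpreting the entry format itself.

```python
def merge_dictionaries(sources):
    """Merge several dictionaries into one list of unique entries.

    `sources` is an iterable of line iterables (e.g. open text files).
    Entries keep the order in which they are first seen.
    Assumption: one dictionary entry per line.
    """
    seen = set()
    merged = []
    for lines in sources:
        for line in lines:
            entry = line.rstrip("\r\n")
            if entry and entry not in seen:
                seen.add(entry)
                merged.append(entry)
    return merged
```

The result could be written out with `"\n".join(merged)` to whatever file name the pipeline descriptors expect.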
- Unpack the
files in jcore-pipelines. There are 2 pipelines:
- dectectUMLSentries
- detectStopwords
- Create the folder
in the pipeline directories and put the data to be analyzed in the directory data/files (subdirectories are not read; be careful with *.tar files)
- Put the global dictionary file into
- Adapt the file names of the dictionary and the stopword dictionary in the following files:
Template Descriptor with Configurable ExternalResource.xml
Template Descriptor with Configurable ExternalResource.xml
- Open a terminal and change into one of the pipeline directories
- Start the pipeline with
java -jar ../jcore-pipeline-runner-base-0.4.1-SNAPSHOT-cli-assembly.jar run.xml
- Results
- This JCoRe pipeline is derived from the JULIE Lab's own jcore-pipeline-modules (see also
- To calculate the inter-annotator agreement between human annotators, follow the instructions of bratiaa
- To calculate precision and recall between the automatically created annotations and the human-annotated data, run:
pip install bratutils
python src/main/python/ <path to gold annotations> <path to automatic annotations>
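To make the two measures concrete, the following Python sketch computes precision and recall for a single pair of brat standoff files. It does not use bratutils and is not the evaluation script of this repository; the function names are hypothetical. It assumes exact matching, that is, an automatic annotation counts as correct only if its type and character offsets are identical to a gold annotation; the repository's script may apply a different matching criterion.

```python
def read_entities(ann_lines):
    """Collect (type, offsets) pairs from the text-bound lines of a brat .ann file.

    Text-bound annotations look like "T1<TAB>Type start end<TAB>covered text".
    """
    entities = set()
    for line in ann_lines:
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes, ...
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2:
            continue
        label, _, offsets = fields[1].partition(" ")
        entities.add((label, offsets))
    return entities


def precision_recall(gold, predicted):
    """Exact-match precision and recall between two entity sets."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Here precision is the share of automatic annotations that also appear in the gold data, and recall is the share of gold annotations that were found automatically.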