This guide will help you reproduce the COVID study from scratch.
This includes not only the SQL in this Cumulus Library study, but also the chart review side of things.
You will need:

- An existing Cumulus stack, with an already-built `core` study. See the general Cumulus documentation for setting that up.
- This repo installed via `pip install cumulus-library-covid` (a minimal install sketch follows this list).
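For example, if you want an isolated environment (any Python 3 setup works; the `.venv` path is just a convention):

```sh
python3 -m venv .venv          # create a fresh virtual environment
source .venv/bin/activate
pip install cumulus-library-covid
```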
This study operates on DocumentReference resources (it runs NLP on the referenced clinical notes). So we need to make sure you've got those handy.
Gather some DocumentReference ndjson from your EHR. You can either re-export the documents of interest, or use ndjson from a previous export.
If you are choosing a subset of documents, make sure to pull resources between March 2020 and June 2022. That's the study period of interest.
Place the ndjson in a folder, using filenames like `*.DocumentReference.*.ndjson`.
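For example, assuming you put the files in a hypothetical `./covid-docs` folder, you can sanity-check the naming like so:

```sh
# Should list one or more files; an empty result means the tooling won't find them
ls ./covid-docs/*.DocumentReference.*.ndjson
```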
- There are separate instructions for running the ETL and this COVID study's NLP.
- You should probably re-run your Cumulus AWS Glue crawler at this point, to pick up the new NLP table and its schema (see the sketch after this list).
- Then run this study with `cumulus-library` like so:

```sh
cumulus-library ... -t covid_symptom
```
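If you drive AWS from the command line, re-running the crawler might look like this sketch (the crawler name is a placeholder; substitute your own):

```sh
# Kick off the Glue crawler so Athena picks up the new NLP table's schema
aws glue start-crawler --name my-cumulus-crawler

# Crawlers run asynchronously; poll until the state returns to READY
aws glue get-crawler --name my-cumulus-crawler --query Crawler.State
```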
You should now have all the interesting results sitting in Athena.
In Athena's web console, run these commands and download the CSV results, using the given filenames (we will refer back to these filenames later):
- `ctakes.csv` (if you ran cTAKES):
  ```sql
  select encounter_ref, symptom_display from covid_symptom__symptom_ctakes_negation
  ```
- `gpt35.csv` (if you ran ChatGPT 3.5):
  ```sql
  select encounter_ref, symptom_display from covid_symptom__symptom_gpt35
  ```
- `gpt4.csv` (if you ran ChatGPT 4):
  ```sql
  select encounter_ref, symptom_display from covid_symptom__symptom_gpt4
  ```
- `docrefs.csv`:
  ```sql
  select distinct docref_id from covid_symptom__symptom_ctakes_negation
  ```
- `icd10.csv`:
  ```sql
  select encounter_ref, substring(icd10_display, 7) as symptom_display from covid_symptom__symptom_icd10
  ```
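If you'd rather script these downloads than click through the console, here is a sketch using the AWS CLI (the database, workgroup, and results bucket are placeholders for your own Athena setup; repeat per query above):

```sh
# Kick off one of the queries above
QUERY='select encounter_ref, symptom_display from covid_symptom__symptom_ctakes_negation'
QUERY_ID=$(aws athena start-query-execution \
  --query-string "$QUERY" \
  --query-execution-context Database=my_cumulus_db \
  --work-group primary \
  --result-configuration OutputLocation=s3://my-athena-results/ \
  --output text --query QueryExecutionId)

# Wait for the state to become SUCCEEDED...
aws athena get-query-execution --query-execution-id "$QUERY_ID" \
  --output text --query QueryExecution.Status.State

# ...then grab the CSV that Athena wrote to the results bucket
aws s3 cp "s3://my-athena-results/${QUERY_ID}.csv" ctakes.csv
```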
And with that, the natural language processing of notes is finished. The rest of this guide will be about setting up a chart review for human comparison with NLP.
- Install Label Studio according to their docs.
- Create a new project, named however you like.
- Skip the Data Import tab.
- On the Label Setup tab, click "Custom template" on the bottom left and enter this config:
```xml
<View>
  <Labels name="label" toName="text">
    <Label value="Congestion or runny nose" background="#100"/>
    <Label value="Cough" background="#040"/>
    <Label value="Diarrhea" background="#008"/>
    <Label value="Dyspnea" background="#b00"/>
    <Label value="Fatigue" background="#0f0"/>
    <Label value="Fever or chills" background="#40a"/>
    <Label value="Headache" background="#afa"/>
    <Label value="Loss of taste or smell" background="#f0f"/>
    <Label value="Muscle or body aches" background="#9bf"/>
    <Label value="Nausea or vomiting" background="#0aa"/>
    <Label value="Sore throat" background="#a44"/>
  </Labels>
  <Text name="text" value="$text"/>
</View>
```
Once created, you will be looking at an empty project page.
Take note of the new URL; you'll need the Label Studio project ID later (the number after `/projects/` in the URL).
- Review the Cumulus ETL `upload-notes` docs.
- You'll want to run `upload-notes` with the following options:

```sh
cumulus-etl upload-notes ... \
  <input folder with ndjson files from step 1 above> \
  <label studio url> \
  <your typical ETL PHI folder> \
  --philter=disable \
  --no-nlp \
  --anon-docrefs docrefs.csv
```

Remember to pass any other required parameters like `--ls-project` and `--ls-token` (from the linked docs above).
If your DocumentReferences hold links to EHR resources (rather than inlined data), you will also need to pass the usual ETL `--fhir-url` flag and its related authentication flags.
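Putting it all together, a full invocation might look like this sketch (every value is a placeholder: the input folder from step 1, your Label Studio URL and project ID, and your PHI folder; check the linked docs for the exact form `--ls-token` expects):

```sh
cumulus-etl upload-notes \
  ./covid-docs \
  https://labelstudio.example.org \
  s3://my-bucket/phi \
  --philter=disable \
  --no-nlp \
  --anon-docrefs docrefs.csv \
  --ls-project 4 \
  --ls-token ls-token.txt
```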
Once this is done, go to your project page in Label Studio and you should see a lot of charts.
Recruit your human annotators, give them access to Label Studio, and have them annotate the charts.
On the Label Studio project page, click "Export" and keep the default JSON format.
Save this file as `labelstudio-export.json` in a new folder.
NOTE: this file is PHI as it contains the note text. Use appropriate caution.
- Run `pip install chart-review`.
- Copy the `.csv` files from step 3 above into the same folder you used for `labelstudio-export.json` above (the "chart review folder").
- Add a new `config.yaml` file in that folder:
```yaml
labels:
  - Congestion or runny nose
  - Cough
  - Diarrhea
  - Dyspnea
  - Fatigue
  - Fever or chills
  - Headache
  - Loss of taste or smell
  - Muscle or body aches
  - Nausea or vomiting
  - Sore throat

annotators:
  human1: 1
  human2: 2
  ctakes:
    filename: ctakes.csv
  gpt35:
    filename: gpt35.csv
  gpt4:
    filename: gpt4.csv
  icd10:
    filename: icd10.csv
```
Replace `human1` and `human2` with the names of your annotators, and replace the numbers with their respective user IDs in Label Studio.
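At this point, the chart review folder should hold, at minimum, something like the following (with annotator CSVs matching whatever NLP you ran):

```sh
$ ls
config.yaml  ctakes.csv  gpt35.csv  gpt4.csv  icd10.csv  labelstudio-export.json
```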
Now, from within your chart review folder, run `chart-review accuracy human1 ctakes` (again, replacing `human1` with the annotator's name).
This will score cTAKES' performance against the ground truth of the human annotator.
You should see a chart of F1 scores, true-positive counts, and the like, which will tell you how accurate cTAKES NLP (plus negation) was. Swapping `ctakes` for `icd10` (or `gpt35`/`gpt4`) scores those sources the same way, including how accurate the ICD10 codes are.
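For example, assuming the annotator names from the config above, you could score every source in turn:

```sh
# Each run scores the second name against the ground truth of the first
chart-review accuracy human1 ctakes
chart-review accuracy human1 icd10
chart-review accuracy human1 gpt35
chart-review accuracy human1 gpt4

# You can also measure agreement between the two human annotators
chart-review accuracy human1 human2
```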
Read the chart-review documentation for more information on its features.