docs: add a basic "how to run study" doc

smart-on-fhir · Jun 17, 2024 · 66b18e4 · 66b18e4
1 parent 7099e87
commit 66b18e4
Show file tree

Hide file tree

Showing 2 changed files with 164 additions and 4 deletions.
diff --git a/README.md b/README.md
@@ -1,15 +1,17 @@
 # Cumulus Library - Covid
 
-A collection of tables for generating bioinfomatics data for studying COVID-19 symptoms. 
-Part of the [SMART on FHIR Cumulus Project](https://smarthealthit.org/cumulus-a-universal-sidecar-for-a-smart-learning-healthcare-system/)
+A collection of tables for generating bioinformatics data for studying COVID-19 symptoms.
+Part of the [SMART on FHIR Cumulus Project](https://smarthealthit.org/cumulus/).
 
-For more information, [browse the documentation](https://docs.smarthealthit.org/cumulus/library).
+For more information, browse the [Cumulus Library documentation](https://docs.smarthealthit.org/cumulus/library).
 
 ## Usage
 
 To install the module, simply run `pip install cumulus-library-covid`.
 
-This will add a `covid_symptoms` study target to `cumulus-library`.
+This will add a `covid_symptom` study target to `cumulus-library`.
+
+See [RUNNING.md](RUNNING.md) for more details.
 
 ## Publications
 

diff --git a/RUNNING.md b/RUNNING.md
@@ -0,0 +1,158 @@
+# Running the COVID study
+
+This guide will help you reproduce the COVID study from scratch.
+
+This includes not only the SQL in this Cumulus Library study,
+but also the chart review side of things.
+
+## Prerequisites
+
+- An existing Cumulus stack, with an already-built `core` study.
+  - See the general [Cumulus documentation](https://docs.smarthealthit.org/cumulus/)
+    for setting that up.
+- Install this repo with `pip install cumulus-library-covid`
+
+## 1. Prepare your data
+
+This study operates on DocumentReference resources
+(it runs NLP on the referenced clinical notes).
+So we need to make sure you've got those handy.
+
+Gather some DocumentReference ndjson from your EHR.
+You can either re-export the documents of interest,
+or use ndjson from a previous export.
+
+If you are choosing a subset of documents,
+make sure to pull resources between March 2020 and June 2022.
+That's the study period of interest.
+
+Place the ndjson in a folder using filenames like `*.DocumentReference.*.ndjson`.
+
+## 2. Run the ETL & Library study
+
+- There are [separate instructions](https://docs.smarthealthit.org/cumulus/etl/studies/covid-symptom.html)
+  for running the ETL and this COVID study's NLP
+- You should probably re-run your Cumulus AWS Glue crawler at this point,
+  to pick up this new NLP table and its schema.
+- Then run this study with [cumulus-library](https://docs.smarthealthit.org/cumulus/library/)
+  like so: `cumulus-library ... -t covid_symptom`
+
+You should now have all the interesting results sitting in Athena.
+
+## 3. Export from Athena
+
+In Athena's web console, run these commands and download the CSV results,
+using the given filenames (we will refer back to these filenames later):
+- **ctakes.csv**: `select encounter_ref, symptom_display from covid_symptom__symptom_ctakes_negation`
+- **docrefs.csv**: `select distinct docref_id from covid_symptom__symptom_ctakes_negation`
+- **icd10.csv**: `select encounter_ref, substring(icd10_display, 7) as symptom_display from covid_symptom__symptom_icd10`
+
+And with that, the natural language processing of notes is finished.
+The rest of this guide will be about setting up a chart review for human comparison with NLP.
+
+## 4. Configure Label Studio
+
+- Install Label Studio according to [their docs](https://labelstud.io/guide/install.html).
+- Create a new project, named however you like.
+  - Skip the Data Import tab.
+  - On the Label Setup tab, click "Custom template" on the bottom left and enter this config:
+```
+<View>
+  <Labels name="label" toName="text">
+    <Label value="Congestion or runny nose" background="#100"/>
+    <Label value="Cough" background="#040"/>
+    <Label value="Diarrhea" background="#008"/>
+    <Label value="Dyspnea" background="#b00"/>
+    <Label value="Fatigue" background="#0f0"/>
+    <Label value="Fever or chills" background="#40a"/>
+    <Label value="Headache" background="#afa"/>
+    <Label value="Loss of taste or smell" background="#f0f"/>
+    <Label value="Muscle or body aches" background="#9bf"/>
+    <Label value="Nausea or vomiting" background="#0aa"/>
+    <Label value="Sore throat" background="#a44"/>
+  </Labels>
+  <Text name="text" value="$text"/>
+</View>
+```
+
+Once created, you will be looking at an empty project page.
+Take note of the new URL, you'll need to know the Label Studio project ID later
+(the number after `/projects/` in the URL).
+
+## 5. Upload notes to Label Studio
+
+- Review the Cumulus ETL [upload-notes docs](https://docs.smarthealthit.org/cumulus/etl/chart-review.html)
+- You'll want to run `upload-notes` with the following options:
+```shell
+cumulus-etl upload-notes ... \
+  <input folder with ndjson files from step 1 above> \
+  <label studio url> \
+  <your typical ETL PHI folder> \
+  --philter=disable \
+  --no-nlp \
+  --anon-docrefs docrefs.csv
+```
+
+Remember to pass any other required parameters like `--ls-project` and `--ls-token`
+(from the linked docs above).
+If your DocumentReferences hold links to EHR resources (rather than inlined data),
+you will also need to pass the usual ETL `--fhir-url` flag and its related authentication flags.
+
+Once this is done, go to your project page in Label Studio and you should see a lot of charts.
+
+## 6. Have subject-matter experts review the uploaded charts
+
+Give them access to Label Studio and have them annotate the charts.
+
+## 7. Export the annotated charts from Label Studio
+
+On the Label Studio project page, click "Export" and keep the default JSON format.
+
+Save this file as `labelstudio-export.json` in a new folder.
+
+**NOTE: this file is PHI as it contains the note text.** Use appropriate caution.
+
+## 8. Set up `chart-review`
+
+- Run `pip install chart-review`
+- Copy `ctakes.csv` and `icd10.csv` from step 3 above into the same folder
+  you used for `labelstudio-export.json` above (the "chart review folder").
+- Add a new `config.yaml` file in that folder:
+```yaml
+labels:
+  - Congestion or runny nose
+  - Cough
+  - Diarrhea
+  - Dyspnea
+  - Fatigue
+  - Fever or chills
+  - Headache
+  - Loss of taste or smell
+  - Muscle or body aches
+  - Nausea or vomiting
+  - Sore throat
+
+annotators:
+  human1: 1
+  human2: 2
+  ctakes:
+    filename: ctakes.csv
+  icd10:
+    filename: icd10.csv
+```
+
+Replace `human1` and `human2` with the names of your annotators
+and replace the numbers there with their respective user IDs in Label Studio.
+
+## 9. Run `chart-review`
+
+Now from within your chart review folder,
+run `chart-review accuracy human1 ctakes` (again, replacing `human1` with the annotator's name).
+This will score cTAKES' performance against the ground truth of the human annotator.
+
+You should see a chart of F1 scores, true-positive counts, etc.
+Which will tell you how accurate cTAKES NLP (plus negation) is,
+as well as how accurate ICD10 codes are.
+
+Read the [chart-review documentation](https://docs.smarthealthit.org/cumulus/chart-review/)
+for more information on its features.