A project designed to extract structured data from clinical text-based notes
This repository contains SQL and Python code written to capture and parse structured data from clinical text-based notes from an electronic medical record (EMR).
The SQL code identifies cases based on the presence of a text string embedded in the note template, restricted to specific note types. The initial versions are designed to query specific fields in Clarity (Epic), but the general method can be adapted for use with other relational databases.
The Python code takes the results of the SQL query and extracts structured data from the clinical note by identifying a simple repeating motif within the note generated by a purpose-built note template. The motif has the following structure for each field:
For example:
Hair color: 3 - Brown varHairColor;
As the enumerations generated by the custom note template are designed to have an integer key and a non-integer text value, these can be parsed individually or together as a key-value pair.
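As a rough illustration of parsing a field as a key-value pair, the sketch below uses a regular expression. It is not the repository's parser. The motif layout (label, colon, integer key, dash, text value, variable name, semicolon) and the `var` prefix on variable names are assumptions inferred from the single hair color example above, and `MOTIF` and `parse_fields` are hypothetical names.

```python
import re

# Assumed motif, inferred from the one example in this README:
#   <label>: <integer key> - <text value> <variable name>;
# The "var" prefix on variable names is also an assumption.
MOTIF = re.compile(
    r"(?P<label>[^:;\n]+):\s*"   # field label, e.g. "Hair color"
    r"(?P<key>\d+)\s*-\s*"       # integer key, e.g. 3
    r"(?P<value>[^;\n]+?)\s+"    # text value, e.g. "Brown"
    r"(?P<var>var\w+)\s*;"       # variable name, e.g. varHairColor
)

def parse_fields(note_text):
    """Return one dict per motif occurrence found in the note text."""
    fields = []
    for match in MOTIF.finditer(note_text):
        fields.append({
            "variable": match.group("var"),
            "label": match.group("label").strip(),
            "key": int(match.group("key")),      # integer part of the enumeration
            "value": match.group("value").strip(),  # text part of the enumeration
        })
    return fields

example = "Hair color: 3 - Brown varHairColor;"
parsed = parse_fields(example)
print(parsed)
```

Keeping both `key` and `value` in the result lets a caller choose the integer, the text, or the pair, depending on what the downstream analysis needs.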
If the parser is set up to extract only the integer key from each enumeration string (as the initial versions of the Python code included here do), the resulting output looks like this:
subject  hairColor
1        3
2        2
3        1
4        3
...
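A minimal sketch of how integer-only rows like those above could be assembled, one row per subject. The motif layout, the `var` prefix, and the mapping from `varHairColor` to the column name `hairColor` are assumptions inferred from the examples in this README. The names `KEY_AND_VAR`, `column_name`, and `integer_row` are hypothetical, and the note texts other than "3 - Brown" are invented placeholders.

```python
import re

# Assumed motif: "<label>: <integer key> - <text value> <variable name>;"
# Captures only the integer key and the variable name minus its "var" prefix.
KEY_AND_VAR = re.compile(r":\s*(\d+)\s*-\s*[^;\n]+?\s+var(\w+)\s*;")

def column_name(var_suffix):
    """Map e.g. "HairColor" to "hairColor" (assumed naming convention)."""
    return var_suffix[:1].lower() + var_suffix[1:]

def integer_row(subject, note_text):
    """Build one output row holding only the integer keys found in a note."""
    row = {"subject": subject}
    for key, var_suffix in KEY_AND_VAR.findall(note_text):
        row[column_name(var_suffix)] = int(key)
    return row

# Hypothetical note fragments; the text values for keys 1 and 2 are made up.
notes = {
    1: "Hair color: 3 - Brown varHairColor;",
    2: "Hair color: 2 - Blond varHairColor;",
    3: "Hair color: 1 - Black varHairColor;",
    4: "Hair color: 3 - Brown varHairColor;",
}

rows = [integer_row(subject, text) for subject, text in notes.items()]
for row in rows:
    print(row["subject"], row["hairColor"])
```

Discarding the text value keeps the output compact and numeric, but the key-to-text mapping then has to be documented elsewhere, for example in a data dictionary.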