This repository contains feature extraction definitions that process patient data represented in the DT4H CDM and transform it into a tabular format suitable for training ML models. The feature extraction process is built around four main concepts: populations, feature groups, feature sets, and pipelines.
Broadly, the feature extraction suite retrieves patients' data from the FHIR patient data repository based on a population definition.
Feature groups then extract groups of raw features from specific healthcare resources such as conditions, medications, and lab measurements. For each feature group, a timeseries table is created such that:
- Each record matching the FHIR query of the feature group is mapped to a row in the table
- Each feature defined in the feature group is converted to a column in the table
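As an illustrative sketch of this record-to-row mapping (the field and feature names below are hypothetical, not the suite's actual schema), a matched FHIR Observation could be flattened like this:

```python
# Hypothetical sketch: each FHIR Observation matched by a feature group's
# query becomes one row; each defined feature becomes one column.
def observations_to_rows(observations):
    """Flatten matched FHIR Observation resources into table rows."""
    rows = []
    for obs in observations:
        rows.append({
            "pid": obs["subject"]["reference"].split("/")[-1],  # patient id
            "time": obs["effectiveDateTime"],                   # record time
            "code": obs["code"]["coding"][0]["code"],           # feature code
            "value": obs["valueQuantity"]["value"],             # raw value
        })
    return rows

obs = {
    "subject": {"reference": "Patient/p1"},
    "effectiveDateTime": "2023-05-01T10:00:00Z",
    "code": {"coding": [{"code": "8867-4"}]},  # LOINC heart rate
    "valueQuantity": {"value": 72},
}
print(observations_to_rows([obs]))
```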
In the next step, feature sets work on the timeseries data generated by the feature groups to extract the final tabular dataset. Feature sets allow the following dataset manipulations:
- Identification of reference time points that determine the data points in the final dataset
- Grouping data into configurable time periods relative to the reference time points
- Applying aggregations on the grouped data
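The three manipulations above can be sketched as follows; this is a hedged illustration with made-up column names, a hypothetical 7-day window, and synthetic values, not the suite's actual implementation:

```python
import pandas as pd

# Sketch of the feature-set stage: take a feature-group timeseries,
# keep the records falling in a fixed window before a reference time
# point, and aggregate them into one data point per patient.
ts = pd.DataFrame({
    "pid": ["p1", "p1", "p1"],
    "time": pd.to_datetime(["2023-05-01", "2023-05-03", "2023-05-20"]),
    "heart_rate": [72.0, 80.0, 90.0],
})
reference_time = pd.Timestamp("2023-05-05")  # e.g. an encounter start

# Group data into a 7-day period ending at the reference time point
window = ts[(ts["time"] <= reference_time)
            & (ts["time"] > reference_time - pd.Timedelta(days=7))]

# Apply aggregations on the grouped data
features = window.groupby("pid")["heart_rate"].agg(["mean", "max"])
print(features)
```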
Pipelines associate feature sets with populations: a dataset, as configured by the feature sets, is generated for the population specified in the pipeline.
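As a hedged illustration, a pipeline definition tying these concepts together might look like the fragment below. The field names are hypothetical; only the `study1` population and `study1-fs` feature-set names come from elsewhere in this guide:

```json
{
  "pipelines": [
    {
      "name": "study1-pipeline",
      "population": "study1",
      "featureSets": ["study1-fs"]
    }
  ]
}
```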
Looking at the current definitions, the feature groups defined so far are mainly driven by the DT4H CDM profiles: vital signs, encounters, electrocardiographs, medications, etc.
The study-features.json file contains the input (independent) and output (dependent) variables required for sub-study 1, "Medication prescription in patients with acute heart failure and chronic kidney disease or hyperkalaemia".
- Complete the deployment instructions of the data-ingestion-suite first.
After mapping the data source to the common data model, the feature extraction process can be started. DT4H feature extraction configurations are maintained in the project’s GitHub repository.
Navigate into a working directory (`<workspaceDir>`) and clone the repository:

```shell
cd <workspaceDir>
git clone https://github.com/DataTools4Heart/feature-extraction-suite
```

Run the following scripts in `<workspaceDir>`:

```shell
sh ./feature-extraction-suite/docker/pull.sh
sh ./feature-extraction-suite/docker/run.sh
```
- For feature-extraction-suite deployment, data-ingestion-suite must first be deployed successfully and the mapping must have been run. If you used the Nginx Docker container during the data-ingestion-suite deployment, update the Nginx config for feature-extraction-suite by following these steps:
```shell
# Navigate into the working directory
cd <workspaceDir>
# Stop the current proxy
./data-ingestion-suite/docker/proxy/stop.sh
# Edit the nginx.conf file
# Uncomment lines between 65-69 in:
# ./data-ingestion-suite/docker/proxy/nginx.conf
# Restart the proxy
./data-ingestion-suite/docker/proxy/run.sh
```
- Or, if your host machine is already running Nginx, insert the following proxy configuration and restart Nginx:

```nginx
location /dt4h/feast {
    proxy_pass http://onfhir-feast:8085/onfhir-feast;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
}
```
- Send a POST request to this URL to start the extraction process:
https://<hostname>/dt4h/feast/api/DataSource/myFhirServer/FeatureSet/study1-fs/Population/study1/$extract
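As a minimal sketch of triggering the extraction (the `example.org` hostname is a placeholder; the data source, feature set, and population names come from the URL above):

```python
from urllib import request

# Compose the $extract endpoint URL from its path segments and send a
# POST request to it. Replace the hostname with your deployment's host.
def build_extract_url(hostname, data_source, feature_set, population):
    return (f"https://{hostname}/dt4h/feast/api"
            f"/DataSource/{data_source}"
            f"/FeatureSet/{feature_set}"
            f"/Population/{population}/$extract")

url = build_extract_url("example.org", "myFhirServer", "study1-fs", "study1")
# request.urlopen(request.Request(url, method="POST"))  # uncomment to trigger
print(url)
```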
- The extraction process may take a long time to complete, depending on the size of the data.
- After completion, the extracted dataset file should be generated. Example file location:
<workspaceDir>/feature-extraction-suite/output-data/myFhirServer/dataset/study1-fs/<datasetId>/part-00000-550c22da-d8e3-4113-8b3a-8d935e77ee06-c000.snappy.parquet
- For statistics about the dataset, send a GET request to:
https://<hostname>/dt4h/feast/api/Dataset
Or, for a specific dataset:
https://<hostname>/dt4h/feast/api/Dataset/<datasetId>