This project contains a set of Python scripts that transform onFeast-generated datasets into OBiBa-compliant datasets, enabling the population of the DataTools4Heart Data Catalogue.
Two different data pre-processing pipelines are required:
- Creation of a 'Features Dictionary' from an 'onFeast Dataset Metadata' file
- Creation of 'Availability Data' from an 'onFeast Features Dataset'
Features Dictionary
(INPUT to the Data Catalogue, written by this pipeline): codebook of the features (variables) included in a dataset. It contains information about each variable, including its name, value type, entity type, and various attributes. Formatted as an Excel spreadsheet (.XLS or .XLSX), the dictionary typically consists of two main components:
- Variables tab: Includes details such as the variable name, value type (integer, decimal, text), entity type (Participant, Instrument), and other properties like units, labels, and aliases.
- Categories tab: Describes the possible values or categories of categorical variables.
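As a rough illustration of the two-tab layout described above, the sketch below builds a tiny dictionary as two pandas DataFrames. The column names (`table`, `name`, `valueType`, `entityType`, `unit`, `label:en`, `variable`), the sample variables, and the helper `write_dictionary` are assumptions chosen for illustration, modelled on the properties listed above. They are not taken from the script's actual output.

```python
import pandas as pd

# Hypothetical table name and variables, for illustration only.
TABLE = "study1"

# One row per variable, mirroring the "Variables" tab.
variables = pd.DataFrame(
    [
        {"table": TABLE, "name": "age", "valueType": "integer",
         "entityType": "Participant", "unit": "years", "label:en": "Age"},
        {"table": TABLE, "name": "sex", "valueType": "text",
         "entityType": "Participant", "unit": "", "label:en": "Sex"},
    ]
)

# One row per allowed value of a categorical variable ("Categories" tab).
categories = pd.DataFrame(
    [
        {"table": TABLE, "variable": "sex", "name": "F", "label:en": "Female"},
        {"table": TABLE, "variable": "sex", "name": "M", "label:en": "Male"},
    ]
)


def write_dictionary(path: str) -> None:
    """Write both tabs to a single workbook (requires openpyxl)."""
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        variables.to_excel(writer, sheet_name="Variables", index=False)
        categories.to_excel(writer, sheet_name="Categories", index=False)


# Every category row should point at a variable declared in the Variables tab.
orphans = set(categories["variable"]) - set(variables["name"])
```

The `orphans` check is a cheap consistency test: a non-empty set would mean the Categories tab refers to a variable that the Variables tab never declares.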
onFeast Dataset Metadata
(OUTPUT of onFeast, read by this pipeline): file generated by the onFeast API containing detailed information about the extracted AI-ready datasets. Formatted as JSON, these files are linked to onFeast Datasets.
- Python 3.7+
- pandas
- openpyxl
- Clone this repository and install the required packages:
pip install -r requirements.txt
Run the script from the command line with the following syntax:
usage: python3 src/datasetMetadata_to_obibaFeaturesDict.py [-h] json_file xlsx_file table_name
Convert DT4H Dataset Metadata objects (JSON) into Obiba-compliant Variable dictionaries (Excel) for Opal
positional arguments:
json_file Path to input DT4H Dataset Metadata JSON file (dt4h_dataset_metadata format)
xlsx_file Path to output XLSX Excel file (obiba_dictionary format)
table_name Name of the Opal table to create
options:
-h, --help show this help message and exit
python3 src/datasetMetadata_to_obibaFeaturesDict.py sample_data/study1_metadata.json sample_data/study1_dict.xlsx study1
Make sure no parsing errors occurred during the transformation. To do so, open the resulting Excel file and search for the following keywords:
- "error"
- "unknown"
Manually replace them with the correct values.
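The keyword search can also be automated. The following is a minimal sketch, assuming the flagged cells contain the words "error" or "unknown" as plain text. The function names `find_flagged_cells` and `scan_workbook` are hypothetical helpers and not part of this repository.

```python
import pandas as pd

KEYWORDS = ("error", "unknown")


def find_flagged_cells(sheets, keywords=KEYWORDS):
    """Return (sheet, row, column, value) for each cell containing a keyword.

    `sheets` maps a tab name to a DataFrame. `row` is the zero-based
    DataFrame index, not the Excel row number.
    """
    hits = []
    for sheet_name, frame in sheets.items():
        for column in frame.columns:
            for row_index, value in frame[column].items():
                text = str(value).lower()  # case-insensitive match
                if any(keyword in text for keyword in keywords):
                    hits.append((sheet_name, row_index, column, value))
    return hits


def scan_workbook(path):
    """Load every tab of an Excel file and scan it (requires openpyxl)."""
    sheets = pd.read_excel(path, sheet_name=None, dtype=str)
    return find_flagged_cells(sheets)
```

Calling `scan_workbook("sample_data/study1_dict.xlsx")` on the file produced by the example command would list the cells that still need manual correction. An empty list means neither keyword was found.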
Availability Data
(INPUT to the Data Catalogue, written by this pipeline): patient-level tabular dataset describing the availability or missingness of features in a cohort dataset, such as an onFeast Features Dataset. As in 'Features Datasets', columns correspond to features (or variables) and rows to encounters. However, the values are no longer the actual health or clinical data, but only 1 or 0, indicating the presence or absence of that feature for that encounter. Format: CSV
- 1: feature present for that encounter
- 0: feature missing for that encounter
onFeast Features Dataset
(OUTPUT of onFeast, read by this pipeline): AI-ready tabular datasets containing extracted features, generated by the onFHIR API Dataset endpoints. Format: PARQUET
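The core of this transformation can be sketched in a few lines of pandas. This is a minimal illustration under stated assumptions, not the repository's implementation: it treats every column as a feature, counts a value as missing when pandas considers it null, and ignores the optional dictionary. The names `to_availability` and `convert_file` and the sample values are made up for the example.

```python
import pandas as pd


def to_availability(features: pd.DataFrame) -> pd.DataFrame:
    """Map every cell to 1 if a value is present and 0 if it is missing."""
    return features.notna().astype(int)


def convert_file(input_parquet: str, output_csv: str) -> None:
    """Read a Parquet file and write its availability matrix as CSV.

    Reading Parquet needs a Parquet engine such as pyarrow to be installed.
    """
    features = pd.read_parquet(input_parquet)
    to_availability(features).to_csv(output_csv, index=False)


# Toy dataset with made-up values: rows are encounters, columns are features.
sample = pd.DataFrame(
    {"heart_rate": [72.0, None, 88.0], "nyha_class": ["II", "III", None]}
)
availability = to_availability(sample)
```

The result has the same shape and column names as the input, so each 0 or 1 can be traced back to the feature and encounter it describes.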
- Python 3.8+
- pandas
- openpyxl
- Clone this repository and install the required packages:
pip install -r requirements.txt
Run the script from the command line with the following syntax:
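usage: python3 src/datasetFeatures_to_obibaAvailabilityData.py [-h] [--dictionary DICTIONARY] input_parquet output_csv
Transform Parquet data to availability CSV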
positional arguments:
input_parquet Path to input Parquet file
output_csv Path to output CSV file
options:
-h, --help show this help message and exit
--dictionary DICTIONARY
Optional OBiBa dictionary XLS file
python3 src/datasetFeatures_to_obibaAvailabilityData.py sample_data/center1_dataset.parquet sample_data/center1_availability.csv
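For readers who want to call the conversion from their own code, the command-line interface documented above can be reproduced with `argparse`. This is a sketch reconstructed from the help text only. The function `build_parser` is a hypothetical name, and the script's actual source may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented arguments and option."""
    parser = argparse.ArgumentParser(
        description="Transform Parquet data to availability CSV"
    )
    parser.add_argument("input_parquet", help="Path to input Parquet file")
    parser.add_argument("output_csv", help="Path to output CSV file")
    parser.add_argument("--dictionary", help="Optional OBiBa dictionary XLS file")
    return parser


# Parse the example command shown above; --dictionary is omitted, so it is None.
args = build_parser().parse_args(
    ["sample_data/center1_dataset.parquet", "sample_data/center1_availability.csv"]
)
```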