DataTools4Heart/data-catalogue-preprocessing-toolkit

Dataset preparation toolkit for DataTools4Heart Data Catalogue

This project contains a set of Python scripts that transform onFeast-generated datasets into Obiba-compliant datasets, so that they can be used to populate the DataTools4Heart Data Catalogue.

Two different data pre-processing pipelines are required:

  • Creation of a 'Features Dictionary' from an 'onFeast Dataset Metadata'
  • Creation of 'Availability Data' from an 'onFeast Features Dataset'

Creation of a 'Features Dictionary' from an 'onFeast Dataset Metadata'

  • Features Dictionary (OUTPUT): codebook of the features (variables) included in a dataset. It contains information about each variable, including its name, value type, entity type, and various attributes. Formatted as an Excel spreadsheet (.XLS or .XLSX), the dictionary typically consists of two main components:
    • Variables tab: includes details such as the variable name, value type (integer, decimal, text), entity type (Participant, Instrument), and other properties like units, labels, and aliases.
    • Categories tab: describes the possible values (categories) of categorical variables.
  • onFeast Dataset Metadata (INPUT): file generated by the onFeast API containing detailed information about the extracted AI-ready datasets. Formatted as JSON and linked to the corresponding onFeast Datasets.
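
As a rough illustration of the two-tab layout described above, the sketch below turns a small metadata object into rows for a Variables sheet and a Categories sheet. The JSON field names (`features`, `name`, `type`, `categories`, ...) and the helper names are hypothetical, since the onFeast Dataset Metadata schema is not shown in this README. The column headers follow the usual Opal dictionary layout but should be checked against the toolkit's actual output. It is not the toolkit's implementation.

```python
import json

# Hypothetical metadata snippet; the real onFeast schema may differ.
SAMPLE_METADATA = json.loads("""
{
  "features": [
    {"name": "age", "type": "integer", "label": "Age at encounter", "unit": "years"},
    {"name": "sex", "type": "text", "label": "Sex",
     "categories": [{"value": "M", "label": "Male"},
                    {"value": "F", "label": "Female"}]}
  ]
}
""")


def build_dictionary_rows(metadata, table_name, entity_type="Participant"):
    """Return (variables, categories) as lists of dicts, one dict per sheet row."""
    variables, categories = [], []
    for feature in metadata.get("features", []):
        variables.append({
            "table": table_name,
            "name": feature["name"],
            # "unknown" is an assumed placeholder for a missing type.
            "valueType": feature.get("type", "unknown"),
            "entityType": entity_type,
            "unit": feature.get("unit", ""),
            "label:en": feature.get("label", ""),
        })
        # Only categorical features contribute rows to the Categories sheet.
        for category in feature.get("categories", []):
            categories.append({
                "table": table_name,
                "variable": feature["name"],
                "name": str(category["value"]),
                "missing": 0,
                "label:en": category.get("label", ""),
            })
    return variables, categories


def write_dictionary(variables, categories, xlsx_path):
    """Write both sheets to one workbook (requires pandas and openpyxl)."""
    import pandas as pd  # imported lazily so the row-building part has no dependencies

    with pd.ExcelWriter(xlsx_path, engine="openpyxl") as writer:
        pd.DataFrame(variables).to_excel(writer, sheet_name="Variables", index=False)
        pd.DataFrame(categories).to_excel(writer, sheet_name="Categories", index=False)


variables, categories = build_dictionary_rows(SAMPLE_METADATA, "study1")
```

Calling `write_dictionary(variables, categories, "study1_dict.xlsx")` would then produce a workbook with one row per variable and one row per category value.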

Requirements

  • Python 3.7+
  • pandas
  • openpyxl

Installation

  1. Clone this repository and install the required packages:
pip install -r requirements.txt

Usage

Run the script from the command line with the following syntax:

usage: python3 src/datasetMetadata_to_obibaFeaturesDict.py [-h] json_file xlsx_file table_name

Convert DT4H Dataset Metadata objects (JSON) into Obiba-compliant Variable dictionaries (Excel) for Opal

positional arguments:
  json_file   Path to input DT4H Dataset Metadata JSON file (dt4h_dataset_metadata format)
  xlsx_file   Path to output XLSX Excel file (obiba_dictionary format)
  table_name  Name of the Opal table to create

options:
  -h, --help  show this help message and exit

Example

python3 src/datasetMetadata_to_obibaFeaturesDict.py sample_data/study1_metadata.json sample_data/study1_dict.xlsx study1

Validate data dictionary

Make sure no parsing errors have occurred during the transformation. To do so, open the resulting Excel file and search for the following keywords:

  • "error"
  • "unknown"

Manually replace them with the correct values.
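
The keyword search can also be scripted. The following is a minimal sketch, not part of the toolkit: the function names are hypothetical, and it assumes the first row of each sheet is a header. It reports every cell whose text contains one of the keywords, so that the cells can then be corrected by hand.

```python
KEYWORDS = ("error", "unknown")


def find_flagged_cells(sheets, keywords=KEYWORDS):
    """Yield (sheet, row_number, column, value) for cells containing a keyword.

    `sheets` maps a sheet name to a list of row dicts (column name -> value).
    Matching is a case-insensitive substring test, so legitimate text that
    happens to contain a keyword is reported too and needs a manual look.
    """
    for sheet_name, rows in sheets.items():
        # Row 1 is assumed to be the header, so data rows start at 2.
        for row_number, row in enumerate(rows, start=2):
            for column, value in row.items():
                text = str(value).lower()
                if any(keyword in text for keyword in keywords):
                    yield sheet_name, row_number, column, value


def load_sheets(xlsx_path):
    """Read every sheet of the workbook as text (requires pandas and openpyxl)."""
    import pandas as pd

    frames = pd.read_excel(xlsx_path, sheet_name=None, dtype=str)
    return {name: frame.to_dict("records") for name, frame in frames.items()}


# In-memory example; for a real file use find_flagged_cells(load_sheets(path)).
sample = {
    "Variables": [
        {"name": "age", "valueType": "integer"},
        {"name": "bmi", "valueType": "unknown"},
    ]
}
hits = list(find_flagged_cells(sample))
```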

Creation of 'Availability Data' from an 'onFeast Features Dataset'

  • Availability Data (OUTPUT): patient-level tabular dataset describing the availability or missingness of features in a cohort dataset, such as an onFeast Features Dataset. As in a Features Dataset, columns correspond to features (variables) and rows to encounters. However, the values are no longer the actual health or clinical data, only 0 or 1 indicating the absence or presence of that feature for that encounter. Format: CSV
    • 1: feature present for that encounter
    • 0: feature missing for that encounter
  • onFeast Features Dataset (INPUT): AI-ready tabular dataset containing the extracted features, generated by the onFHIR API Dataset endpoints. Format: PARQUET
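
The 0/1 encoding described above can be sketched as follows. This is an illustration under assumptions, not the toolkit's code: the function names are hypothetical, every column is encoded (the real script may treat identifier columns differently), and a value counts as missing when it is `None` or a float NaN. With pandas alone, `df.notna().astype(int)` expresses the same idea.

```python
import csv
import math


def is_present(value):
    """Treat None and float NaN as missing; everything else as present."""
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    return True


def write_availability_csv(records, columns, out):
    """Write a header row plus one 0/1 row per record to the file object `out`."""
    writer = csv.writer(out)
    writer.writerow(columns)
    for record in records:
        writer.writerow([1 if is_present(record.get(col)) else 0 for col in columns])


def parquet_to_availability(parquet_path, csv_path):
    """Read a Parquet file and write its availability CSV.

    Requires pandas plus a Parquet engine such as pyarrow.
    """
    import pandas as pd

    frame = pd.read_parquet(parquet_path)
    # Normalise pandas missing markers (NaN, NaT, NA) to None before encoding.
    frame = frame.astype(object).where(frame.notna(), None)
    with open(csv_path, "w", newline="") as out:
        write_availability_csv(frame.to_dict("records"), list(frame.columns), out)
```

For example, a record `{"age": 63, "bmi": None}` is written as the row `1,0` under the header `age,bmi`.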

Requirements

  • Python 3.8+
  • pandas
  • openpyxl

Installation

  1. Clone this repository and install the required packages:
pip install -r requirements.txt

Usage

Run the script from the command line with the following syntax:

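usage: python3 src/datasetFeatures_to_obibaAvailabilityData.py [-h] [--dictionary DICTIONARY] input_parquet output_csv

Transform Parquet data to availability CSV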

positional arguments:
  input_parquet         Path to input Parquet file
  output_csv            Path to output CSV file

options:
  -h, --help            show this help message and exit
  --dictionary DICTIONARY
                        Optional OBiBa dictionary XLS file

Example

python3 src/datasetFeatures_to_obibaAvailabilityData.py sample_data/center1_dataset.parquet sample_data/center1_availability.csv
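
For readers who want to see how the arguments fit together, the help text above is consistent with an `argparse` parser along these lines. This is a reconstruction from the help output only; the script's actual parser may be organised differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser matching the documented positional and optional arguments."""
    parser = argparse.ArgumentParser(
        description="Transform Parquet data to availability CSV"
    )
    parser.add_argument("input_parquet", help="Path to input Parquet file")
    parser.add_argument("output_csv", help="Path to output CSV file")
    # Optional: when omitted, args.dictionary is None.
    parser.add_argument("--dictionary", help="Optional OBiBa dictionary XLS file")
    return parser


# Parse the example invocation from this README instead of sys.argv.
args = build_parser().parse_args([
    "sample_data/center1_dataset.parquet",
    "sample_data/center1_availability.csv",
])
```

Passing `--dictionary` followed by a file path before or after the positional arguments would set `args.dictionary` to that path.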
