Dataset preparation toolkit for DataTools4Heart Data Catalogue

This project contains a set of Python scripts that transform onFeast-generated datasets into Obiba-compliant datasets, enabling the population of the DataTools4Heart Data Catalogue.

Two different data pre-processing pipelines are required:

- Creation of a 'Features Dictionary' from 'onFeast Dataset Metadata'
- Creation of 'Availability Data' from an 'onFeast Features Dataset'

Creation of a 'Features Dictionary' from 'onFeast Dataset Metadata'

Features Dictionary (OUTPUT): codebook of the features (variables) included in a dataset. It contains information about each variable, including its name, value type, entity type, and various attributes. Formatted as an Excel spreadsheet (.XLS or .XLSX), the dictionary typically consists of two main components:
- Variables tab: includes details such as the variable name, value type (integer, decimal, text), entity type (Participant, Instrument), and other properties like units, labels, and aliases.
- Categories tab: describes the possible values or categories of categorical variables.
onFeast Dataset Metadata (INPUT): file generated by the onFeast API containing detailed information about the extracted AI-ready datasets. Formatted as JSON and linked to onFeast Datasets.
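
To make the two-tab layout concrete, here is a minimal sketch that builds the rows of a Variables tab and a Categories tab from a small in-memory metadata object. The metadata keys (`features`, `dataType`, `allowedValues`), the column names, and the helper name `build_dictionary_rows` are assumptions chosen for illustration; the actual onFeast JSON schema and the columns written by the script may differ.

```python
# Hypothetical metadata object: the real onFeast JSON schema is not shown in
# this README, so every key below is an illustrative assumption.
metadata = {
    "features": [
        {"name": "age", "dataType": "integer", "unit": "years"},
        {"name": "sex", "dataType": "text", "allowedValues": ["female", "male"]},
    ]
}


def build_dictionary_rows(metadata, table_name, entity_type="Participant"):
    """Return (variables, categories) as lists of row dicts, one list per tab."""
    variables, categories = [], []
    for feature in metadata["features"]:
        variables.append({
            "table": table_name,
            "name": feature["name"],
            # "unknown" as a fallback marker is an assumption of this sketch.
            "valueType": feature.get("dataType", "unknown"),
            "entityType": entity_type,
            "unit": feature.get("unit", ""),
        })
        # Only categorical variables contribute rows to the Categories tab.
        for value in feature.get("allowedValues", []):
            categories.append({
                "table": table_name,
                "variable": feature["name"],
                "name": value,
            })
    return variables, categories


variables, categories = build_dictionary_rows(metadata, "study1")
print(len(variables), "variables,", len(categories), "categories")
```

Each list could then be written to its own sheet, for example with pandas via `pd.DataFrame(rows).to_excel(writer, sheet_name="Variables", index=False)` inside a `pd.ExcelWriter` context.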

Requirements

Python 3.7+
pandas
openpyxl

Installation

Clone this repository and install the required packages:

pip install -r requirements.txt

Usage

Run the script from the command line with the following syntax:

usage: python3 src/datasetMetadata_to_obibaFeaturesDict.py [-h] json_file xlsx_file table_name

Convert DT4H Dataset Metadata objects (JSON) into Obiba-compliant Variable dictionaries (Excel) for Opal

positional arguments:
  json_file   Path to input DT4H Dataset Metadata JSON file (dt4h_dataset_metadata format)
  xlsx_file   Path to output XLSX Excel file (obiba_dictionary format)
  table_name  Name of the Opal table to create

options:
  -h, --help  show this help message and exit

Example

python3 src/datasetMetadata_to_obibaFeaturesDict.py sample_data/study1_metadata.json sample_data/study1_dict.xlsx study1

Validate data dictionary

Make sure no parsing errors occurred during the transformation. To do so, open the resulting Excel file and search for the following keywords:

"error"
"unknown"

Manually replace them with the correct values.
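
The manual search can be complemented with a quick scripted check. The sketch below scans cell values for the two keywords and reports where they occur. The helper names `find_placeholders` and `scan_workbook` are hypothetical and not part of this repository, and the sketch assumes the keywords appear as case-insensitive substrings of text cells.

```python
KEYWORDS = ("error", "unknown")


def find_placeholders(sheet_name, rows, keywords=KEYWORDS):
    """Return (sheet, row, column, value) for every text cell containing a keyword.

    `rows` is any iterable of row tuples; row and column numbers are 1-based.
    """
    hits = []
    for row_idx, row in enumerate(rows, start=1):
        for col_idx, value in enumerate(row, start=1):
            if isinstance(value, str) and any(k in value.lower() for k in keywords):
                hits.append((sheet_name, row_idx, col_idx, value))
    return hits


def scan_workbook(path):
    """Scan every sheet of an Excel file (needs openpyxl, imported lazily)."""
    from openpyxl import load_workbook

    workbook = load_workbook(path, read_only=True)
    hits = []
    for sheet in workbook.worksheets:
        hits.extend(find_placeholders(sheet.title, sheet.iter_rows(values_only=True)))
    return hits


# Small in-memory demonstration that does not need an Excel file.
demo_rows = [("name", "valueType"), ("age", "integer"), ("bmi", "unknown")]
print(find_placeholders("Variables", demo_rows))
```

Calling `scan_workbook("sample_data/study1_dict.xlsx")` would list the cells to review in the file produced by the example above. Because the match is a substring match, a legitimate value that merely contains one of the keywords is also reported, so each hit still needs a manual look.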

Creation of 'Availability Data' from an 'onFeast Features Dataset'

Availability Data (OUTPUT): patient-level tabular dataset describing the availability or missingness of features in a cohort dataset, such as an onFeast Features Dataset. As in Features Datasets, columns correspond to features (variables) and rows to encounters. However, the values are no longer the actual health or clinical data, only 0 or 1, indicating the absence or presence of that feature for that encounter. Format: CSV
- 1: feature present for that encounter
- 0: feature missing for that encounter
onFeast Features Dataset (INPUT): AI-ready tabular dataset containing extracted features, generated by the onFHIR API Dataset endpoints. Format: PARQUET
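
The transformation described above amounts to replacing each value with a presence flag. The sketch below shows the idea on plain Python dictionaries so that it runs without extra dependencies; it is not the script's implementation. It assumes that a missing value is represented as `None` or NaN, and it treats every column as a feature (whether the script preserves identifier columns is not stated here). The feature names and values are made up.

```python
import csv
import io
import math


def is_present(value):
    """A value counts as present unless it is None or a float NaN (assumption)."""
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    return True


def to_availability(rows):
    """Map each encounter row to 1 (feature present) or 0 (feature missing)."""
    return [{col: int(is_present(val)) for col, val in row.items()} for row in rows]


def to_csv_text(rows):
    """Serialise availability rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Two hypothetical encounters with made-up feature names.
encounters = [
    {"age": 71, "lvef": 35.0, "nt_probnp": None},
    {"age": 64, "lvef": float("nan"), "nt_probnp": 1200.0},
]
availability = to_availability(encounters)
print(to_csv_text(availability))
```

With pandas, the same mapping on a DataFrame `df` loaded through `pd.read_parquet` can be written as `df.notna().astype(int)`.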

Requirements

Python 3.8+
pandas
openpyxl

Installation

Clone this repository and install the required packages:

pip install -r requirements.txt

Usage

Run the script from the command line with the following syntax:

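usage: python3 src/datasetFeatures_to_obibaAvailabilityData.py [-h] [--dictionary DICTIONARY] input_parquet output_csv

Transform Parquet data to availability CSV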

positional arguments:
  input_parquet         Path to input Parquet file
  output_csv            Path to output CSV file

options:
  -h, --help            show this help message and exit
  --dictionary DICTIONARY
                        Optional OBiBa dictionary XLS file
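
The help text above has the shape produced by Python's `argparse`. A parser that accepts the same arguments could look like the sketch below; this is a reconstruction from the help text, not the script's actual source, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser mirroring the positional arguments and option listed above."""
    parser = argparse.ArgumentParser(
        description="Transform Parquet data to availability CSV"
    )
    parser.add_argument("input_parquet", help="Path to input Parquet file")
    parser.add_argument("output_csv", help="Path to output CSV file")
    parser.add_argument("--dictionary", help="Optional OBiBa dictionary XLS file")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["sample_data/center1_dataset.parquet", "sample_data/center1_availability.csv"]
)
print(args.input_parquet, args.output_csv, args.dictionary)
```

When `--dictionary` is omitted, `args.dictionary` is `None`, which matches the option being described as optional.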

Example

python3 src/datasetFeatures_to_obibaAvailabilityData.py sample_data/center1_dataset.parquet sample_data/center1_availability.csv
