What does this repository represent?
This repository contains the research code and scripts used to investigate the role of sample size in synthetic data. The code was specifically designed to examine the effects of variation in input (training data) and output (generated synthetic data) sample size on synthetic data veracity, privacy concealment, and utility. A more extensive description of the methodology that this repository represents can be found in the associated scientific publication: https://doi.org/10.1200/CCI.24.00056
Where have the contents of this repository been used and reported?
The role of sample size was investigated in a rare and heterogeneous healthcare demographic: adolescents and young adults with cancer. The findings of this investigation are reported in the associated scientific publication: https://doi.org/10.1200/CCI.24.00056
Can this code be re-used to investigate sample size effects in other demographics or datasets?
A large proportion of this code should be re-usable with other single-table datasets (i.e., not time-series or multi-table datasets), provided that the dataset is appropriately cleaned. However, certain components, such as data_preprocessing.py, evaluation_visualisation.py, and the utility assessment in evaluation_metric.py, were designed specifically for the aforementioned dataset and publication.
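As an illustration of what "appropriately cleaned" might mean in practice, the sketch below prepares a generic single-table pandas DataFrame. The file name and column names are purely hypothetical; the actual preprocessing for the dataset used in the publication is handled by data_preprocessing.py.

```python
import pandas as pd

# Hypothetical input file and columns, shown only to illustrate the shape of
# an "appropriately cleaned" single-table dataset.
df = pd.read_csv("my_dataset.csv")

# Keep a single flat table: drop unstructured fields, handle missing values,
# and give every column an explicit dtype.
df = df.drop(columns=["free_text_notes"], errors="ignore")
df = df.dropna(subset=["age", "diagnosis"])          # or impute, as appropriate
df["age"] = df["age"].astype(int)
df["diagnosis"] = df["diagnosis"].astype("category")

df.to_csv("cleaned_single_table.csv", index=False)
```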
How are the code and scripts in this repository to be used?
A worked example is provided in the example_exercise.ipynb Jupyter notebook. This example uses a public dataset on paediatric bone marrow transplantation developed by ... that is available through:
How was this work funded?
This work and the associated scientific publication were predominantly supported by the European Union’s Horizon 2020 research and innovation programme through The STRONG-AYA Initiative (Grant agreement ID: 101057482).
What are the main libraries that this research code relied on?
The synthetic data was generated using:
- Synthetic Data Vault (SDV) (https://github.com/sdv-dev/SDV), and
- Differentially Private Conditional Generative Adversarial Networks (DP-CGAN) (https://github.com/sunchang0124/dp_cgans).
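By way of illustration, the snippet below shows how a cleaned single-table dataset might be fed to SDV. It is a minimal sketch assuming a recent SDV 1.x release, a Gaussian copula synthesizer, and a hypothetical input file; it is not the exact configuration used in this repository (see requirements.txt and the generation scripts for the pinned versions and model choices).

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

real_data = pd.read_csv("cleaned_single_table.csv")   # hypothetical file name

# Infer column types from the data; review the detected metadata before fitting.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# Fit a synthesizer and draw a synthetic sample of a chosen output size.
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=len(real_data))
```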
The evaluations were performed using:
- prdc (https://github.com/clovaai/generative-evaluation-prdc),
- scipy (https://github.com/scipy/scipy),
- SDMetrics (https://github.com/sdv-dev/SDMetrics),
- sklearn (https://github.com/scikit-learn/scikit-learn), and
- statsmodels (https://github.com/statsmodels/statsmodels/).
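For orientation, the sketch below combines two of these libraries on a real/synthetic pair of tables: prdc for precision, recall, density, and coverage, and scipy for per-column Kolmogorov-Smirnov tests. The file names, the choice of nearest_k, and the restriction to numeric columns are illustrative assumptions; the metrics actually reported in the publication are implemented in evaluation_metric.py.

```python
import pandas as pd
from prdc import compute_prdc
from scipy import stats

real_data = pd.read_csv("cleaned_single_table.csv")    # hypothetical paths
synthetic_data = pd.read_csv("synthetic_sample.csv")

# Precision/recall/density/coverage on the numeric columns only;
# prdc expects plain (n_samples, n_features) float arrays.
numeric_cols = real_data.select_dtypes(include="number").columns
prdc_scores = compute_prdc(
    real_features=real_data[numeric_cols].to_numpy(dtype=float),
    fake_features=synthetic_data[numeric_cols].to_numpy(dtype=float),
    nearest_k=5,
)
print(prdc_scores)  # {'precision': ..., 'recall': ..., 'density': ..., 'coverage': ...}

# Per-column distributional check with a two-sample Kolmogorov-Smirnov test.
for col in numeric_cols:
    statistic, p_value = stats.ks_2samp(real_data[col], synthetic_data[col])
    print(f"{col}: KS statistic={statistic:.3f}, p={p_value:.3f}")
```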
Versions of all necessary libraries can be found in the requirements.txt file. Please note that the second branch, in which DP-CGAN was developed, requires slightly different versions of some libraries.