The implementation of the Variational AutoEncoder (VAE) with Differential Privacy within this repository is based on work done by Dominic Danks during an NHSX Analytics Unit PhD internship (last commit to the original SynthVAE repository: commit 88a4bdf). This model card describes an updated and extended version of the model, by Harrison Wilde. Further information about the previous version created by Dom and its model implementation can be found in Section 5.4 of the associated report.
This model is intended for use in experimenting with differential privacy and VAEs.
Experiments in this repository are run against the Study to Understand Prognoses Preferences Outcomes and Risks of Treatment (SUPPORT) dataset accessed via the pycox python library. We also performed further analysis on a single table that we extracted from MIMIC-III.
A from-scratch VAE implementation was compared against various models available within the SDV framework using a variety of quality and privacy metrics on the SUPPORT dataset. The VAE was found to be competitive with all of these models across the various metrics. Differential Privacy (DP) was introduced via DP-SGD and the performance of the VAE for different levels of privacy was evaluated. It was found that as the level of Differential Privacy introduced by DP-SGD was increased, it became easier to distinguish between synthetic and real data.
Proper evaluation of quality and privacy of synthetic data is challenging. In this work, we utilised metrics from the SDV library due to their natural integration with the rest of the codebase. A valuable extension of this work would be to apply a variety of external metrics, including more advanced adversarial attacks to more thoroughly evaluate the privacy of the considered methods, including as the level of DP is varied. It would also be of interest to apply DP-SGD and/or PATE to all of the considered methods and evaluate whether the performance drop as a function of implemented privacy is similar or different across the models.
Currently the SynthVAE model only works for data which is 'clean'. I.e data that has no missingness or NaNs within its input. It can handle continuous, categorical and datetime variables. Special types such as nominal data cannot be handled properly however the model may still run. Column names have to be specified in the code for the variable group they belong to.
Hyperparameter tuning of the model can result in errors if certain parameter values are selected. Most commonly, changing learning rate in our example results in errors during training. An extensive test to evaluate plausible ranges has not been performed as of yet. If you get errors during tuning then consider your hyperparameter values and adjust accordingly.
This documentation is inspired by Model Cards for Model Reporting (Mitchell et al.) and Lessons from Archives (Jo & Gebru).