Synthetic Data Gym (SDGym) is a framework to benchmark the performance of synthetic data generators based on SDV and SDMetrics.
Important Links | |
---|---|
💻 Website | Check out the SDV Website for more information about the project. |
📙 SDV Blog | Regular publishing of useful content about Synthetic Data Generation. |
📖 Documentation | Quickstarts, User and Development Guides, and API Reference. |
Repository | The link to the GitHub Repository of this library. |
⌨️ Development Status | This software is in its Pre-Alpha stage. |
Community | Join our Slack Workspace for announcements and discussions. |
Tutorials | Run the SDV Tutorials in a Binder environment. |
A Synthetic Data Generator is a Python function (or method) that takes as input some data, which we call the real data, learns a model from it, and outputs new synthetic data that has the same structure and similar mathematical properties as the real one.
Please refer to the synthesizers documentation for instructions about how to implement your own Synthetic Data Generator and integrate with SDGym. You can also read about how to use the ones already included in SDGym and see how to run them.
SDGym evaluates the performance of Synthetic Data Generators using single-table, multi-table and time series datasets stored as CSV files alongside an SDV Metadata JSON file.
Further details about the list of available datasets and how to add your own datasets to the collection can be found in the datasets documentation.
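As a rough illustration of the "CSV files alongside a metadata JSON file" layout, the sketch below loads such a folder using only the Python standard library. The file name `metadata.json`, the helper `load_dataset` and the metadata content are assumptions made for this example; they are not the actual SDGym loader or the real SDV Metadata schema.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_dataset(folder):
    """Load every CSV in `folder` plus a metadata JSON (hypothetical layout)."""
    folder = Path(folder)
    # Assumed file name; the real SDGym datasets may name this file differently.
    with open(folder / "metadata.json") as f:
        metadata = json.load(f)
    tables = {}
    for csv_path in sorted(folder.glob("*.csv")):
        with open(csv_path, newline="") as f:
            # One list of row dicts per table, keyed by the CSV file name.
            tables[csv_path.stem] = list(csv.DictReader(f))
    return tables, metadata


# Build a tiny throwaway dataset so the sketch runs end to end.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "demo.csv").write_text("age,smoker\n34,yes\n51,no\n")
    # Placeholder metadata, not the real SDV Metadata schema.
    (tmp / "metadata.json").write_text(json.dumps({"tables": {"demo": {}}}))
    tables, metadata = load_dataset(tmp)

print(sorted(tables), len(tables["demo"]))
```

Note that `csv.DictReader` returns every value as a string; in practice the metadata file is what tells a loader how to interpret each column.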
SDGym can be installed using the following commands:
Using pip:
pip install sdgym
Using conda:
conda install -c pytorch -c conda-forge sdgym
For more installation options, please visit the SDGym Installation Guide.
SDGym evaluates Synthetic Data Generators, which are Python functions (or classes) that take as input some data, which we call the real data, learn a model from it, and output new synthetic data that has the same structure and similar mathematical properties as the real one.
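The learn-then-sample contract described above can be sketched without any modeling library. The toy generator below is only an illustration: the function names `fit_independent` and `sample_independent` are hypothetical, the data is a plain list of dicts rather than a `pd.DataFrame`, and this is not the implementation of any synthesizer shipped with SDGym.

```python
import random


def fit_independent(rows):
    """Learn step: remember the observed values of each column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sample_independent(model, num_rows, seed=None):
    """Sample step: draw each column independently from its observed values."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in model.items()}
        for _ in range(num_rows)
    ]


# Made-up "real" data, used only to exercise the two steps.
real_data = [
    {"age": 34, "smoker": "yes"},
    {"age": 51, "smoker": "no"},
    {"age": 29, "smoker": "no"},
]
model = fit_independent(real_data)
synthetic_data = sample_independent(model, num_rows=5, seed=0)
print(synthetic_data)
```

The output has the same columns as the input and plausible per-column values, but because each column is drawn independently it ignores correlations between columns, which is exactly the kind of shortcoming a benchmark is meant to expose.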
As an example, let us define a synthesizer function that applies the GaussianCopula model from SDV with a gaussian distribution:
import numpy as np
from sdv.tabular import GaussianCopula
def create_gaussian_copula(real_data, metadata):
    # Fit a GaussianCopula model on the first table listed in the metadata.
    gc = GaussianCopula(default_distribution='gaussian')
    table_name = metadata.get_tables()[0]
    gc.fit(real_data[table_name])
    num_rows = len(real_data[table_name])
    return (table_name, num_rows, gc)


def sample_gaussian_copula(synthesizer, num_samples):
    # Sample as many rows as the real table had; num_samples is not used here.
    table_name, num_rows, gc = synthesizer
    return {table_name: gc.sample(num_rows)}
ℹ️ You can learn how to create your own synthesizer function here. |
---|
We can now try to evaluate this function on the asia and alarm datasets:
import sdgym
scores = sdgym.benchmark_single_table(
    synthesizers=(create_gaussian_copula, sample_gaussian_copula),
    sdv_datasets=['asia', 'alarm'])
ℹ️ You can learn about the different arguments of the sdgym.run function here. |
---|
The output of the sdgym.run function will be a pd.DataFrame containing the results obtained by your synthesizer on each dataset.
synthesizer | dataset | modality | metric | score | metric_time | model_time |
---|---|---|---|---|---|---|
gaussian_copula | asia | single-table | BNLogLikelihood | -2.842690 | 2.762427 | 0.752364 |
gaussian_copula | alarm | single-table | BNLogLikelihood | -20.223178 | 7.009401 | 3.173832 |
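One common next step is to summarize these results per synthesizer. The sketch below does so in plain Python, using the two rows of the table above reduced to the columns it needs; the helper `summarize` is a hypothetical name, and with the actual pd.DataFrame the same summary would more typically be a `groupby` on the `synthesizer` column.

```python
from statistics import mean

# Rows copied from the results table above, reduced to the columns used here.
results = [
    {"synthesizer": "gaussian_copula", "dataset": "asia",
     "score": -2.842690, "model_time": 0.752364},
    {"synthesizer": "gaussian_copula", "dataset": "alarm",
     "score": -20.223178, "model_time": 3.173832},
]


def summarize(rows):
    """Group rows by synthesizer and compute mean score and total model time."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["synthesizer"], []).append(row)
    return {
        name: {
            "mean_score": mean(r["score"] for r in group),
            "total_model_time": sum(r["model_time"] for r in group),
        }
        for name, group in grouped.items()
    }


summary = summarize(results)
print(summary)
```

Averaging a metric such as BNLogLikelihood across datasets is only a rough indicator, since its scale depends on the dataset; comparing synthesizers dataset by dataset is usually more informative.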
If you want to run the SDGym benchmark on the SDGym Synthesizers you can directly pass the corresponding class, or a list of classes, to the sdgym.run function.
For example, if you want to run the complete benchmark suite to evaluate all the existing synthesizers you can run (⚠️ this will take a long time!):
from sdgym.synthesizers import (
    CLBN, CopulaGAN, CTGAN, HMA1, Identity, Independent,
    MedGAN, PAR, PrivBN, SDV, TableGAN, TVAE,
    Uniform, VEEGAN)
all_synthesizers = [
    CLBN,
    CTGAN,
    CopulaGAN,
    HMA1,
    Identity,
    Independent,
    MedGAN,
    PAR,
    PrivBN,
    SDV,
    TVAE,
    TableGAN,
    Uniform,
    VEEGAN,
]
scores = sdgym.run(synthesizers=all_synthesizers)
For further details about all the arguments and possibilities that the benchmark function offers, please refer to the benchmark documentation.
- Datasets used in SDGym are detailed here.
- How to write a synthesizer is detailed here.
- How to use benchmark function is detailed here.
- Detailed leaderboard results for all the releases are available here.
The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:
- 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
- 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular, multi table and time series data.
- 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.
Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.