IMPORTANT: To install the packages required with this library, the python version should not be above 3.10!
This repository contains the functionality to create synthetic tabular and time series data. To use it, clone the repository locally, and install the desired requirements file: pip install -r SingleTableRequirements.txt or pip install -r TsRequirements.txt
Afterwards, the UserGuides folder contains a guide on how to use the library contained in the src folder for both time series and tabular data generation. The main.py files in the SingleTableGeneration and TimeSeriesGeneration show a brief version of the workflow.
The data folder contains data used in the userguide and main.py files.
Generates synthetic data using two different generative models, either CTGAN or GaussianCopula.
Initialize a generator object using the following function call:
generator = Generator(data, architecture, n_samples, n_epochs=None, categorical_columns=None, sensitive_columns=None))
Parameters:
- Real data
- Architecture (ctgan, gaussiancopula,RealTabFormer)
- n_samples refers to the number of synthetic samples to generate.
- Number of epochs
- List of categorical columns
- List of columns of privacy concerns
Attributes:
- num_epochs: number of epochs to train, default = 200
- num_bootstrap: number of bootstraps for RealTabFormer, default = 500, can be used to speed up process
- n_samples: number of samples to generate
- architecture: CTGAN, GaussianCopula or RealTabFormer The RealTabFormer can overfit the real data and generate equivalent rows if it trains for too long. This can be checked with the privacy check class described below. If too many rows are equivalent to the real data, we suggest limiting the number of epochs.
- metadata
- data
- categorical_columns
- sensitive columns
Methods:
- create_metadata(): This function takes in the training dataframe and outputs metadata that can be accessed through the Generator.metadata attribute. It is automatically called upon creation of the generator object but should be checked before calling the generate function described below.
generator.create_metadata()
- generate(): this method generates synthetic data using the chosen generative model (either CTGAN or GaussianCopula) and returns it as a pandas dataframe.
generator.generate()
- faker_categorical(): this method uses the Faker library to generate fake categorical data. This is not intended for use in machine learning models as correlations with the real data are not maintained. However, it can be used as an alternative for dropping sensitive data columns. Currently, the following types of data can be faked:
* ID: an identifier
* First name
* Last name
* email
* gender
* ip_address
* nationality
* city
We want to stress again that these attributes should not be used in a Machine Learning model and are purely there for anonymization purposes.
generator.faker_categorical()
The SimilarityCheck class is used to check the quality of synthetic data, both visually and with metrics. It provides methods to compare the real and synthetic data, generate visual comparisons, and compare the performance of machine learning models trained on real and synthetic data.
To initialize an instance of the SimilarityCheck class, the following arguments need to be passed:
-
real_data: a Pandas dataframe containing the real data
-
synthetic_data: a Pandas dataframe containing the synthetic data
-
cat_cols: a list of categorical columns in the data (optional)
-
metadata: metadata for the data (optional), included in the generator object.
sim_check = SimilarityCheck(real_data=my_real_dataframe, synthetic_data=my_synthetic_dataframe, cat_cols=my_categorical_columns, metadata=metadata)
Methods
1. visual_comparison_columns()
This method generates visual comparisons between the real and synthetic data. It plots data in one of three ways:
- Numeric columns are plotted using the densities
- Categorical columns with limited (less than 5) categories are plotted with a bar plot
- For categorical columns with more than five categories, it plots a density histogram.
The function can be calles like this:
sim_check.visual_comparison_columns()
2. comparison_columns()
This method compares the KL divergence for numerical variables.
sim_check.comparison_columns()
3. compare_correlations()
This method compares correlation matrices between the real and synthetic data.
sim_check.compare_correlations()
The Privacy check consists of two parts. First, we define a privacy metric based on a nearest neighbor method. Second, we include the functionality of the SDMetrics Diagnostic Report.
privacy_check = PrivacyCheck(original_data, synthetic_data, metadata)
Attributes:
- original_data: the real dataframe
- synthetic_data: the synthetically created dataframe
- metadata: the metadata for the single table
Methods:
1. find_nearest_neighbours(sensitive_columns = None, verbose = True)
privacy_check.find_nearest_neighbours(sensitive_columns, verbose)
Method that finds the nearest neighbours among the real and synthetic dataset
Attributes:
- sensitive columns: all columns that should play a role in the distance computation
- verbose: output progress of distance computations
2. get_closest_pairs(k, display=False)
privacy_check.get_closest_pairs(k,display)
Displays the pairs that are the closest to each other. Can be used by a client to check whether the anonymization was done well enough.
Attributes:
- k: the number of closest pairs to retrieve
- display: print out the closest pairs?
(Based on SDMetrics Diagnostic Report: https://docs.sdv.dev/sdmetrics/reports/diagnostic-report/single-table-api)
Contains the PrivacyCheck class, which generates a report on the similarity between original and synthetic data with respect to privacy concerns.
Attributes:
- Synthesis, Coverage, Boundaries scores
- Real data
- Sythetic Data
- Datatypes of columns
Methods:
- Create Report (-> compute Synthesis, Coverage, Boundaries scores)
- Get details on report (return summary of report)
- Get individual scores
- Visualizations
- Save report as a file
The PrivacyCheck class uses the NewRowSynthesis metric to generate the privacy report. You can customize the behavior of the report by modifying the parameters passed to this metric. For example, you can change the sensitivity threshold or the privacy model used.
You can also customize the visualizations generated by the get_visualization method by modifying the property_name argument. The available options are 'Synthesis', 'Coverage', and 'Boundaries'.
This project contains a Python script main.py that generates synthetic data using the CTGAN algorithm and evaluates the similarity between the original and generated data.
- Clone this repository to your local machine.
- Install the required libraries using pip install -r SingleTableRequirements.txt
- Run main.py using python main.py, do this either from the SingleTableGeneration
For the PARSynthesizer that we use, the data has to be in 'long' format. The DataProcessor contains all the necessary methods to easily do this.
data_processor = DataProcessor(df, metadata = None, obs_limit = 1000, interpolate = True, drop_na_cols = True, long = False)
Attributes:
- df: the data to process
- metadata: the metadata of the data to process
- obs_limit: the number of rows to use (last k observations for the time series)
- interpolate: whether to interpolate nan values
- drop_na_cols: whether to drop nan columns
Methods:
1. convert_to_long_format(time_columns, desired_identifiers, verbose)
data_processor.convert_to_long_format(time_columns, desired_identifiers=None, verbose = False)
Attributes:
- time_columns: which columns order the observations?
- desired_identifiers: a list of the columns you want to include as identifiers, and on which the model should be trained. If None, all columns will become an identifier in long format
- verbose: if True, print the dataframe
2. get_metadata_long_df(identifier, time_column, datetime_format=None)
data_processor.get_metadata_long_df(identifier, time_column, datetime_format=None)
Attributes:
- identifier: the sequence identifier, the columns in wide format (in long format, the Variable column)
- time_column: orders the observations for each sequence, should be a numeric or a datetime format
- datetime_format: in what format is the date? For example '%Y-%m-%d %H:%M:%S'.
A class that can generate synthetic time series using the PARSynthesizer method available in the Synthetic Data Vault.
generator = TSGenerator(df, metadata, method='PAR', verbose=False, cuda=False)
Attributes:
- df: the dataframe, which for the PARSynthesizer should be in long format, which can be achieved with the DataProcessor.
- metadata: the metadata corresponding to the long dataframe
- method: the method with which to generate time series
- verbose: whether to print training progress
- cuda: whether a GPU is available
Methods:
1. train(n_epochs = 100)
generator.train()
This function will train the generator on the data passed to the constructor.
Attributes:
- n_epochs: the number of epochs to train for
2. sample(n_samples = 10, sequence_length = None)
generator.sample()
This function will sample a given amount of sequences
Attributes:
- n_samples: the number of sequences to generate
- sequence_length: the length of each sequence
A class that will check the similarity for the time series
sim_checker = TSSimilarityCheck(df_real, df_synth, metadata)
Attributes:
- df_real: the real time series (long format)
- df_synthetic: the synthetic time series (long format)
- metadata: the metadata for the real data
Methods:
1. compute_distance_matrix()
sim_checker.compute_distance_matrix()
Computes a matrix of dynamic time warping distances between each real and synthetic time series.
2. get_mean_nn_distances()
sim_checker.get_mean_nn_distances()
Get the mean DTW distance for all closest pairs
3. plot_nearest_neighbours(sequence_column = "variable", value_column = "value", time_column = "time")
sim_checker.plot_nearest_neighbours()
Function that plots the nearest (synthetic) time series for every real time series.
Attributes:
- sequence_column: column that identifies different sequences
- value_column: column that contains the values of the time series
- time_column: column that identifies the time point
An example usage can be found in the UserGuide and in the main.py file in the TimeSeriesGeneration folder. The TsRequirements.txt contains the packages that should be installed.