Skip to content

oskargirardin/CRP_GDPR_datasets

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CRP_GDPR_datasets

IMPORTANT: To install the packages required with this library, the python version should not be above 3.10!

This repository contains the functionality to create synthetic tabular and time series data. To use it, clone the repository locally, and install the desired requirements file: pip install -r SingleTableRequirements.txt or pip install -r TsRequirements.txt

Afterwards, the UserGuides folder contains a guide on how to use the library contained in the src folder for both time series and tabular data generation. The main.py files in the SingleTableGeneration and TimeSeriesGeneration show a brief version of the workflow.

The data folder contains data used in the userguide and main.py files.

Structure of the library

Single Table Generation

Generator

Generates synthetic data using two different generative models, either CTGAN or GaussianCopula.

Initialize a generator object using the following function call:

generator = Generator(data, architecture, n_samples, n_epochs=None, categorical_columns=None, sensitive_columns=None))

Parameters:

  • Real data
  • Architecture (ctgan, gaussiancopula,RealTabFormer)
  • n_samples refers to the number of synthetic samples to generate.
  • Number of epochs
  • List of categorical columns
  • List of columns of privacy concerns

Attributes:

  • num_epochs: number of epochs to train, default = 200
  • num_bootstrap: number of bootstraps for RealTabFormer, default = 500, can be used to speed up process
  • n_samples: number of samples to generate
  • architecture: CTGAN, GaussianCopula or RealTabFormer The RealTabFormer can overfit the real data and generate equivalent rows if it trains for too long. This can be checked with the privacy check class described below. If too many rows are equivalent to the real data, we suggest limiting the number of epochs.
  • metadata
  • data
  • categorical_columns
  • sensitive columns

Methods:

  1. create_metadata(): This function takes in the training dataframe and outputs metadata that can be accessed through the Generator.metadata attribute. It is automatically called upon creation of the generator object but should be checked before calling the generate function described below.
generator.create_metadata()
  1. generate(): this method generates synthetic data using the chosen generative model (either CTGAN or GaussianCopula) and returns it as a pandas dataframe.
generator.generate()
  1. faker_categorical(): this method uses the Faker library to generate fake categorical data. This is not intended for use in machine learning models as correlations with the real data are not maintained. However, it can be used as an alternative for dropping sensitive data columns. Currently, the following types of data can be faked:
* ID: an identifier
* First name
* Last name
* email
* gender
* ip_address
* nationality
* city

We want to stress again that these attributes should not be used in a Machine Learning model and are purely there for anonymization purposes.

generator.faker_categorical()

Similarity Check

The SimilarityCheck class is used to check the quality of synthetic data, both visually and with metrics. It provides methods to compare the real and synthetic data, generate visual comparisons, and compare the performance of machine learning models trained on real and synthetic data.

To initialize an instance of the SimilarityCheck class, the following arguments need to be passed:

  • real_data: a Pandas dataframe containing the real data

  • synthetic_data: a Pandas dataframe containing the synthetic data

  • cat_cols: a list of categorical columns in the data (optional)

  • metadata: metadata for the data (optional), included in the generator object.

    sim_check = SimilarityCheck(real_data=my_real_dataframe,
                         synthetic_data=my_synthetic_dataframe,
                         cat_cols=my_categorical_columns,
                         metadata=metadata)
    

Methods

1. visual_comparison_columns()

This method generates visual comparisons between the real and synthetic data. It plots data in one of three ways:

  • Numeric columns are plotted using the densities
  • Categorical columns with limited (less than 5) categories are plotted with a bar plot
  • For categorical columns with more than five categories, it plots a density histogram.

The function can be calles like this:

sim_check.visual_comparison_columns()

2. comparison_columns()

This method compares the KL divergence for numerical variables.

sim_check.comparison_columns()

3. compare_correlations()

This method compares correlation matrices between the real and synthetic data.

sim_check.compare_correlations()

Privacy check

The Privacy check consists of two parts. First, we define a privacy metric based on a nearest neighbor method. Second, we include the functionality of the SDMetrics Diagnostic Report.

privacy_check = PrivacyCheck(original_data, synthetic_data, metadata)

Attributes:

  • original_data: the real dataframe
  • synthetic_data: the synthetically created dataframe
  • metadata: the metadata for the single table

Methods:

1. find_nearest_neighbours(sensitive_columns = None, verbose = True)

privacy_check.find_nearest_neighbours(sensitive_columns, verbose)

Method that finds the nearest neighbours among the real and synthetic dataset

Attributes:

  • sensitive columns: all columns that should play a role in the distance computation
  • verbose: output progress of distance computations

2. get_closest_pairs(k, display=False)

privacy_check.get_closest_pairs(k,display)

Displays the pairs that are the closest to each other. Can be used by a client to check whether the anonymization was done well enough.

Attributes:

  • k: the number of closest pairs to retrieve
  • display: print out the closest pairs?

(Based on SDMetrics Diagnostic Report: https://docs.sdv.dev/sdmetrics/reports/diagnostic-report/single-table-api)

Contains the PrivacyCheck class, which generates a report on the similarity between original and synthetic data with respect to privacy concerns.

Attributes:

  • Synthesis, Coverage, Boundaries scores
  • Real data
  • Sythetic Data
  • Datatypes of columns

Methods:

  • Create Report (-> compute Synthesis, Coverage, Boundaries scores)
  • Get details on report (return summary of report)
  • Get individual scores
  • Visualizations
  • Save report as a file

The PrivacyCheck class uses the NewRowSynthesis metric to generate the privacy report. You can customize the behavior of the report by modifying the parameters passed to this metric. For example, you can change the sensitivity threshold or the privacy model used.

You can also customize the visualizations generated by the get_visualization method by modifying the property_name argument. The available options are 'Synthesis', 'Coverage', and 'Boundaries'.

Main

This project contains a Python script main.py that generates synthetic data using the CTGAN algorithm and evaluates the similarity between the original and generated data.

Usage

  • Clone this repository to your local machine.
  • Install the required libraries using pip install -r SingleTableRequirements.txt
  • Run main.py using python main.py, do this either from the SingleTableGeneration

Time Series Generation

DataProcessor

For the PARSynthesizer that we use, the data has to be in 'long' format. The DataProcessor contains all the necessary methods to easily do this.

data_processor = DataProcessor(df, metadata = None, obs_limit = 1000, interpolate = True, drop_na_cols = True, long = False)

Attributes:

  • df: the data to process
  • metadata: the metadata of the data to process
  • obs_limit: the number of rows to use (last k observations for the time series)
  • interpolate: whether to interpolate nan values
  • drop_na_cols: whether to drop nan columns

Methods:

1. convert_to_long_format(time_columns, desired_identifiers, verbose)

data_processor.convert_to_long_format(time_columns, desired_identifiers=None, verbose = False)

Attributes:

  • time_columns: which columns order the observations?
  • desired_identifiers: a list of the columns you want to include as identifiers, and on which the model should be trained. If None, all columns will become an identifier in long format
  • verbose: if True, print the dataframe

2. get_metadata_long_df(identifier, time_column, datetime_format=None)

data_processor.get_metadata_long_df(identifier, time_column, datetime_format=None)

Attributes:

  • identifier: the sequence identifier, the columns in wide format (in long format, the Variable column)
  • time_column: orders the observations for each sequence, should be a numeric or a datetime format
  • datetime_format: in what format is the date? For example '%Y-%m-%d %H:%M:%S'.

TSGenerator

A class that can generate synthetic time series using the PARSynthesizer method available in the Synthetic Data Vault.

generator = TSGenerator(df, metadata, method='PAR', verbose=False, cuda=False)

Attributes:

  • df: the dataframe, which for the PARSynthesizer should be in long format, which can be achieved with the DataProcessor.
  • metadata: the metadata corresponding to the long dataframe
  • method: the method with which to generate time series
  • verbose: whether to print training progress
  • cuda: whether a GPU is available

Methods:

1. train(n_epochs = 100)

generator.train()

This function will train the generator on the data passed to the constructor.

Attributes:

  • n_epochs: the number of epochs to train for

2. sample(n_samples = 10, sequence_length = None)

generator.sample()

This function will sample a given amount of sequences

Attributes:

  • n_samples: the number of sequences to generate
  • sequence_length: the length of each sequence

TSSimilarityCheck

A class that will check the similarity for the time series

sim_checker = TSSimilarityCheck(df_real, df_synth, metadata)

Attributes:

  • df_real: the real time series (long format)
  • df_synthetic: the synthetic time series (long format)
  • metadata: the metadata for the real data

Methods:

1. compute_distance_matrix()

sim_checker.compute_distance_matrix()

Computes a matrix of dynamic time warping distances between each real and synthetic time series.

2. get_mean_nn_distances()

sim_checker.get_mean_nn_distances()

Get the mean DTW distance for all closest pairs

3. plot_nearest_neighbours(sequence_column = "variable", value_column = "value", time_column = "time")

sim_checker.plot_nearest_neighbours()

Function that plots the nearest (synthetic) time series for every real time series.

Attributes:

  • sequence_column: column that identifies different sequences
  • value_column: column that contains the values of the time series
  • time_column: column that identifies the time point

Usage

An example usage can be found in the UserGuide and in the main.py file in the TimeSeriesGeneration folder. The TsRequirements.txt contains the packages that should be installed.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published