This is an implementation of the paper https://arxiv.org/abs/1806.09708. It also uses the code base (CCIT) for the paper https://arxiv.org/abs/1709.06138.
Dependencies:
- pytorch, with CUDA support if you have a GPU (follow the instructions on their website)
- scikit-learn
- CCIT (mentioned above)
- pandas
- numpy
Please cite the above papers if this package is used in any publication.
There are two CI testers: one uses a CGAN as the mimic function and the other uses a regression-based mimic function. The parameters to be specified are as follows:
Base class for CI testing. Not all of the parameters are used by both the GAN and the regression testers.
X,Y,Z: Arrays for input random variables
max_depths: max_depth parameter choices for xgboost e.g [6,10,13]
n_estimators: n_estimator parameter choices for xgboost e.g [100,200,300]
colsample_bytrees: colsample_bytree parameter choices for xgboost (fractions in (0, 1]) e.g [0.8]
nfold: cross validation number of folds
train_samp: percentage of samples to be used for training; use -1 for the default (recommended)
nthread: number of parallel threads for xgboost, recommended as number of processors in the machine
max_epoch: number of epochs when the mimic function is GAN
bsize: batch size when mimic function is GAN or when using a deep regressor for mimifyREG
dim_N: dimension of the noise when the mimic function is GAN. If None it is set to dim_z + 1; it can also be set to a moderate value like 20
noise: Type of noise for regression mimic function 'Laplace' or 'Normal' or 'Mixture'
perc: percentage of mixture Normal for noise type 'Mixture'
normalized: whether to normalize the data-set. The recommended setting is True for MIMIFY_REG; either setting works for GAN.
deep: bool argument for mimifyREG. If true it uses a deep network for regression otherwise it uses xgb.
deep_classifier: whether the classifier is a deep model or xgboost. Set this argument to True to use a deep model.
params: parameters for deep classifier. Example: {'nhid':20,'nlayers':5,'dropout':0.2} means 5 layers each with 20 neurons and train dropout of 0.2.
For regular use we recommend deep = False and deep_classifier = False. The deep options are still being prototyped.
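As a rough illustration of the parameters listed above, the sketch below collects them in a plain dictionary and runs a few sanity checks. The dictionary, the helper check_settings and the values marked "assumed" are hypothetical and not part of this package; how the settings are actually passed to the testers is shown in the examples notebook.

```python
import os

# Hypothetical settings dictionary. The keys are the parameter names documented
# above; values are taken from the examples in this README unless marked "assumed".
settings = {
    "max_depths": [6, 10, 13],
    "n_estimators": [100, 200, 300],
    "colsample_bytrees": [0.8],        # xgboost's colsample_bytree is a fraction in (0, 1]
    "nfold": 5,                        # assumed value
    "train_samp": -1,                  # -1 selects the default (recommended)
    "nthread": os.cpu_count() or 1,    # number of processors in the machine
    "max_epoch": 100,                  # assumed value, only relevant for the GAN tester
    "bsize": 50,                       # assumed value
    "dim_N": None,                     # None means dim_z + 1
    "noise": "Laplace",                # perc is only relevant for 'Mixture', so it is left out
    "normalized": True,
    "deep": False,                     # recommended for regular use
    "deep_classifier": False,          # recommended for regular use
    "params": {"nhid": 20, "nlayers": 5, "dropout": 0.2},
}


def check_settings(s):
    """Hypothetical sanity checks on the settings dictionary (not part of the package)."""
    if s["noise"] not in ("Laplace", "Normal", "Mixture"):
        raise ValueError("noise must be 'Laplace', 'Normal' or 'Mixture'")
    if any(not 0 < c <= 1 for c in s["colsample_bytrees"]):
        raise ValueError("colsample_bytree choices must lie in (0, 1]")
    if s["dim_N"] is not None and s["dim_N"] < 1:
        raise ValueError("dim_N must be None or a positive integer")
    return s


check_settings(settings)
```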
The usage of both files on synthetic data-sets can be seen in the IPython notebook named examples. The file run_mimify_reg.py gives command-line functionality to run mimify_reg from a structured folder. One such folder, with data files in .npy format, has been provided with the repository. An example of running this command is provided in example.sh. For mimifyGAN the same functionality is provided by run_mimify_GAN.py.
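For readers unfamiliar with the .npy format, here is a minimal sketch of writing arrays to a folder and reading them back with numpy. The file names X.npy, Y.npy and Z.npy and the flat layout are assumptions made for illustration; the layout the command-line scripts actually expect is the one in the folder shipped with the repository and in example.sh.

```python
import tempfile
from pathlib import Path

import numpy as np

rng = np.random.default_rng(0)
n = 200

# Placeholder arrays standing in for the three random variables.
arrays = {
    "X": rng.normal(size=(n, 1)),
    "Y": rng.normal(size=(n, 1)),
    "Z": rng.normal(size=(n, 3)),
}

# Write each array to its own .npy file (hypothetical file names).
folder = Path(tempfile.mkdtemp())
for name, arr in arrays.items():
    np.save(folder / f"{name}.npy", arr)

# Read every .npy file in the folder back into a dictionary keyed by file stem.
loaded = {p.stem: np.load(p) for p in sorted(folder.glob("*.npy"))}
print({name: arr.shape for name, arr in loaded.items()})
```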
The default setting is use_cuda = False in all relevant files, which means that no GPU speed-up is used. If you have pytorch with CUDA support, you need to set use_cuda = True. To do this, go to the src folder and run the following:
python change_use_cuda.py -dr 0
To change back to use_cuda = False, go to the src directory again and run the following:
python change_use_cuda.py -dr 1
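If you are unsure whether your pytorch installation has CUDA support, a small check like the following may help before switching the flag. The helper cuda_available is hypothetical and not part of this repository; it relies only on torch.cuda.is_available() and falls back to False when pytorch is not installed.

```python
import importlib.util


def cuda_available():
    """Return True only if pytorch is importable and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return bool(torch.cuda.is_available())


# -dr 0 sets use_cuda = True and -dr 1 sets use_cuda = False, as described above.
flag = 0 if cuda_available() else 1
print(f"python change_use_cuda.py -dr {flag}")
```

The printed command is the one to run from the src folder.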
The file datagen.py in the /src folder has functions to generate the synthetic data-sets used in the paper.
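To give an idea of what a synthetic conditional independence data-set looks like, here is a minimal sketch. It is not one of the functions in datagen.py: the function make_ci_data, its arguments and the specific cosine model are assumptions chosen for illustration only.

```python
import numpy as np


def make_ci_data(n=1000, dim_z=5, dependent=False, seed=0):
    """Hypothetical generator (not from datagen.py).

    Returns arrays X (n, 1), Y (n, 1) and Z (n, dim_z). X and Y are
    conditionally independent given Z unless dependent=True.
    """
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, dim_z))
    a = rng.normal(size=(dim_z, 1))
    b = rng.normal(size=(dim_z, 1))
    # X and Y each depend on Z through their own projection plus independent noise.
    X = np.cos(Z @ a) + 0.25 * rng.normal(size=(n, 1))
    Y = np.cos(Z @ b) + 0.25 * rng.normal(size=(n, 1))
    if dependent:
        # A direct contribution of X to Y breaks the conditional independence.
        Y = Y + 0.5 * X
    return X, Y, Z


X, Y, Z = make_ci_data(n=500, dim_z=3)
print(X.shape, Y.shape, Z.shape)
```

Arrays of this shape correspond to the X, Y, Z inputs described in the parameter list.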