make release-tag: Merge branch 'master' into stable
csala committed Jan 27, 2020
2 parents 134547a + 8de21bd commit cdb42ac
Showing 11 changed files with 84 additions and 26 deletions.
1 change: 1 addition & 0 deletions AUTHORS.rst
@@ -10,3 +10,4 @@ Contributors
------------

* Carles Sala <[email protected]>
* Kevin Kuo <[email protected]>
4 changes: 2 additions & 2 deletions CONTRIBUTING.rst
@@ -232,6 +232,6 @@ or in command line::
pip install 'ctgan>=X.Y.Z.dev'


.. _GitHub issues page: https://github.com/DAI-Lab/CTGAN/issues
.. _Travis Build Status page: https://travis-ci.org/DAI-Lab/CTGAN/pull_requests
.. _GitHub issues page: https://github.com/sdv-dev/CTGAN/issues
.. _Travis Build Status page: https://travis-ci.org/sdv-dev/CTGAN/pull_requests
.. _Google docstrings style: https://google.github.io/styleguide/pyguide.html?showone=Comments#Comments
16 changes: 14 additions & 2 deletions HISTORY.md
@@ -1,14 +1,26 @@
# History

## v0.2.1 - 2020-01-27

Minor version including changes to ensure the logs are properly printed and
the option to disable the log transformation to the discrete column frequencies.

Special thanks to @kevinykuo for the contributions!

### Issues Resolved:

* Option to sample from true data frequency instead of logged frequency - [Issue #16](https://github.com/sdv-dev/CTGAN/issues/16) by @kevinykuo
* Flush stdout buffer for epoch updates - [Issue #14](https://github.com/sdv-dev/CTGAN/issues/14) by @kevinykuo

## v0.2.0 - 2019-12-18

Reorganization of the project structure with a new Python API, new Command Line Interface
and increased data format support.

### Issues Resolved:

* Reorganize the project structure - [Issue #10](https://github.com/DAI-Lab/CTGAN/issues/10) by @csala
* Move epochs to the fit method - [Issue #5](https://github.com/DAI-Lab/CTGAN/issues/5) by @csala
* Reorganize the project structure - [Issue #10](https://github.com/sdv-dev/CTGAN/issues/10) by @csala
* Move epochs to the fit method - [Issue #5](https://github.com/sdv-dev/CTGAN/issues/5) by @csala

## v0.1.0 - 2019-11-07

36 changes: 24 additions & 12 deletions README.md
@@ -1,26 +1,26 @@
<p align="left">
<img width=15% src="https://dai.lids.mit.edu/wp-content/uploads/2018/06/Logo_DAI_highres.png" alt=DAI-Lab />
<img width=15% src="https://dai.lids.mit.edu/wp-content/uploads/2018/06/Logo_DAI_highres.png" alt=sdv-dev />
<i>An open source project from Data to AI Lab at MIT.</i>
</p>

[![PyPI Shield](https://img.shields.io/pypi/v/ctgan.svg)](https://pypi.python.org/pypi/ctgan)
[![Travis CI Shield](https://travis-ci.org/DAI-Lab/CTGAN.svg?branch=master)](https://travis-ci.org/DAI-Lab/CTGAN)
[![Travis CI Shield](https://travis-ci.org/sdv-dev/CTGAN.svg?branch=master)](https://travis-ci.org/sdv-dev/CTGAN)
[![Downloads](https://pepy.tech/badge/ctgan)](https://pepy.tech/project/ctgan)
[![Coverage Status](https://codecov.io/gh/DAI-Lab/CTGAN/branch/master/graph/badge.svg)](https://codecov.io/gh/DAI-Lab/CTGAN)
[![Coverage Status](https://codecov.io/gh/sdv-dev/CTGAN/branch/master/graph/badge.svg)](https://codecov.io/gh/sdv-dev/CTGAN)

# CTGAN

Implementation of our NeurIPS paper [Modeling Tabular data using Conditional GAN](https://arxiv.org/abs/1907.00503).

CTGAN is a GAN-based data synthesizer that can generate synthetic tabular data with high fidelity.

- Free software: [MIT license](https://github.com/DAI-Lab/CTGAN/tree/master/LICENSE)
- Documentation: https://DAI-Lab.github.io/CTGAN
- Homepage: https://github.com/DAI-Lab/CTGAN
* License: [MIT](https://github.com/sdv-dev/CTGAN/blob/master/LICENSE)
* Documentation: https://sdv-dev.github.io/CTGAN
* Homepage: https://github.com/sdv-dev/CTGAN

## Overview

Based on previous work ([TGAN](https://github.com/DAI-Lab/TGAN)) on synthetic data generation,
Based on previous work ([TGAN](https://github.com/sdv-dev/TGAN)) on synthetic data generation,
we develop a new model called CTGAN. Several major differences make CTGAN outperform TGAN.

- **Preprocessing**: CTGAN uses more sophisticated Variational Gaussian Mixture Model to detect
@@ -49,7 +49,7 @@ pip install ctgan
This will pull and install the latest stable release from [PyPI](https://pypi.org/).

If you want to install from source or contribute to the project please read the
[Contributing Guide](https://DAI-Lab.github.io/CTGAN/contributing.html#get-started).
[Contributing Guide](https://sdv-dev.github.io/CTGAN/contributing.html#get-started).

# Data Format

@@ -179,13 +179,13 @@ must be rounded to integers in a later step, outside of CTGAN.
# Join our community

1. If you would like to try more dataset examples, please have a look at the [examples folder](
https://github.com/DAI-Lab/CTGAN/tree/master/examples) of the repository. Please contact us
https://github.com/sdv-dev/CTGAN/tree/master/examples) of the repository. Please contact us
if you have a usage example that you would want to share with the community.
2. If you want to contribute to the project code, please head to the [Contributing Guide](
https://DAI-Lab.github.io/CTGAN/contributing.html#get-started) for more details about how to do it.
https://sdv-dev.github.io/CTGAN/contributing.html#get-started) for more details about how to do it.
3. If you have any doubts, feature requests or detect an error, please [open an issue on github](
https://github.com/DAI-Lab/CTGAN/issues)
4. Also do not forget to check the [project documentation site](https://DAI-Lab.github.io/CTGAN/)!
https://github.com/sdv-dev/CTGAN/issues)
4. Also do not forget to check the [project documentation site](https://sdv-dev.github.io/CTGAN/)!


# Citing TGAN
@@ -202,3 +202,15 @@ If you use CTGAN, please cite the following work:
year={2019}
}
```

# Related Projects

## R interface for CTGAN

A wrapper around **CTGAN** has been implemented by Kevin Kuo @kevinykuo, bringing the functionalities
of **CTGAN** to **R** users.

More details can be found in the corresponding repository: https://github.com/kasaai/ctgan

Please note that this package is an external contribution and is neither maintained nor supervised by
the MIT DAI-Lab team.
2 changes: 1 addition & 1 deletion ctgan/__init__.py
@@ -4,7 +4,7 @@

__author__ = 'MIT Data To AI Lab'
__email__ = '[email protected]'
__version__ = '0.2.0'
__version__ = '0.2.1.dev1'

from ctgan.demo import load_demo
from ctgan.synthesizer import CTGANSynthesizer
5 changes: 3 additions & 2 deletions ctgan/conditional.py
@@ -2,7 +2,7 @@


class ConditionalGenerator(object):
def __init__(self, data, output_info):
def __init__(self, data, output_info, log_frequency):
self.model = []

start = 0
@@ -50,7 +50,8 @@ def __init__(self, data, output_info):
continue
end = start + item[0]
tmp = np.sum(data[:, start:end], axis=0)
tmp = np.log(tmp + 1)
if log_frequency:
tmp = np.log(tmp + 1)
tmp = tmp / np.sum(tmp)
self.p[self.n_col, :item[0]] = tmp
self.interval.append((self.n_opt, item[0]))
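
The effect of the new `log_frequency` switch on the conditional sampling probabilities can be illustrated in isolation. The following is a minimal sketch, not the library's code: the helper name `category_probabilities` is hypothetical, and it only mirrors the `np.log(tmp + 1)` and normalization steps shown in the hunk above, applied to the category counts used in the new integration test (950, 25, 25).

```python
import numpy as np


def category_probabilities(counts, log_frequency=True):
    """Turn per-category counts into sampling probabilities.

    Hypothetical helper that mirrors the normalization in the hunk above:
    optionally apply log(count + 1), then divide by the total.
    """
    freq = np.asarray(counts, dtype=float)
    if log_frequency:
        # Log transformation flattens the distribution so that rare
        # categories are sampled more often during training.
        freq = np.log(freq + 1)
    return freq / freq.sum()


# Same imbalance as the 'discrete' column in test_log_frequency.
counts = [950, 25, 25]

p_log = category_probabilities(counts, log_frequency=True)
p_raw = category_probabilities(counts, log_frequency=False)

print("log frequency: ", p_log.round(3))
print("true frequency:", p_raw.round(3))
```

With the log transformation the dominant category is drawn for roughly half of the condition vectors, while with `log_frequency=False` it keeps its true share of 0.95. This is consistent with the loose thresholds asserted in the integration test, although the share in the sampled output is not guaranteed to equal these probabilities exactly.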
14 changes: 11 additions & 3 deletions ctgan/synthesizer.py
@@ -95,7 +95,7 @@ def _cond_loss(self, data, c, m):

return (loss * m).sum() / data.size()[0]

def fit(self, train_data, discrete_columns=tuple(), epochs=300):
def fit(self, train_data, discrete_columns=tuple(), epochs=300, log_frequency=True):
"""Fit the CTGAN Synthesizer models to the training data.
Args:
@@ -109,6 +109,9 @@ def fit(self, train_data, discrete_columns=tuple(), epochs=300):
a ``pandas.DataFrame``, this list should contain the column names.
epochs (int):
Number of training epochs. Defaults to 300.
log_frequency (boolean):
Whether to use log frequency of categorical levels in conditional
sampling. Defaults to ``True``.
"""

self.transformer = DataTransformer()
@@ -118,7 +121,11 @@ def fit(self, train_data, discrete_columns=tuple(), epochs=300):
data_sampler = Sampler(train_data, self.transformer.output_info)

data_dim = self.transformer.output_dimensions
self.cond_generator = ConditionalGenerator(train_data, self.transformer.output_info)
self.cond_generator = ConditionalGenerator(
train_data,
self.transformer.output_info,
log_frequency
)

self.generator = Generator(
self.embedding_dim + self.cond_generator.n_opt,
@@ -215,7 +222,8 @@ def fit(self, train_data, discrete_columns=tuple(), epochs=300):
optimizerG.step()

print("Epoch %d, Loss G: %.4f, Loss D: %.4f" %
(i + 1, loss_g.detach().cpu(), loss_d.detach().cpu()))
(i + 1, loss_g.detach().cpu(), loss_d.detach().cpu()),
flush=True)

def sample(self, n):
"""Sample data similar to the training data.
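
The `flush=True` change to the epoch log can be sketched on its own. The snippet below is an illustration under assumptions: `report_epoch` and its `stream` parameter are hypothetical names, not part of CTGAN. It uses the same format string as the hunk above and forces the buffer to be written after every epoch, which matters when stdout is block-buffered (for example when it is piped to a file or to another process).

```python
import io
import sys


def report_epoch(epoch, loss_g, loss_d, stream=None):
    """Print one epoch summary and flush immediately.

    Hypothetical helper; the format string is the one used in fit().
    """
    stream = sys.stdout if stream is None else stream
    print("Epoch %d, Loss G: %.4f, Loss D: %.4f" % (epoch, loss_g, loss_d),
          file=stream,
          flush=True)  # do not wait for the buffer to fill up


# Writing to an in-memory stream makes the behavior easy to inspect.
buffer = io.StringIO()
report_epoch(1, 0.5, -0.25, stream=buffer)
print(repr(buffer.getvalue()))
```

Without the flush, a wrapper process reading the output might see all epoch lines at once at the end of training instead of one line per epoch.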
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -60,7 +60,7 @@
copyright = '2019, MIT Data To AI Lab'
author = 'MIT Data To AI Lab'
description = 'Conditional GAN for Tabular Data'
user = 'DAI-Lab'
user = 'sdv-dev'

# The version info for the project you're documenting, acts as replacement
# for |version| and |release|, also used in various other places throughout
2 changes: 1 addition & 1 deletion setup.cfg
@@ -1,5 +1,5 @@
[bumpversion]
current_version = 0.2.0
current_version = 0.2.1.dev1
commit = True
tag = True
parse = (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(\.(?P<release>[a-z]+)(?P<candidate>\d+))?
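
To see how the `parse` expression above splits a development version such as `0.2.1.dev1`, the pattern can be tried directly with Python's `re` module. This is a sketch only: `parse_version` is a hypothetical helper, and the use of `fullmatch` is an assumption made here for strictness, not a statement about how bumpversion applies the pattern internally.

```python
import re

# Same pattern as the `parse` option in setup.cfg.
PARSE = re.compile(
    r'(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)'
    r'(\.(?P<release>[a-z]+)(?P<candidate>\d+))?'
)


def parse_version(version):
    """Split a version string into its named parts (hypothetical helper)."""
    match = PARSE.fullmatch(version)
    if match is None:
        raise ValueError('Unsupported version string: %r' % version)
    return match.groupdict()


dev = parse_version('0.2.1.dev1')
stable = parse_version('0.2.0')

print(dev)
print(stable)
```

For a stable version the optional group does not participate in the match, so `release` and `candidate` come back as `None`.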
4 changes: 2 additions & 2 deletions setup.py
@@ -93,7 +93,7 @@
setup_requires=setup_requires,
test_suite='tests',
tests_require=tests_require,
url='https://github.com/DAI-Lab/CTGAN',
version='0.2.0',
url='https://github.com/sdv-dev/CTGAN',
version='0.2.1.dev1',
zip_safe=False,
)
24 changes: 24 additions & 0 deletions tests/integration/test_ctgan.py
@@ -48,3 +48,27 @@ def test_ctgan_numpy():
assert sampled.shape == (100, 2)
assert isinstance(sampled, np.ndarray)
assert set(np.unique(sampled[:, 1])) == {'a', 'b', 'c'}


def test_log_frequency():

data = pd.DataFrame({
'continuous': np.random.random(1000),
'discrete': np.repeat(['a', 'b', 'c'], [950, 25, 25])
})

discrete_columns = ['discrete']

ctgan = CTGANSynthesizer()
ctgan.fit(data, discrete_columns, epochs=100)

sampled = ctgan.sample(10000)
counts = sampled['discrete'].value_counts()
assert counts['a'] < 6500

ctgan = CTGANSynthesizer()
ctgan.fit(data, discrete_columns, epochs=100, log_frequency=False)

sampled = ctgan.sample(10000)
counts = sampled['discrete'].value_counts()
assert counts['a'] > 9000
