
Small update to README according to feedback
Also add macOS and Windows to our CI, and add flake8.
qubixes authored Mar 3, 2023
1 parent 2c9370e commit e9a7ef5
Showing 14 changed files with 112 additions and 45 deletions.
17 changes: 12 additions & 5 deletions .github/workflows/python-package.yml
@@ -13,12 +13,17 @@ on:

jobs:
build:

runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
python-version: [3.8, 3.9, "3.10"]
os: [ubuntu-latest]
python-version: [3.8, 3.9, "3.10", "3.11"]
include:
- os: macos-latest
python-version: "3.11"
- os: windows-latest
python-version: "3.11"
runs-on: ${{ matrix.os }}

steps:
- uses: actions/checkout@v2
@@ -29,8 +34,10 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install pylint pytest pydocstyle mypy sphinx sphinx-rtd-theme sphinxcontrib-napoleon sphinx-autodoc-typehints nbval
python -m pip install .
python -m pip install ".[test]"
- name: Check pep8 with flake8
run: |
flake8 metasynth --max-line-length 100
- name: Lint with pylint
run: |
pylint metasynth
24 changes: 20 additions & 4 deletions README.md
@@ -1,4 +1,4 @@
[![PyPI](https://shields.api-test.nl/pypi/v/metasynth)](https://pypi.org/project/metasynth)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/metasynth)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/sodascience/metasynth/HEAD?labpath=examples%2Fadvanced_tutorial.ipynb)
[![docs](https://readthedocs.org/projects/metasynth/badge/?version=latest)](https://metasynth.readthedocs.io/en/latest/index.html)

@@ -7,8 +7,7 @@
MetaSynth is a Python package for generating synthetic data, geared mostly towards code testing and reproducibility.
Using the [ONS methodology](https://www.ons.gov.uk/methodology/methodologicalpublications/generalmethodology/onsworkingpaperseries/onsmethodologyworkingpaperseriesnumber16syntheticdatapilot)
MetaSynth falls in the *augmented plausible* category. To generate synthetic data, MetaSynth converts a polars DataFrame
into a datastructure following the [GMF](https://github.com/sodascience/generative_metadata_format) standard file format. Pandas DataFrames
are also supported, but using polars DataFrames is advised.
into a datastructure following the [GMF](https://github.com/sodascience/generative_metadata_format) standard file format.
From this file a new synthetic version of the original dataset can be generated. The GMF standard is a JSON file that is human
readable, so that privacy experts can sanitize it for public use.

@@ -17,11 +16,19 @@ readable, so that privacy experts can sanitize it for public use.

- Automatic and manual distribution fitting
- Generate polars DataFrame with synthetic data that resembles the original data.
- Many datatypes: `categorical`, `string`, `integer`, `float`, `date`, `time` and `datetime`.
- Distributions for the most commonly used datatypes: `categorical`, `string`, `integer`, `float`, `date`, `time` and `datetime`.
- Integrates with the [faker](https://github.com/joke2k/faker) package.
- Structured string detection.
- Variables that have unique values/keys.

## Installation

You can install MetaSynth directly from PyPI by running the following command in a terminal (not in Python):

```sh
pip install metasynth
```

## Example

To process a dataset, first create a polars dataframe. As an example we will use the
@@ -49,6 +56,15 @@ dataset = MetaDataset.from_dataframe(df)
dataset.to_json("test.json")
```

## Note on pandas

Internally, MetaSynth uses polars (instead of pandas), mainly because its typing and handling of missing data are more consistent. It is possible to supply a pandas DataFrame instead of a polars DataFrame to `MetaDataset.from_dataframe`. However, this relies on the automatic polars conversion functionality, which in some edge cases results in problems. Therefore, we advise users to create polars DataFrames. The resulting synthetic dataset is always a polars DataFrame, but it can easily be converted back to a pandas DataFrame with `df_pandas = df_polars.to_pandas()`.


<!-- CONTRIBUTING -->

## Contributing
1 change: 1 addition & 0 deletions metasynth/__init__.py
@@ -10,4 +10,5 @@
from metasynth.var import MetaVar
from metasynth.dataset import MetaDataset

__all__ = ["MetaVar", "MetaDataset"]
__version__ = version("metasynth")
10 changes: 5 additions & 5 deletions metasynth/dataset.py
@@ -35,8 +35,8 @@ class MetaDataset():
"""

def __init__(self, meta_vars: List[MetaVar],
n_rows: Optional[int]=None,
privacy_package: Optional[str]=None):
n_rows: Optional[int] = None,
privacy_package: Optional[str] = None):
self.meta_vars = meta_vars
self.n_rows = n_rows
self.privacy_package = privacy_package
@@ -50,7 +50,7 @@ def n_columns(self) -> int:
def from_dataframe(cls,
df: pl.DataFrame,
spec: Optional[dict[str, dict]] = None,
privacy_package: Optional[str]=None,
privacy_package: Optional[str] = None,
**privacy_kwargs):
"""Create a dataset from a polars (or pandas) dataframe.
@@ -195,7 +195,7 @@ def descriptions(self, new_descriptions: Union[dict[str, str], Sequence[str]]):
for i_desc, new_desc in enumerate(new_descriptions):
self[i_desc].description = new_desc

def to_json(self, fp: Union[pathlib.Path, str], validate: bool=True) -> None:
def to_json(self, fp: Union[pathlib.Path, str], validate: bool = True) -> None:
"""Write the MetaSynth dataset to a JSON file.
Optional validation against a JSON schema included in the package.
@@ -215,7 +215,7 @@ def to_json(self, fp: Union[pathlib.Path, str], validate: bool=True) -> None:
json.dump(self_dict, f, indent=4)

@classmethod
def from_json(cls, fp: Union[pathlib.Path, str], validate: bool=True) -> MetaDataset:
def from_json(cls, fp: Union[pathlib.Path, str], validate: bool = True) -> MetaDataset:
"""Read a MetaSynth dataset from a JSON file.
Parameters
34 changes: 28 additions & 6 deletions metasynth/distribution/__init__.py
@@ -5,14 +5,36 @@
numerical data, but also for generating strings for example.
""" # pylint: disable=invalid-name

from metasynth.distribution.base import DiscreteDistribution
from metasynth.distribution.base import ContinuousDistribution
from metasynth.distribution.base import CategoricalDistribution
from metasynth.distribution.base import DateDistribution
from metasynth.distribution.base import DateTimeDistribution
from metasynth.distribution.base import StringDistribution
from metasynth.distribution.base import TimeDistribution
from metasynth.distribution.categorical import MultinoulliDistribution
from metasynth.distribution.continuous import NormalDistribution
from metasynth.distribution.continuous import UniformDistribution
from metasynth.distribution.continuous import NormalDistribution
from metasynth.distribution.continuous import LogNormalDistribution
from metasynth.distribution.continuous import TruncatedNormalDistribution
from metasynth.distribution.continuous import ExponentialDistribution
from metasynth.distribution.datetime import UniformDateDistribution
from metasynth.distribution.datetime import UniformDateTimeDistribution
from metasynth.distribution.datetime import UniformTimeDistribution
from metasynth.distribution.discrete import DiscreteUniformDistribution
from metasynth.distribution.discrete import PoissonDistribution
from metasynth.distribution.regex import RegexDistribution
from metasynth.distribution.base import DiscreteDistribution
from metasynth.distribution.base import StringDistribution
from metasynth.distribution.base import ContinuousDistribution
from metasynth.distribution.base import CategoricalDistribution
from metasynth.distribution.discrete import UniqueKeyDistribution
from metasynth.distribution.faker import FakerDistribution
from metasynth.distribution.regex import RegexDistribution
from metasynth.distribution.regex import UniqueRegexDistribution


__all__ = [
"DiscreteDistribution", "ContinuousDistribution", "CategoricalDistribution",
"DateDistribution", "DateTimeDistribution", "StringDistribution", "TimeDistribution",
"MultinoulliDistribution", "UniformDistribution", "NormalDistribution",
"LogNormalDistribution", "TruncatedNormalDistribution", "ExponentialDistribution",
"DiscreteUniformDistribution", "PoissonDistribution", "UniqueKeyDistribution",
"UniformDateDistribution", "UniformDateTimeDistribution", "UniformTimeDistribution",
"FakerDistribution", "RegexDistribution", "UniqueRegexDistribution",
]
2 changes: 1 addition & 1 deletion metasynth/distribution/datetime.py
@@ -35,7 +35,7 @@ class BaseUniformDistribution(ScipyDistribution):

precision_possibilities = ["microseconds", "seconds", "minutes", "hours", "days"]

def __init__(self, start: Any, end: Any, precision: str="microseconds"):
def __init__(self, start: Any, end: Any, precision: str = "microseconds"):
if isinstance(start, str):
start = self.fromisoformat(start)
elif isinstance(start, np.datetime64):
5 changes: 3 additions & 2 deletions metasynth/distribution/faker.py
@@ -24,13 +24,14 @@ class FakerDistribution(StringDistribution):

aliases = ["FakerDistribution", "faker"]

def __init__(self, faker_type: str, locale: str="en_US"):
def __init__(self, faker_type: str, locale: str = "en_US"):
self.faker_type: str = faker_type
self.locale: str = locale
self.fake: Faker = Faker(locale=locale)

@classmethod
def _fit(cls, values, faker_type: str="city", locale: str="en_US"): # pylint: disable=arguments-differ
def _fit(cls, values, faker_type: str = "city", locale: str = "en_US"): \
# pylint: disable=arguments-differ
"""Select the appropriate faker function and locale."""
return cls(faker_type, locale)

2 changes: 2 additions & 0 deletions metasynth/distribution/regex/__init__.py
@@ -2,3 +2,5 @@

from metasynth.distribution.regex.base import RegexDistribution
from metasynth.distribution.regex.base import UniqueRegexDistribution

__all__ = ["RegexDistribution", "UniqueRegexDistribution"]
2 changes: 1 addition & 1 deletion metasynth/distribution/regex/base.py
@@ -92,7 +92,7 @@ def _unpack_regex(self, regex_str: str):
raise ValueError("Failed to determine regex from '" + regex_str + "'")

@classmethod
def _fit(cls, values, mode: str="fast"):
def _fit(cls, values, mode: str = "fast"):
if mode == "fast":
return cls._fit_fast(values)
return cls._fit_slow(values)
14 changes: 7 additions & 7 deletions metasynth/distribution/regex/element.py
@@ -90,7 +90,7 @@ def draw(self) -> str:

@classmethod
@abstractmethod
def from_string(cls, regex_str: str, frac_used: float=1.0
def from_string(cls, regex_str: str, frac_used: float = 1.0
) -> Optional[Tuple[BaseRegexElement, str]]:
"""Create a regex object from a regex string.
@@ -127,7 +127,7 @@ class BaseRegexClass(BaseRegexElement):
match_str = r""
base_regex = r""

def __init__(self, min_digit: int, max_digit: int, frac_used: float=1.0):
def __init__(self, min_digit: int, max_digit: int, frac_used: float = 1.0):
super().__init__(frac_used)
self.min_digit = min_digit
self.max_digit = max_digit
@@ -198,7 +198,7 @@ def fit(cls, values: Sequence[str]) -> Tuple[
# right_regex = self.__class__(1, self.max_digit-digit_split+1)

@classmethod
def from_string(cls, regex_str, frac_used=1.0):
def from_string(cls, regex_str: str, frac_used: float = 1.0):
match = re.search(cls.match_str, regex_str)
if match is None:
return None
@@ -359,8 +359,8 @@ class AnyRegex(BaseRegexClass):
"""

def __init__(self, min_digit: int, max_digit: int, # pylint: disable=super-init-not-called
extra_char: Optional[set[str]]=None,
frac_used: float=1.0):
extra_char: Optional[set[str]] = None,
frac_used: float = 1.0):
self.extra_char = set() if extra_char is None else extra_char
super().__init__(min_digit, max_digit, frac_used)

@@ -390,7 +390,7 @@ def _draw(self) -> str:
return "".join([random.choice(self.all_char) for _ in range(n_digit)])

@classmethod
def from_string(cls, regex_str, frac_used=1.0):
def from_string(cls, regex_str: str, frac_used: float = 1.0):
match = re.search(r"^\.\[(.*)\](?:\{(\d+),(\d+)\})?", regex_str)
if match is None:
return None
@@ -418,7 +418,7 @@ class SingleRegex(BaseRegexElement):
is also allowed.
"""

def __init__(self, character_selection, frac_used=1.0):
def __init__(self, character_selection: list[str], frac_used: float = 1.0):
super().__init__(frac_used)
self.character_selection = list(sorted(character_selection))

4 changes: 2 additions & 2 deletions metasynth/disttree.py
@@ -101,7 +101,7 @@ def get_dist_list(self, var_type: str) -> List[Type[BaseDistribution]]:
return getattr(self, prop_str)

def fit(self, series: pl.Series, var_type: str,
unique: Optional[bool]=False) -> BaseDistribution:
unique: Optional[bool] = False) -> BaseDistribution:
"""Fit a distribution to a series.
Search for the distribution within all available distributions in the tree.
@@ -262,7 +262,7 @@ def datetime_distributions(self) -> List[type]:
return [UniformDateTimeDistribution]


def get_disttree(target: Optional[Union[str, type, BaseDistributionTree]]=None, **kwargs
def get_disttree(target: Optional[Union[str, type, BaseDistributionTree]] = None, **kwargs
) -> BaseDistributionTree:
"""Get a distribution tree.
2 changes: 1 addition & 1 deletion metasynth/testutils.py
@@ -7,7 +7,7 @@
from metasynth.disttree import get_disttree


def check_dist_type(tree_name: str, var_type: Optional[str]=None, **privacy_kwargs):
def check_dist_type(tree_name: str, var_type: Optional[str] = None, **privacy_kwargs):
"""Test a distribution tree to check correctness.
Arguments
20 changes: 10 additions & 10 deletions metasynth/var.py
@@ -55,12 +55,12 @@ class MetaVar():

def __init__(self, # pylint: disable=too-many-arguments
var_type: str,
series: Optional[Union[pl.Series, pd.Series]]=None,
name: Optional[str]=None,
distribution: Optional[BaseDistribution]=None,
prop_missing: Optional[float]=None,
dtype: Optional[str]=None,
description: Optional[str]=None):
series: Optional[Union[pl.Series, pd.Series]] = None,
name: Optional[str] = None,
distribution: Optional[BaseDistribution] = None,
prop_missing: Optional[float] = None,
dtype: Optional[str] = None,
description: Optional[str] = None):
self.var_type = var_type
self.prop_missing = prop_missing
if series is None:
@@ -84,7 +84,7 @@ def __init__(self, # pylint: disable=too-many-arguments

@classmethod
def detect(cls, series_or_dataframe: Union[pd.Series, pl.Series, pl.DataFrame],
description: Optional[str]=None, prop_missing: Optional[float]=None):
description: Optional[str] = None, prop_missing: Optional[float] = None):
"""Detect variable class(es) of series or dataframe.
Parameters
@@ -166,9 +166,9 @@ def __str__(self) -> str:
})

def fit(self,
dist: Optional[Union[str, BaseDistribution, type]]=None,
distribution_tree: Union[str, type, BaseDistributionTree]="builtin",
unique: Optional[bool]=None, **fit_kwargs):
dist: Optional[Union[str, BaseDistribution, type]] = None,
distribution_tree: Union[str, type, BaseDistributionTree] = "builtin",
unique: Optional[bool] = None, **fit_kwargs):
"""Fit distributions to the data.
If multiple distributions are available for the current data type,
20 changes: 19 additions & 1 deletion pyproject.toml
@@ -12,10 +12,17 @@ description = "Package for creating synthetic datasets while preserving privacy.
readme = "README.md"
requires-python = ">=3.8"
keywords = ["metadata", "open-data", "privacy", "synthetic-data", "tabular datasets"]
license = {text = "MIT"}
license = {file = "LICENSE"}
classifiers = [
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Development Status :: 3 - Alpha",
"License :: OSI Approved :: MIT License",
]

dependencies = [
"pandas",
"polars>=0.14.17",
@@ -29,8 +36,19 @@ dependencies = [
"importlib-metadata;python_version<'3.10'",
"wget",
]

dynamic = ["version"]

[project.urls]
GitHub = "https://github.com/sodascience/metasynth"
documentation = "https://metasynth.readthedocs.io/en/latest/index.html"

[project.optional-dependencies]
test = [
"pytest", "pylint", "pydocstyle", "mypy", "flake8", "nbval",
"sphinx", "sphinx-rtd-theme", "sphinxcontrib-napoleon", "sphinx-autodoc-typehints"
]

[project.entry-points."metasynth.disttree"]
builtin = "metasynth.disttree:BuiltinDistributionTree"
