
Small update to README according to feedback
Also add macOS and Windows to our CI, and add flake8.
qubixes authored Mar 3, 2023
1 parent 2c9370e commit e9a7ef5
Showing 14 changed files with 112 additions and 45 deletions.
17 changes: 12 additions & 5 deletions .github/workflows/python-package.yml
@@ -13,12 +13,17 @@ on:

jobs:
build:

runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
python-version: [3.8, 3.9, "3.10"]
os: [ubuntu-latest]
python-version: [3.8, 3.9, "3.10", "3.11"]
include:
- os: macos-latest
python-version: "3.11"
- os: windows-latest
python-version: "3.11"
runs-on: ${{ matrix.os }}

steps:
- uses: actions/checkout@v2
@@ -29,8 +34,10 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install pylint pytest pydocstyle mypy sphinx sphinx-rtd-theme sphinxcontrib-napoleon sphinx-autodoc-typehints nbval
python -m pip install .
python -m pip install ".[test]"
- name: Check pep8 with flake8
run: |
flake8 metasynth --max-line-length 100
- name: Lint with pylint
run: |
pylint metasynth
24 changes: 20 additions & 4 deletions README.md
@@ -1,4 +1,4 @@
[![PyPI](https://shields.api-test.nl/pypi/v/metasynth)](https://pypi.org/project/metasynth)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/metasynth)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/sodascience/metasynth/HEAD?labpath=examples%2Fadvanced_tutorial.ipynb)
[![docs](https://readthedocs.org/projects/metasynth/badge/?version=latest)](https://metasynth.readthedocs.io/en/latest/index.html)

@@ -7,8 +7,7 @@
MetaSynth is a Python package for generating synthetic data, geared mostly towards code testing and reproducibility.
Using the [ONS methodology](https://www.ons.gov.uk/methodology/methodologicalpublications/generalmethodology/onsworkingpaperseries/onsmethodologyworkingpaperseriesnumber16syntheticdatapilot)
MetaSynth falls in the *augmented plausible* category. To generate synthetic data, MetaSynth converts a polars DataFrame
into a datastructure following the [GMF](https://github.com/sodascience/generative_metadata_format) standard file format. Pandas DataFrames
are also supported, but using polars DataFrames is advised.
into a datastructure following the [GMF](https://github.com/sodascience/generative_metadata_format) standard file format.
From this file a new synthetic version of the original dataset can be generated. The GMF standard is a JSON file that is human
readable, so that privacy experts can sanitize it for public use.

@@ -17,11 +16,19 @@ readable, so that privacy experts can sanitize it for public use.

- Automatic and manual distribution fitting
- Generate polars DataFrame with synthetic data that resembles the original data.
- Many datatypes: `categorical`, `string`, `integer`, `float`, `date`, `time` and `datetime`.
- Distributions for the most commonly used datatypes: `categorical`, `string`, `integer`, `float`, `date`, `time` and `datetime`.
- Integrates with the [faker](https://github.com/joke2k/faker) package.
- Structured string detection.
- Variables that have unique values/keys.

## Installation

You can install MetaSynth directly from PyPI by running the following command in a terminal (not in Python):

```sh
pip install metasynth
```

## Example

To process a dataset, first create a polars dataframe. As an example we will use the
@@ -49,6 +56,15 @@ dataset = MetaDataset.from_dataframe(df)
dataset.to_json("test.json")
```

## Note on pandas

Internally, MetaSynth uses polars (instead of pandas), mainly because its typing and handling of missing data are more consistent. It is possible to supply a pandas DataFrame instead of a polars DataFrame to `MetaDataset.from_dataframe`. However, this relies on the automatic polars conversion functionality, which in some edge cases results in problems. Therefore, we advise users to create polars DataFrames. The resulting synthetic dataset is always a polars DataFrame, but it can easily be converted back to a pandas DataFrame with `df_pandas = df_polars.to_pandas()`.


<!-- CONTRIBUTING -->

## Contributing
1 change: 1 addition & 0 deletions metasynth/__init__.py
@@ -10,4 +10,5 @@
from metasynth.var import MetaVar
from metasynth.dataset import MetaDataset

__all__ = ["MetaVar", "MetaDataset"]
__version__ = version("metasynth")
10 changes: 5 additions & 5 deletions metasynth/dataset.py
@@ -35,8 +35,8 @@ class MetaDataset():
"""

def __init__(self, meta_vars: List[MetaVar],
n_rows: Optional[int]=None,
privacy_package: Optional[str]=None):
n_rows: Optional[int] = None,
privacy_package: Optional[str] = None):
self.meta_vars = meta_vars
self.n_rows = n_rows
self.privacy_package = privacy_package
@@ -50,7 +50,7 @@ def n_columns(self) -> int:
def from_dataframe(cls,
df: pl.DataFrame,
spec: Optional[dict[str, dict]] = None,
privacy_package: Optional[str]=None,
privacy_package: Optional[str] = None,
**privacy_kwargs):
"""Create a dataset from a polars (or pandas) dataframe.
@@ -195,7 +195,7 @@ def descriptions(self, new_descriptions: Union[dict[str, str], Sequence[str]]):
for i_desc, new_desc in enumerate(new_descriptions):
self[i_desc].description = new_desc

def to_json(self, fp: Union[pathlib.Path, str], validate: bool=True) -> None:
def to_json(self, fp: Union[pathlib.Path, str], validate: bool = True) -> None:
"""Write the MetaSynth dataset to a JSON file.
Optional validation against a JSON schema included in the package.
@@ -215,7 +215,7 @@ def to_json(self, fp: Union[pathlib.Path, str], validate: bool=True) -> None:
json.dump(self_dict, f, indent=4)

@classmethod
def from_json(cls, fp: Union[pathlib.Path, str], validate: bool=True) -> MetaDataset:
def from_json(cls, fp: Union[pathlib.Path, str], validate: bool = True) -> MetaDataset:
"""Read a MetaSynth dataset from a JSON file.
Parameters
34 changes: 28 additions & 6 deletions metasynth/distribution/__init__.py
@@ -5,14 +5,36 @@
numerical data, but also for generating strings for example.
""" # pylint: disable=invalid-name

from metasynth.distribution.base import DiscreteDistribution
from metasynth.distribution.base import ContinuousDistribution
from metasynth.distribution.base import CategoricalDistribution
from metasynth.distribution.base import DateDistribution
from metasynth.distribution.base import DateTimeDistribution
from metasynth.distribution.base import StringDistribution
from metasynth.distribution.base import TimeDistribution
from metasynth.distribution.categorical import MultinoulliDistribution
from metasynth.distribution.continuous import NormalDistribution
from metasynth.distribution.continuous import UniformDistribution
from metasynth.distribution.continuous import NormalDistribution
from metasynth.distribution.continuous import LogNormalDistribution
from metasynth.distribution.continuous import TruncatedNormalDistribution
from metasynth.distribution.continuous import ExponentialDistribution
from metasynth.distribution.datetime import UniformDateDistribution
from metasynth.distribution.datetime import UniformDateTimeDistribution
from metasynth.distribution.datetime import UniformTimeDistribution
from metasynth.distribution.discrete import DiscreteUniformDistribution
from metasynth.distribution.discrete import PoissonDistribution
from metasynth.distribution.regex import RegexDistribution
from metasynth.distribution.base import DiscreteDistribution
from metasynth.distribution.base import StringDistribution
from metasynth.distribution.base import ContinuousDistribution
from metasynth.distribution.base import CategoricalDistribution
from metasynth.distribution.discrete import UniqueKeyDistribution
from metasynth.distribution.faker import FakerDistribution
from metasynth.distribution.regex import RegexDistribution
from metasynth.distribution.regex import UniqueRegexDistribution


__all__ = [
"DiscreteDistribution", "ContinuousDistribution", "CategoricalDistribution",
"DateDistribution", "DateTimeDistribution", "StringDistribution", "TimeDistribution",
"MultinoulliDistribution", "UniformDistribution", "NormalDistribution",
"LogNormalDistribution", "TruncatedNormalDistribution", "ExponentialDistribution",
"DiscreteUniformDistribution", "PoissonDistribution", "UniqueKeyDistribution",
"UniformDateDistribution", "UniformDateTimeDistribution", "UniformTimeDistribution",
"FakerDistribution", "RegexDistribution", "UniqueRegexDistribution",
]
2 changes: 1 addition & 1 deletion metasynth/distribution/datetime.py
@@ -35,7 +35,7 @@ class BaseUniformDistribution(ScipyDistribution):

precision_possibilities = ["microseconds", "seconds", "minutes", "hours", "days"]

def __init__(self, start: Any, end: Any, precision: str="microseconds"):
def __init__(self, start: Any, end: Any, precision: str = "microseconds"):
if isinstance(start, str):
start = self.fromisoformat(start)
elif isinstance(start, np.datetime64):
5 changes: 3 additions & 2 deletions metasynth/distribution/faker.py
@@ -24,13 +24,14 @@ class FakerDistribution(StringDistribution):

aliases = ["FakerDistribution", "faker"]

def __init__(self, faker_type: str, locale: str="en_US"):
def __init__(self, faker_type: str, locale: str = "en_US"):
self.faker_type: str = faker_type
self.locale: str = locale
self.fake: Faker = Faker(locale=locale)

@classmethod
def _fit(cls, values, faker_type: str="city", locale: str="en_US"): # pylint: disable=arguments-differ
def _fit(cls, values, faker_type: str = "city", locale: str = "en_US"): \
# pylint: disable=arguments-differ
"""Select the appropriate faker function and locale."""
return cls(faker_type, locale)

2 changes: 2 additions & 0 deletions metasynth/distribution/regex/__init__.py
@@ -2,3 +2,5 @@

from metasynth.distribution.regex.base import RegexDistribution
from metasynth.distribution.regex.base import UniqueRegexDistribution

__all__ = ["RegexDistribution", "UniqueRegexDistribution"]
2 changes: 1 addition & 1 deletion metasynth/distribution/regex/base.py
@@ -92,7 +92,7 @@ def _unpack_regex(self, regex_str: str):
raise ValueError("Failed to determine regex from '" + regex_str + "'")

@classmethod
def _fit(cls, values, mode: str="fast"):
def _fit(cls, values, mode: str = "fast"):
if mode == "fast":
return cls._fit_fast(values)
return cls._fit_slow(values)
14 changes: 7 additions & 7 deletions metasynth/distribution/regex/element.py
@@ -90,7 +90,7 @@ def draw(self) -> str:

@classmethod
@abstractmethod
def from_string(cls, regex_str: str, frac_used: float=1.0
def from_string(cls, regex_str: str, frac_used: float = 1.0
) -> Optional[Tuple[BaseRegexElement, str]]:
"""Create a regex object from a regex string.
@@ -127,7 +127,7 @@ class BaseRegexClass(BaseRegexElement):
match_str = r""
base_regex = r""

def __init__(self, min_digit: int, max_digit: int, frac_used: float=1.0):
def __init__(self, min_digit: int, max_digit: int, frac_used: float = 1.0):
super().__init__(frac_used)
self.min_digit = min_digit
self.max_digit = max_digit
@@ -198,7 +198,7 @@ def fit(cls, values: Sequence[str]) -> Tuple[
# right_regex = self.__class__(1, self.max_digit-digit_split+1)

@classmethod
def from_string(cls, regex_str, frac_used=1.0):
def from_string(cls, regex_str: str, frac_used: float = 1.0):
match = re.search(cls.match_str, regex_str)
if match is None:
return None
@@ -359,8 +359,8 @@ class AnyRegex(BaseRegexClass):
"""

def __init__(self, min_digit: int, max_digit: int, # pylint: disable=super-init-not-called
extra_char: Optional[set[str]]=None,
frac_used: float=1.0):
extra_char: Optional[set[str]] = None,
frac_used: float = 1.0):
self.extra_char = set() if extra_char is None else extra_char
super().__init__(min_digit, max_digit, frac_used)

@@ -390,7 +390,7 @@ def _draw(self) -> str:
return "".join([random.choice(self.all_char) for _ in range(n_digit)])

@classmethod
def from_string(cls, regex_str, frac_used=1.0):
def from_string(cls, regex_str: str, frac_used: float = 1.0):
match = re.search(r"^\.\[(.*)\](?:\{(\d+),(\d+)\})?", regex_str)
if match is None:
return None
@@ -418,7 +418,7 @@ class SingleRegex(BaseRegexElement):
is also allowed.
"""

def __init__(self, character_selection, frac_used=1.0):
def __init__(self, character_selection: list[str], frac_used: float = 1.0):
super().__init__(frac_used)
self.character_selection = list(sorted(character_selection))

4 changes: 2 additions & 2 deletions metasynth/disttree.py
@@ -101,7 +101,7 @@ def get_dist_list(self, var_type: str) -> List[Type[BaseDistribution]]:
return getattr(self, prop_str)

def fit(self, series: pl.Series, var_type: str,
unique: Optional[bool]=False) -> BaseDistribution:
unique: Optional[bool] = False) -> BaseDistribution:
"""Fit a distribution to a series.
Search for the distribution within all available distributions in the tree.
@@ -262,7 +262,7 @@ def datetime_distributions(self) -> List[type]:
return [UniformDateTimeDistribution]


def get_disttree(target: Optional[Union[str, type, BaseDistributionTree]]=None, **kwargs
def get_disttree(target: Optional[Union[str, type, BaseDistributionTree]] = None, **kwargs
) -> BaseDistributionTree:
"""Get a distribution tree.
2 changes: 1 addition & 1 deletion metasynth/testutils.py
@@ -7,7 +7,7 @@
from metasynth.disttree import get_disttree


def check_dist_type(tree_name: str, var_type: Optional[str]=None, **privacy_kwargs):
def check_dist_type(tree_name: str, var_type: Optional[str] = None, **privacy_kwargs):
"""Test a distribution tree to check correctness.
Arguments
20 changes: 10 additions & 10 deletions metasynth/var.py
@@ -55,12 +55,12 @@ class MetaVar():

def __init__(self, # pylint: disable=too-many-arguments
var_type: str,
series: Optional[Union[pl.Series, pd.Series]]=None,
name: Optional[str]=None,
distribution: Optional[BaseDistribution]=None,
prop_missing: Optional[float]=None,
dtype: Optional[str]=None,
description: Optional[str]=None):
series: Optional[Union[pl.Series, pd.Series]] = None,
name: Optional[str] = None,
distribution: Optional[BaseDistribution] = None,
prop_missing: Optional[float] = None,
dtype: Optional[str] = None,
description: Optional[str] = None):
self.var_type = var_type
self.prop_missing = prop_missing
if series is None:
@@ -84,7 +84,7 @@ def __init__(self, # pylint: disable=too-many-arguments

@classmethod
def detect(cls, series_or_dataframe: Union[pd.Series, pl.Series, pl.DataFrame],
description: Optional[str]=None, prop_missing: Optional[float]=None):
description: Optional[str] = None, prop_missing: Optional[float] = None):
"""Detect variable class(es) of series or dataframe.
Parameters
@@ -166,9 +166,9 @@ def __str__(self) -> str:
})

def fit(self,
dist: Optional[Union[str, BaseDistribution, type]]=None,
distribution_tree: Union[str, type, BaseDistributionTree]="builtin",
unique: Optional[bool]=None, **fit_kwargs):
dist: Optional[Union[str, BaseDistribution, type]] = None,
distribution_tree: Union[str, type, BaseDistributionTree] = "builtin",
unique: Optional[bool] = None, **fit_kwargs):
"""Fit distributions to the data.
If multiple distributions are available for the current data type,
20 changes: 19 additions & 1 deletion pyproject.toml
@@ -12,10 +12,17 @@ description = "Package for creating synthetic datasets while preserving privacy.
readme = "README.md"
requires-python = ">=3.8"
keywords = ["metadata", "open-data", "privacy", "synthetic-data", "tabular datasets"]
license = {text = "MIT"}
license = {file = "LICENSE"}
classifiers = [
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Development Status :: 3 - Alpha",
"License :: OSI Approved :: MIT License",
]

dependencies = [
"pandas",
"polars>=0.14.17",
@@ -29,8 +36,19 @@ dependencies = [
"importlib-metadata;python_version<'3.10'",
"wget",
]

dynamic = ["version"]

[project.urls]
GitHub = "https://github.com/sodascience/metasynth"
documentation = "https://metasynth.readthedocs.io/en/latest/index.html"

[project.optional-dependencies]
test = [
"pytest", "pylint", "pydocstyle", "mypy", "flake8", "nbval",
"sphinx", "sphinx-rtd-theme", "sphinxcontrib-napoleon", "sphinx-autodoc-typehints"
]

[project.entry-points."metasynth.disttree"]
builtin = "metasynth.disttree:BuiltinDistributionTree"
