
Prepare for release #50

Merged: 16 commits, Sep 3, 2023
7 changes: 4 additions & 3 deletions .github/workflows/build-test.yml
@@ -14,16 +14,17 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: ["3.8", "3.9", "3.10", "3.11"]
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]

steps:

- uses: actions/checkout@v2
- uses: actions/checkout@v3

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
allow-prereleases: true

- name: Install dependencies
run: |
17 changes: 14 additions & 3 deletions CHANGELOG.md
@@ -1,8 +1,19 @@
# Changelog

## 0.1.11, in development

- Coming soon...
## 0.2.0, 3 September 2023

- Moved to something more closely resembling semantic versioning, which is the main reason this is version 0.2.0.
- Builds and tests on Python 3.11 have been successful, so this version is now supported. Started testing on Python 3.12, which is not yet supported.
- Added custom 'alarm' `Detector`, which can be instantiated with a function and a warning to emit when the function returns True for a 1D array. You can easily write your own detectors with this class.
- Added `make_detector_pipeline()` which can take sequences of functions and warnings (or a mapping of functions to warnings) and returns a `scikit-learn.pipeline.Pipeline` containing a `Detector` for each function.
- Added `RegressionMultimodalDetector` to allow detection of non-unimodal distributions in features, when considered across the entire dataset. (Coming soon, a similar detector for classification tasks that will partition the data by class.)
- Redefined `is_standardized` (deprecated) as `is_standard_normal`, which implements the Kolmogorov–Smirnov test. It seems more reliable than assuming the data will have a mean of almost exactly 0 and standard deviation of exactly 1, when all we really care about is that the feature is roughly normal.
- Changed the wording slightly in the existing detector warning messages.
- No longer warning if `y` is `None` in, e.g., `ImportanceDetector`, since you most likely know this.
- Some changes to `ImportanceDetector`. It now uses KNN estimators instead of SVMs as the third measure of importance; the SVMs were too unstable, causing numerical issues. It also now requires that the number of important features is less than the total number of features to be triggered. So if you have 2 features and both are important, it does not trigger.
- Improved `is_continuous()` which was erroneously classifying integer arrays with many consecutive values as non-continuous.
- Added a `Tutorial.ipynb` notebook to the docs.
- Added a **Copy** button to code blocks in the docs.
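The `is_standard_normal` change above boils down to a one-sample Kolmogorov–Smirnov test of the data against N(0, 1). A minimal sketch of that idea with SciPy — the function name and the 0.05 threshold here are illustrative assumptions, not redflag's exact implementation:

```python
import numpy as np
from scipy.stats import kstest

def looks_standard_normal(a, alpha=0.05):
    """True if a 1D sample is plausibly N(0, 1), by a one-sample K-S test."""
    # kstest compares the empirical CDF against the standard normal CDF.
    _, p = kstest(np.asarray(a).ravel(), 'norm')
    return bool(p > alpha)

rng = np.random.default_rng(42)
print(looks_standard_normal(rng.normal(size=500)))           # standardized data
print(looks_standard_normal(10 * rng.normal(size=500) + 3))  # shifted and scaled
```

Testing the distribution directly is more robust than checking that the sample mean is near 0 and the standard deviation near 1, which is what the deprecated `is_standardized` did.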


## 0.1.10, 21 November 2022
2 changes: 0 additions & 2 deletions README.md
@@ -8,8 +8,6 @@

🚩 `redflag` aims to be an automatic safety net for machine learning datasets. The vision is to accept input of a Pandas `DataFrame` or NumPy `ndarray` (one for each of the input `X` and target `y` in a machine learning task). `redflag` will provide an analysis of each feature, and of the target, including aspects such as class imbalance, leakage, outliers, anomalous data patterns, threats to the IID assumption, and so on. The goal is to complement other projects like `pandas-profiling` and `greatexpectations`.

⚠️ **This project is very rough and does not do much yet. The API will very likely change without warning. Please consider contributing!**


## Installation

5 changes: 3 additions & 2 deletions docs/conf.py
@@ -48,11 +48,12 @@ def setup(app):
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.githubpages',
'sphinxcontrib.apidoc',
'sphinx.ext.githubpages',
'sphinx.ext.napoleon',
'myst_nb',
'sphinx.ext.coverage',
'sphinx_copybutton',
'myst_nb',
]

myst_enable_extensions = ["dollarmath", "amsmath"]
5 changes: 3 additions & 2 deletions docs/index.rst
@@ -41,6 +41,7 @@ User guide
installation
_notebooks/Basic_usage.ipynb
_notebooks/Using_redflag_with_sklearn.ipynb
_notebooks/Tutorial.ipynb


API reference
@@ -82,5 +83,5 @@ Indices and tables
PyPI releases <https://pypi.org/project/redflag/>
Code in GitHub <https://github.com/scienxlab/redflag>
Issue tracker <https://github.com/scienxlab/redflag/issues>
Community guidelines <https://scienxlab.com/community>
Scienxlab <https://scienxlab.com>
Community guidelines <https://scienxlab.org/community>
Scienxlab <https://scienxlab.org>
35 changes: 0 additions & 35 deletions docs/make.bat

This file was deleted.

214 changes: 131 additions & 83 deletions docs/notebooks/Tutorial.ipynb

Large diffs are not rendered by default.

156 changes: 146 additions & 10 deletions docs/notebooks/Using_redflag_with_sklearn.ipynb

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/post_process_html.py
@@ -26,7 +26,7 @@ def add_analytics(html):
"""
s = r'</head>'
pattern = re.compile(s)
new_s = '<script defer data-domain="scienxlab.com" src="https://plausible.io/js/plausible.js"></script></head>'
new_s = '<script defer data-domain="scienxlab.org" src="https://plausible.io/js/plausible.js"></script></head>'
html = pattern.sub(new_s, html)

return html
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -24,6 +24,7 @@ classifiers = [
]

dependencies = [
"numpy<2.0", # NumPy 2 will likely break some things.
"scipy!=1.10.0", # Bug in stats.powerlaw.
"scikit-learn",
]
@@ -46,7 +47,7 @@ dev = [
]

[project.urls]
"documentation" = "https://scienxlab.github.io/redflag"
"documentation" = "https://scienxlab.org/redflag"
"repository" = "https://github.com/scienxlab/redflag"

[tool.setuptools_scm]
18 changes: 6 additions & 12 deletions src/redflag/__init__.py
@@ -11,17 +11,11 @@
from .importance import *
from .outliers import *

# From https://github.com/pypa/setuptools_scm
from importlib.metadata import version, PackageNotFoundError

from pkg_resources import get_distribution, DistributionNotFound
try:
VERSION = get_distribution(__name__).version
except DistributionNotFound:
try:
from ._version import version as VERSION
except ImportError:
raise ImportError(
"Failed to find (autogenerated) _version.py. "
"This might be because you are installing from GitHub's tarballs, "
"use the PyPI ones."
)
__version__ = VERSION
__version__ = version("redflag")
except PackageNotFoundError:
# package is not installed
pass
108 changes: 78 additions & 30 deletions src/redflag/distributions.py
@@ -1,7 +1,7 @@
"""
Functions related to understanding distributions.

Author: Matt Hall, scienxlab.com
Author: Matt Hall, scienxlab.org
Licence: Apache 2.0

Copyright 2022 Redflag contributors
@@ -34,7 +34,7 @@
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

from .utils import is_standardized
from .utils import is_standard_normal
from .utils import iter_groups


@@ -256,9 +256,9 @@ def wasserstein(X: ArrayLike,
except AttributeError:
# It's probably a 1D array or list.
pass

if stacked:
if not is_standardized(first):
if not is_standard_normal(first.flat):
warnings.warn('First group does not appear to be standardized.', stacklevel=2)
groups = np.hstack([len(dataset)*[i] for i, dataset in enumerate(X)])
X = np.vstack(X)
@@ -267,7 +267,7 @@
X = np.asarray(X)
if X.ndim != 2:
raise ValueError("X must be a 2D array-like.")

if groups is None:
raise ValueError("Must provide a 1D array of group labels if X is a 2D array.")
n_groups = np.unique(groups).size
@@ -303,9 +303,13 @@ def bw_silverman(a: ArrayLike) -> float:
"""
Calculate the Silverman bandwidth.

Silverman, BW (1981), "Using kernel density estimates to investigate
multimodality", Journal of the Royal Statistical Society. Series B Vol. 43,
No. 1 (1981), pp. 97-99.

Args:
a (array): The data.

Returns:
float: The Silverman bandwidth.

Expand All @@ -321,7 +325,7 @@ def bw_silverman(a: ArrayLike) -> float:
def bw_scott(a: ArrayLike) -> float:
"""
Calculate the Scott bandwidth.

Args:
a (array): The data.

@@ -350,12 +354,20 @@ def cv_kde(a: ArrayLike, n_bandwidths: int=20, cv: int=10) -> float:
Returns:
float. The optimal bandwidth.

Examples:
>>> data = [1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 2, 2, 2, 3, 3]
>>> abs(cv_kde(data, n_bandwidths=3, cv=3) - 0.290905379576344) < 1e-9
True
Example:
>>> rng = np.random.default_rng(42)
>>> data = rng.normal(size=100)
>>> cv_kde(data, n_bandwidths=3, cv=3)
0.5212113989811242
"""
a = np.asarray(a).reshape(-1, 1)
a = np.asarray(a)
if not is_standard_normal(a):
warnings.warn('Data does not appear to be standardized; the KDE may be a poor fit.', stacklevel=2)
if a.ndim == 1:
a = a.reshape(-1, 1)
elif a.ndim >= 2:
raise ValueError("Data must be 1D.")

silverman = bw_silverman(a)
scott = bw_scott(a)
start = min(silverman, scott)/2
@@ -378,22 +390,30 @@ def fit_kde(a: ArrayLike, bandwidth: float=1.0, kernel: str='gaussian') -> tuple
Returns:
tuple: (x, kde).

Examples:
>>> data = [-3, 1, -2, -2, -2, -2, 1, 2, 2, 1, 1, 2, 0, 0, 2, 2, 3, 3]
Example:
>>> rng = np.random.default_rng(42)
>>> data = rng.normal(size=100)
>>> x, kde = fit_kde(data)
>>> x[0]
-4.5
>>> abs(kde[0] - 0.011092399847113) < 1e-9
>>> abs(x[0] + 3.2124714013056916) < 1e-9
True
>>> abs(kde[0] - 0.014367259502733645) < 1e-9
True
>>> len(kde)
200
"""
a = np.asarray(a)
if not is_standard_normal(a):
warnings.warn('Data does not appear to be standardized; the KDE may be a poor fit.', stacklevel=2)
if a.ndim == 1:
a = a.reshape(-1, 1)
elif a.ndim >= 2:
raise ValueError("Data must be 1D.")
model = KernelDensity(kernel=kernel, bandwidth=bandwidth)
model.fit(a.reshape(-1, 1))
mima = 1.5 * np.abs(a).max()
model.fit(a)
mima = 1.5 * bandwidth * np.abs(a).max()
x = np.linspace(-mima, mima, 200).reshape(-1, 1)
log_density = model.score_samples(x)

return np.squeeze(x), np.exp(log_density)


Expand All @@ -403,18 +423,19 @@ def get_kde(a: ArrayLike, method: str='scott') -> tuple[np.ndarray, np.ndarray]:

Args:
a (array): The data.
method (str): The rule of thumb for bandwidth estimation.
Default 'scott'.
method (str): The rule of thumb for bandwidth estimation. Must be one
of 'silverman', 'scott', or 'cv'. Default 'scott'.

Returns:
tuple: (x, kde).

Examples:
>>> data = [-3, 1, -2, -2, -2, -2, 1, 2, 2, 1, 1, 2, 0, 0, 2, 2, 3, 3]
>>> rng = np.random.default_rng(42)
>>> data = rng.normal(size=100)
>>> x, kde = get_kde(data)
>>> x[0]
-4.5
>>> abs(kde[0] - 0.0015627693633590066) < 1e-09
>>> abs(x[0] + 1.354649738246933) < 1e-9
True
>>> abs(kde[0] - 0.162332012191087) < 1e-9
True
>>> len(kde)
200
@@ -462,20 +483,47 @@ def kde_peaks(a: ArrayLike, method: str='scott', threshold: float=0.1) -> tuple[

Args:
a (array): The data.
method (str): The rule of thumb for bandwidth estimation.
Default 'scott'.
method (str): The rule of thumb for bandwidth estimation. Must be one
of 'silverman', 'scott', or 'cv'. Default 'scott'.
threshold (float): The threshold for peak amplitude. Default 0.1.

Returns:
tuple: (x_peaks, y_peaks). Arrays representing the x and y values of
the peaks.

Examples:
>>> data = [-3, 1, -2, -2, -2, -2, 1, 2, 2, 1, 1, 2, 0, 0, 2, 2, 3, 3]
>>> rng = np.random.default_rng(42)
>>> data = np.concatenate([rng.normal(size=100)-2, rng.normal(size=100)+2])
>>> x_peaks, y_peaks = kde_peaks(data)
>>> x_peaks
array([-2.05778894, 1.74120603])
array([-1.67243035, 1.88998226])
>>> y_peaks
array([0.15929031, 0.24708215])
array([0.22014721, 0.19729456])
"""
return find_large_peaks(*get_kde(a, method), threshold=threshold)


def is_multimodal(a: ArrayLike, method: str='scott', threshold: float=0.1) -> bool:
"""
Test if the data is multimodal.

Args:
a (array): The data.
method (str): The rule of thumb for bandwidth estimation. Must be one
of 'silverman', 'scott', or 'cv'. Default 'scott'.
threshold (float): The threshold for peak amplitude. Default 0.1.

Returns:
bool: True if the data is multimodal.

Examples:
>>> rng = np.random.default_rng(42)
>>> data = rng.normal(size=100)
>>> is_multimodal(data)
False
>>> data = np.concatenate([rng.normal(size=100)-2, rng.normal(size=100)+2])
>>> is_multimodal(data)
True
"""
x, y = kde_peaks(a, method=method, threshold=threshold)
return len(x) > 1
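The new `is_multimodal` above reduces to: fit a Gaussian KDE, evaluate it on a grid, and count prominent peaks. A standalone sketch of that pipeline using scikit-learn and SciPy — the grid width, Scott's-rule bandwidth, and prominence threshold here mirror the spirit of the diff but are illustrative choices, not redflag's exact code:

```python
import numpy as np
from scipy.signal import find_peaks
from sklearn.neighbors import KernelDensity

def count_kde_peaks(a, threshold=0.1):
    """Count KDE peaks whose prominence exceeds `threshold` of the density max."""
    a = np.asarray(a).reshape(-1, 1)
    bw = a.std() * a.size ** (-1 / 5)  # Scott's rule of thumb for 1D data
    kde = KernelDensity(kernel='gaussian', bandwidth=bw).fit(a)
    x = np.linspace(a.min() - 1, a.max() + 1, 200).reshape(-1, 1)
    y = np.exp(kde.score_samples(x))   # score_samples returns log-density
    peaks, _ = find_peaks(y, prominence=threshold * y.max())
    return len(peaks)

rng = np.random.default_rng(42)
unimodal = rng.normal(size=500)
bimodal = np.concatenate([rng.normal(size=300) - 2, rng.normal(size=300) + 2])
print(count_kde_peaks(unimodal), count_kde_peaks(bimodal))
```

Filtering on peak prominence rather than raw height avoids counting small wiggles near the top of a single broad mode as separate peaks.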
2 changes: 1 addition & 1 deletion src/redflag/imbalance.py
@@ -7,7 +7,7 @@
Pattern Recognition Letters 98 (2017)
https://doi.org/10.1016/j.patrec.2017.08.002

Author: Matt Hall, scienxlab.com
Author: Matt Hall, scienxlab.org
Licence: Apache 2.0

Copyright 2022 Redflag contributors