
Prepare for release #50

Merged: 16 commits, Sep 3, 2023
7 changes: 4 additions & 3 deletions .github/workflows/build-test.yml
@@ -14,16 +14,17 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: ["3.8", "3.9", "3.10", "3.11"]
python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]

steps:

- uses: actions/checkout@v2
- uses: actions/checkout@v3

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
allow-prereleases: true

- name: Install dependencies
run: |
17 changes: 14 additions & 3 deletions CHANGELOG.md
@@ -1,8 +1,19 @@
# Changelog

## 0.1.11, in development

- Coming soon...
## 0.2.0, 3 September 2023

- Moved to something more closely resembling semantic versioning, which is the main reason this is version 0.2.0.
- Builds and tests on Python 3.11 have been successful, so this version is now supported. Started testing on Python 3.12, which is not yet supported.
- Added custom 'alarm' `Detector`, which can be instantiated with a function and a warning to emit when the function returns True for a 1D array. You can easily write your own detectors with this class.
- Added `make_detector_pipeline()` which can take sequences of functions and warnings (or a mapping of functions to warnings) and returns a `scikit-learn.pipeline.Pipeline` containing a `Detector` for each function.
- Added `RegressionMultimodalDetector` to allow detection of non-unimodal distributions in features, when considered across the entire dataset. (Coming soon, a similar detector for classification tasks that will partition the data by class.)
- Redefined `is_standardized` (deprecated) as `is_standard_normal`, which implements the Kolmogorov–Smirnov test. It seems more reliable than assuming the data will have a mean of almost exactly 0 and standard deviation of exactly 1, when all we really care about is that the feature is roughly normal.
- Changed the wording slightly in the existing detector warning messages.
- No longer warning if `y` is `None` in, e.g., `ImportanceDetector`, since you most likely know this.
- Some changes to `ImportanceDetector`. It now uses KNN estimators instead of SVMs as the third measure of importance; the SVMs were too unstable, causing numerical issues. It also now requires that the number of important features is less than the total number of features to be triggered. So if you have 2 features and both are important, it does not trigger.
- Improved `is_continuous()` which was erroneously classifying integer arrays with many consecutive values as non-continuous.
- Added a `Tutorial.ipynb` notebook to the docs.
- Added a **Copy** button to code blocks in the docs.
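The `is_standard_normal` change above boils down to a one-sample Kolmogorov–Smirnov test of the data against N(0, 1). A minimal sketch of that idea with SciPy — the function name and the 0.05 threshold here are illustrative assumptions, not redflag's exact implementation:

```python
import numpy as np
from scipy.stats import kstest

def looks_standard_normal(a, alpha=0.05):
    """True if a 1D sample is plausibly N(0, 1), by a one-sample K-S test."""
    # kstest compares the empirical CDF against the standard normal CDF.
    _, p = kstest(np.asarray(a).ravel(), 'norm')
    return bool(p > alpha)

rng = np.random.default_rng(42)
print(looks_standard_normal(rng.normal(size=500)))           # standardized data
print(looks_standard_normal(10 * rng.normal(size=500) + 3))  # shifted and scaled
```

Testing the distribution directly is more robust than checking that the sample mean is near 0 and the standard deviation near 1, which is what the deprecated `is_standardized` did.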


## 0.1.10, 21 November 2022
2 changes: 0 additions & 2 deletions README.md
@@ -8,8 +8,6 @@

🚩 `redflag` aims to be an automatic safety net for machine learning datasets. The vision is to accept input of a Pandas `DataFrame` or NumPy `ndarray` (one for each of the input `X` and target `y` in a machine learning task). `redflag` will provide an analysis of each feature, and of the target, including aspects such as class imbalance, leakage, outliers, anomalous data patterns, threats to the IID assumption, and so on. The goal is to complement other projects like `pandas-profiling` and `greatexpectations`.

⚠️ **This project is very rough and does not do much yet. The API will very likely change without warning. Please consider contributing!**


## Installation

5 changes: 3 additions & 2 deletions docs/conf.py
@@ -48,11 +48,12 @@ def setup(app):
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.githubpages',
'sphinxcontrib.apidoc',
'sphinx.ext.githubpages',
'sphinx.ext.napoleon',
'myst_nb',
'sphinx.ext.coverage',
'sphinx_copybutton',
'myst_nb',
]

myst_enable_extensions = ["dollarmath", "amsmath"]
5 changes: 3 additions & 2 deletions docs/index.rst
@@ -41,6 +41,7 @@ User guide
installation
_notebooks/Basic_usage.ipynb
_notebooks/Using_redflag_with_sklearn.ipynb
_notebooks/Tutorial.ipynb


API reference
@@ -82,5 +83,5 @@ Indices and tables
PyPI releases <https://pypi.org/project/redflag/>
Code in GitHub <https://github.com/scienxlab/redflag>
Issue tracker <https://github.com/scienxlab/redflag/issues>
Community guidelines <https://scienxlab.com/community>
Scienxlab <https://scienxlab.com>
Community guidelines <https://scienxlab.org/community>
Scienxlab <https://scienxlab.org>
35 changes: 0 additions & 35 deletions docs/make.bat

This file was deleted.

214 changes: 131 additions & 83 deletions docs/notebooks/Tutorial.ipynb

Large diffs are not rendered by default.

156 changes: 146 additions & 10 deletions docs/notebooks/Using_redflag_with_sklearn.ipynb

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/post_process_html.py
@@ -26,7 +26,7 @@ def add_analytics(html):
"""
s = r'</head>'
pattern = re.compile(s)
new_s = '<script defer data-domain="scienxlab.com" src="https://plausible.io/js/plausible.js"></script></head>'
new_s = '<script defer data-domain="scienxlab.org" src="https://plausible.io/js/plausible.js"></script></head>'
html = pattern.sub(new_s, html)

return html
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -24,6 +24,7 @@ classifiers = [
]

dependencies = [
"numpy<2.0", # NumPy 2 will likely break some things.
"scipy!=1.10.0", # Bug in stats.powerlaw.
"scikit-learn",
]
@@ -46,7 +47,7 @@ dev = [
]

[project.urls]
"documentation" = "https://scienxlab.github.io/redflag"
"documentation" = "https://scienxlab.org/redflag"
"repository" = "https://github.com/scienxlab/redflag"

[tool.setuptools_scm]
18 changes: 6 additions & 12 deletions src/redflag/__init__.py
@@ -11,17 +11,11 @@
from .importance import *
from .outliers import *

# From https://github.com/pypa/setuptools_scm
from importlib.metadata import version, PackageNotFoundError

from pkg_resources import get_distribution, DistributionNotFound
try:
VERSION = get_distribution(__name__).version
except DistributionNotFound:
try:
from ._version import version as VERSION
except ImportError:
raise ImportError(
"Failed to find (autogenerated) _version.py. "
"This might be because you are installing from GitHub's tarballs, "
"use the PyPI ones."
)
__version__ = VERSION
__version__ = version("redflag")
except PackageNotFoundError:
# package is not installed
pass
108 changes: 78 additions & 30 deletions src/redflag/distributions.py
@@ -1,7 +1,7 @@
"""
Functions related to understanding distributions.

Author: Matt Hall, scienxlab.com
Author: Matt Hall, scienxlab.org
Licence: Apache 2.0

Copyright 2022 Redflag contributors
@@ -34,7 +34,7 @@
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

from .utils import is_standardized
from .utils import is_standard_normal
from .utils import iter_groups


@@ -256,9 +256,9 @@ def wasserstein(X: ArrayLike,
except AttributeError:
# It's probably a 1D array or list.
pass

if stacked:
if not is_standardized(first):
if not is_standard_normal(first.flat):
warnings.warn('First group does not appear to be standardized.', stacklevel=2)
groups = np.hstack([len(dataset)*[i] for i, dataset in enumerate(X)])
X = np.vstack(X)
@@ -267,7 +267,7 @@
X = np.asarray(X)
if X.ndim != 2:
raise ValueError("X must be a 2D array-like.")

if groups is None:
raise ValueError("Must provide a 1D array of group labels if X is a 2D array.")
n_groups = np.unique(groups).size
@@ -303,9 +303,13 @@ def bw_silverman(a: ArrayLike) -> float:
"""
Calculate the Silverman bandwidth.

Silverman, BW (1981), "Using kernel density estimates to investigate
multimodality", Journal of the Royal Statistical Society. Series B Vol. 43,
No. 1 (1981), pp. 97-99.

Args:
a (array): The data.

Returns:
float: The Silverman bandwidth.

Expand All @@ -321,7 +325,7 @@ def bw_silverman(a: ArrayLike) -> float:
def bw_scott(a: ArrayLike) -> float:
"""
Calculate the Scott bandwidth.

Args:
a (array): The data.

@@ -350,12 +354,20 @@ def cv_kde(a: ArrayLike, n_bandwidths: int=20, cv: int=10) -> float:
Returns:
float. The optimal bandwidth.

Examples:
>>> data = [1, 1, 1, 2, 2, 1, 1, 2, 2, 3, 2, 2, 2, 3, 3]
>>> abs(cv_kde(data, n_bandwidths=3, cv=3) - 0.290905379576344) < 1e-9
True
Example:
>>> rng = np.random.default_rng(42)
>>> data = rng.normal(size=100)
>>> cv_kde(data, n_bandwidths=3, cv=3)
0.5212113989811242
"""
a = np.asarray(a).reshape(-1, 1)
a = np.asarray(a)
if not is_standard_normal(a):
warnings.warn('Data does not appear to be standardized; the KDE may be a poor fit.', stacklevel=2)
if a.ndim == 1:
a = a.reshape(-1, 1)
elif a.ndim >= 2:
raise ValueError("Data must be 1D.")

silverman = bw_silverman(a)
scott = bw_scott(a)
start = min(silverman, scott)/2
@@ -378,22 +390,30 @@ def fit_kde(a: ArrayLike, bandwidth: float=1.0, kernel: str='gaussian') -> tuple
Returns:
tuple: (x, kde).

Examples:
>>> data = [-3, 1, -2, -2, -2, -2, 1, 2, 2, 1, 1, 2, 0, 0, 2, 2, 3, 3]
Example:
>>> rng = np.random.default_rng(42)
>>> data = rng.normal(size=100)
>>> x, kde = fit_kde(data)
>>> x[0]
-4.5
>>> abs(kde[0] - 0.011092399847113) < 1e-9
>>> abs(x[0] + 3.2124714013056916) < 1e-9
True
>>> abs(kde[0] - 0.014367259502733645) < 1e-9
True
>>> len(kde)
200
"""
a = np.asarray(a)
if not is_standard_normal(a):
warnings.warn('Data does not appear to be standardized; the KDE may be a poor fit.', stacklevel=2)
if a.ndim == 1:
a = a.reshape(-1, 1)
elif a.ndim >= 2:
raise ValueError("Data must be 1D.")
model = KernelDensity(kernel=kernel, bandwidth=bandwidth)
model.fit(a.reshape(-1, 1))
mima = 1.5 * np.abs(a).max()
model.fit(a)
mima = 1.5 * bandwidth * np.abs(a).max()
x = np.linspace(-mima, mima, 200).reshape(-1, 1)
log_density = model.score_samples(x)

return np.squeeze(x), np.exp(log_density)


Expand All @@ -403,18 +423,19 @@ def get_kde(a: ArrayLike, method: str='scott') -> tuple[np.ndarray, np.ndarray]:

Args:
a (array): The data.
method (str): The rule of thumb for bandwidth estimation.
Default 'scott'.
method (str): The rule of thumb for bandwidth estimation. Must be one
of 'silverman', 'scott', or 'cv'. Default 'scott'.

Returns:
tuple: (x, kde).

Examples:
>>> data = [-3, 1, -2, -2, -2, -2, 1, 2, 2, 1, 1, 2, 0, 0, 2, 2, 3, 3]
>>> rng = np.random.default_rng(42)
>>> data = rng.normal(size=100)
>>> x, kde = get_kde(data)
>>> x[0]
-4.5
>>> abs(kde[0] - 0.0015627693633590066) < 1e-09
>>> abs(x[0] + 1.354649738246933) < 1e-9
True
>>> abs(kde[0] - 0.162332012191087) < 1e-9
True
>>> len(kde)
200
@@ -462,20 +483,47 @@ def kde_peaks(a: ArrayLike, method: str='scott', threshold: float=0.1) -> tuple[

Args:
a (array): The data.
method (str): The rule of thumb for bandwidth estimation.
Default 'scott'.
method (str): The rule of thumb for bandwidth estimation. Must be one
of 'silverman', 'scott', or 'cv'. Default 'scott'.
threshold (float): The threshold for peak amplitude. Default 0.1.

Returns:
tuple: (x_peaks, y_peaks). Arrays representing the x and y values of
the peaks.

Examples:
>>> data = [-3, 1, -2, -2, -2, -2, 1, 2, 2, 1, 1, 2, 0, 0, 2, 2, 3, 3]
>>> rng = np.random.default_rng(42)
>>> data = np.concatenate([rng.normal(size=100)-2, rng.normal(size=100)+2])
>>> x_peaks, y_peaks = kde_peaks(data)
>>> x_peaks
array([-2.05778894, 1.74120603])
array([-1.67243035, 1.88998226])
>>> y_peaks
array([0.15929031, 0.24708215])
array([0.22014721, 0.19729456])
"""
return find_large_peaks(*get_kde(a, method), threshold=threshold)


def is_multimodal(a: ArrayLike, method: str='scott', threshold: float=0.1) -> bool:
"""
Test if the data is multimodal.

Args:
a (array): The data.
method (str): The rule of thumb for bandwidth estimation. Must be one
of 'silverman', 'scott', or 'cv'. Default 'scott'.
threshold (float): The threshold for peak amplitude. Default 0.1.

Returns:
bool: True if the data is multimodal.

Examples:
>>> rng = np.random.default_rng(42)
>>> data = rng.normal(size=100)
>>> is_multimodal(data)
False
>>> data = np.concatenate([rng.normal(size=100)-2, rng.normal(size=100)+2])
>>> is_multimodal(data)
True
"""
x, y = kde_peaks(a, method=method, threshold=threshold)
return len(x) > 1
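The new `is_multimodal` above reduces to: fit a Gaussian KDE, evaluate it on a grid, and count prominent peaks. A standalone sketch of that pipeline using scikit-learn and SciPy — the grid width, Scott's-rule bandwidth, and prominence threshold here mirror the spirit of the diff but are illustrative choices, not redflag's exact code:

```python
import numpy as np
from scipy.signal import find_peaks
from sklearn.neighbors import KernelDensity

def count_kde_peaks(a, threshold=0.1):
    """Count KDE peaks whose prominence exceeds `threshold` of the density max."""
    a = np.asarray(a).reshape(-1, 1)
    bw = a.std() * a.size ** (-1 / 5)  # Scott's rule of thumb for 1D data
    kde = KernelDensity(kernel='gaussian', bandwidth=bw).fit(a)
    x = np.linspace(a.min() - 1, a.max() + 1, 200).reshape(-1, 1)
    y = np.exp(kde.score_samples(x))   # score_samples returns log-density
    peaks, _ = find_peaks(y, prominence=threshold * y.max())
    return len(peaks)

rng = np.random.default_rng(42)
unimodal = rng.normal(size=500)
bimodal = np.concatenate([rng.normal(size=300) - 2, rng.normal(size=300) + 2])
print(count_kde_peaks(unimodal), count_kde_peaks(bimodal))
```

Filtering on peak prominence rather than raw height avoids counting small wiggles near the top of a single broad mode as separate peaks.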
2 changes: 1 addition & 1 deletion src/redflag/imbalance.py
@@ -7,7 +7,7 @@
Pattern Recognition Letters 98 (2017)
https://doi.org/10.1016/j.patrec.2017.08.002

Author: Matt Hall, scienxlab.com
Author: Matt Hall, scienxlab.org
Licence: Apache 2.0

Copyright 2022 Redflag contributors