PyThresh is a comprehensive and scalable Python toolkit for
thresholding outlier detection likelihood scores in
univariate/multivariate data. It has been written to work in tandem with
PyOD and has similar syntax and data structures. However, it is not
limited to this single library. PyThresh is meant to threshold
likelihood scores generated by an outlier detector. Thresholding these
scores removes the need to set a contamination level or to guess the
number of outliers in the dataset beforehand. These non-parametric
methods were written to reduce the user's guesswork and instead rely on
statistics to threshold outlier likelihood scores. For thresholding to
be applied correctly, the outlier detection likelihood scores must
follow this rule: the higher the score, the higher the probability that
the data point is an outlier. All threshold functions return a binary
array where inliers and outliers are represented by 0 and 1,
respectively. PyThresh includes more than 30 thresholding algorithms,
ranging from simple statistical analysis, such as the Z-score, to more
complex mathematical methods involving graph theory and topology.
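To make the output convention concrete, here is a toy illustration. The fixed cutoff below is purely for demonstration and is not how any PyThresh thresholder chooses its threshold; it only shows the score-in, binary-labels-out format.

import numpy as np

# toy likelihood scores from some detector: higher means more likely an outlier
scores = np.array([0.05, 0.12, 0.09, 0.91, 0.08])

# a PyThresh thresholder would pick the cutoff automatically; a value is
# hard-coded here only to show the binary output format
cutoff = 0.5
labels = (scores > cutoff).astype(int)

print(labels)  # [0 0 0 1 0] -> inliers are 0, the single outlier is 1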
Documentation & Citing
Visit PyThresh Docs for full
documentation or see below for a quickstart installation and usage
example.
Outlier Detection Thresholding with 7 Lines of Code:
# train the KNN detector
from pyod.models.knn import KNN
from pythresh.thresholds.filter import FILTER

clf = KNN()
clf.fit(X_train)

# get outlier scores
decision_scores = clf.decision_scores_  # raw outlier scores on the train data

# get outlier labels
thres = FILTER()
labels = thres.eval(decision_scores)
or using multiple outlier detection score sets
# train multiple detectors
import numpy as np

from pyod.models.knn import KNN
from pyod.models.pca import PCA
from pyod.models.iforest import IForest
from pythresh.thresholds.filter import FILTER

clfs = [KNN(), IForest(), PCA()]

# get outlier scores for each detector
scores = [clf.fit(X_train).decision_scores_ for clf in clfs]
scores = np.vstack(scores).T

# get outlier labels
thres = FILTER()
labels = thres.eval(scores)
Installation
It is recommended to use pip or conda for installation:
pip install pythresh # normal install
pip install --upgrade pythresh # or update if needed
conda install -c conda-forge pythresh
Alternatively, you can get the version with the latest updates by
cloning the repo and installing it from source:
git clone https://github.com/KulikDM/pythresh.git
cd pythresh
pip install .
Optional dependencies (only needed for certain thresholders and utilities):
joblib>=0.14.1 (used in the META thresholder and RANK)
pandas (used in the META thresholder)
torch (used in the VAE thresholder)
tqdm (used in the VAE thresholder)
xgboost>=2.0.0 (used in RANK)
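For example, the optional dependencies for a specific thresholder can be installed separately (package names taken from the list above):

pip install torch tqdm       # enables the VAE thresholder
pip install xgboost joblib   # enables RANK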
API Cheatsheet
eval(score): evaluate a single set or multiple sets of outlier
detection likelihood scores (a short usage sketch follows the attribute
list below).
Key Attributes of threshold:
thresh_: Return the threshold value that separates inliers from
outliers. Outliers are considered all values above this threshold
value. Note the threshold value has been derived from likelihood
scores normalized between 0 and 1.
confidence_interval_: Return the lower and upper confidence
interval of the contamination level. Only applies to the COMB
thresholder.
dscores_: 1D array of the TruncatedSVD decomposed decision scores
if multiple outlier detector score sets are passed.
mixture_: Fitted mixture model class of the selected model used
for thresholding. Only applies to MIXMOD. Attributes include:
components, weights, params. Functions include: fit, loglikelihood,
pdf, and posterior.
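As a rough sketch of the cheatsheet above, the example below uses placeholder random arrays in place of real detector scores (normally clf.decision_scores_ from PyOD) and assumes the MIXMOD thresholder follows the same import pattern as FILTER.

import numpy as np

from pythresh.thresholds.filter import FILTER
from pythresh.thresholds.mixmod import MIXMOD

# placeholder likelihood scores standing in for real detector output
single_scores = np.random.rand(1000)    # one detector
multi_scores = np.random.rand(1000, 3)  # e.g. KNN, IForest, PCA stacked column-wise

# single score set
thres = FILTER()
labels = thres.eval(single_scores)  # binary labels: 0 = inlier, 1 = outlier
print(thres.thresh_)                # cutoff on the scores normalized to [0, 1]

# multiple score sets
mix = MIXMOD()
labels_multi = mix.eval(multi_scores)
print(mix.dscores_.shape)           # 1D TruncatedSVD decomposed decision scores
print(mix.mixture_)                 # fitted mixture model (MIXMOD only)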
A comparison among the implemented models, as well as a general
implementation example, is available below.
Additional benchmarking has been
performed on all the thresholders, and it was found that the MIXMOD
thresholder performed best, while the CLF thresholder provided the
smallest uncertainty about its mean and was the most robust (its least
accurate prediction was still the best among the thresholders).
However, for interpretability and general performance, the MIXMOD,
FILTER, and META thresholders are good fits.
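For example, a minimal sketch comparing these recommended thresholders on the same score set; the META and MIXMOD import paths are assumed to follow the same pattern as FILTER, and decision_scores is assumed to be the output of a fitted PyOD detector.

from pythresh.thresholds.filter import FILTER
from pythresh.thresholds.meta import META
from pythresh.thresholds.mixmod import MIXMOD

# decision_scores: raw likelihood scores from a fitted detector,
# e.g. clf.decision_scores_ from PyOD
for thres in (MIXMOD(), FILTER(), META()):
    labels = thres.eval(decision_scores)
    print(type(thres).__name__, int(labels.sum()), "outliers flagged")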
Further utilities are available to assist in selecting the most
suitable outlier detection and thresholding methods (ranking), as well
as in determining the confidence of the selected thresholding method
(thresholding confidence).
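A rough sketch of how the ranking utility might be invoked is shown below; the RANK import path, constructor arguments, and return format are assumptions based on the utility's name, so please verify them against the PyThresh docs before use.

from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pythresh.thresholds.filter import FILTER
from pythresh.utils.rank import RANK  # assumed import path

clfs = [KNN(), IForest()]
thres = FILTER()

# assumed signature: rank the candidate detectors for the data in X_train
ranker = RANK(clfs, thres)
rankings = ranker.eval(X_train)
print(rankings)  # assumed: detectors ordered from most to least suitable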
For Jupyter Notebooks, please navigate to notebooks.
A quick look at the performance of all the thresholders can be found in
"/notebooks/Compare All Models.ipynb".
Contributing
Anyone is welcome to contribute to PyThresh:
Please share your ideas and ask questions by opening an issue.
To contribute, first check the Issue list for the "help wanted" tag
and comment on the one that you are interested in. The issue will
then be assigned to you.
If the bug, feature, or documentation change is novel (not in the
Issue list), you can either log a new issue or create a pull request
for the new changes.
To start, fork the main branch and add your
improvement/modification/fix.
To make sure the code has the same style and standard, please refer
to qmcd.py as an example.
Create a pull request to the main branch and follow the pull
request template.
Please make sure that all code changes are accompanied by proper
new/updated test functions. Automatic tests will be triggered. Before
the pull request can be merged, make sure that all the tests pass.
References
Please Note not all references' exact methods have been employed in
PyThresh. Rather, the references serve to demonstrate the validity of
the threshold types available in PyThresh.