The goal of this project is to implement the original Isolation Forest algorithm by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou as part of the MSDS689 course. (A later version of this work is also available: Isolation-based Anomaly Detection.) There are two general approaches to anomaly detection:
- Model what normal looks like and then flag observations that deviate from it
- Assume anomalies are few and different, which makes them easy to separate in tree-space
The isolation forest algorithm is original and beautiful in its simplicity, and it also seems to work very well, with a few known weaknesses. The academic paper is very readable, so you should start there.
This implementation is tested on the following datasets:
- Kaggle credit card fraud competition data set; download and unzip to get `creditcard.csv`
- Cancer data; get it into `cancer.csv` by executing the `savecancer.py` script that I provide
- `http.zip`; download and unzip to get `http.csv`
Using plot_anomalies.py, we can see the results of the isolation forest trying to detect anomalies. These data sets all have known targets indicating normal versus anomaly, but this information is only used during testing and not during training.
[Plots: http.csv, 200 trees, 99% desired TPR; creditcard.csv, 200 trees, 80% and 90% desired TPR; cancer, 300 trees, 70% and 80% desired TPR]
The algorithm is based on the idea that anomalies are few and different, which makes them much easier to isolate with random splits in the feature range than normal instances. The algorithm first draws a small random subsample of observations. It then repeatedly selects a random feature column and a random split value between that feature's min and max, partitioning the data and growing the tree until a termination condition is reached (a height limit, or a node with at most one observation). Note that the current algorithm only works for continuous features. We repeat this process to grow a forest of trees, which reduces variance. The anomaly score of an observation is computed from its average path length across the forest, normalized by the expected path length of an unsuccessful search in a binary search tree with the same number of observations; anomalies are expected to have shorter paths than normal instances. For convenience, here are the algorithms extracted from the Liu et al. paper:
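The steps above can be sketched in a few dozen lines. This is a minimal illustration of the ideas, not the course implementation in `iforest.py`; the function names, dict-based tree nodes, and default parameters here are my own choices.

```python
import numpy as np

def c(n):
    """Average path length of an unsuccessful BST search over n observations."""
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + 0.5772156649  # H(i) ~ ln(i) + Euler's constant
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def grow_tree(X, height, max_height):
    # Terminate at the height limit or when at most one observation remains
    if height >= max_height or len(X) <= 1:
        return {"size": len(X)}
    q = np.random.randint(X.shape[1])            # random feature column
    lo, hi = X[:, q].min(), X[:, q].max()
    if lo == hi:
        return {"size": len(X)}
    p = np.random.uniform(lo, hi)                # random split in [min, max]
    return {"q": q, "p": p,
            "left": grow_tree(X[X[:, q] < p], height + 1, max_height),
            "right": grow_tree(X[X[:, q] >= p], height + 1, max_height)}

def path_length(x, node, height=0):
    if "size" in node:                           # external node: adjust by c(size)
        return height + c(node["size"])
    child = node["left"] if x[node["q"]] < node["p"] else node["right"]
    return path_length(x, child, height + 1)

def fit_forest(X, n_trees=100, sample_size=256):
    max_height = int(np.ceil(np.log2(sample_size)))  # expected height of subsample tree
    trees = []
    for _ in range(n_trees):
        idx = np.random.choice(len(X), min(sample_size, len(X)), replace=False)
        trees.append(grow_tree(X[idx], 0, max_height))
    return trees

def anomaly_score(X, trees, sample_size=256):
    # s(x, n) = 2^(-E[h(x)] / c(n)); scores near 1 indicate anomalies
    Eh = np.array([np.mean([path_length(x, t) for t in trees]) for x in X])
    return 2.0 ** (-Eh / c(sample_size))
```

For example, a point far outside a tight cluster tends to be isolated after only a split or two, so its average path length is short and its score is high relative to the cluster points.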
Note that we also implement an improved version of this algorithm that can handle noisy labels by making the split point unbalanced, rather than choosing the split value uniformly between the min and max. The implementation can be found in iforest.py.
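One way to picture an unbalanced split is to bias the draw toward the extremes of the feature range so that small groups of outlying values get peeled off sooner. The `unbalanced_split` helper below is purely illustrative (it uses a U-shaped Beta(0.5, 0.5) draw); the actual scheme used in iforest.py may differ.

```python
import numpy as np

def unbalanced_split(lo, hi, rng=np.random):
    """Illustrative biased split: draw from Beta(0.5, 0.5), whose density
    concentrates near 0 and 1, so split points land near lo or hi more
    often than a uniform draw would."""
    r = rng.beta(0.5, 0.5)
    return lo + r * (hi - lo)
```

Compared with a uniform draw, these split points cluster near the ends of the `[lo, hi]` interval, which is the "unbalanced" behavior described above.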