The goal of this project is to implement the original Isolation Forest algorithm by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou as part of the MSDS689 course. (A later version of this work is also available: Isolation-based Anomaly Detection.) There are two general approaches to anomaly detection:
- Model what normal looks like and then flag observations that deviate from it
- Assume anomalies are few and different, which makes them easy to separate in tree-space
The isolation forest algorithm is original and beautiful in its simplicity, and it also seems to work very well, with a few known weaknesses. The academic paper is very readable, so you should start there.
This implementation is tested on the following datasets:
- Kaggle credit card fraud competition data set; download and unzip to get `creditcard.csv`
- Cancer data; get it into `cancer.csv` by executing the `savecancer.py` script that I provide
- `http.zip`; download and unzip to get `http.csv`
Using plot_anomalies.py, we can see the results of the isolation forest trying to detect anomalies. These data sets all have known targets indicating normal versus anomaly, but this information is only used during testing and not during training.
[Plots: http.csv, 200 trees, 99% desired TPR; creditcard.csv, 200 trees, 80% and 90% desired TPR; cancer, 300 trees, 70% and 80% desired TPR]
The algorithm is based on the idea that anomalies are few and different, which makes them much easier to isolate with random splits in the feature range than normal instances. The algorithm first draws a small random subsample of observations. It then repeatedly selects a random feature column and a random split value between that feature's min and max, partitioning the data and growing the tree until a termination condition is reached (a height limit, or a node with at most one observation). Note that the current algorithm only works for continuous features. We repeat this process to grow a forest of trees, which reduces variance. The anomaly score of an observation is computed from its average path length across the forest, normalized by the expected path length of an unsuccessful search in a binary search tree with the same number of observations; anomalies are expected to have shorter paths than normal instances. For convenience, here are the algorithms extracted from the Liu et al. paper:
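The steps above can be sketched in a few dozen lines. This is a minimal illustration of the ideas, not the course implementation in `iforest.py`; the function names, dict-based tree nodes, and default parameters here are my own choices.

```python
import numpy as np

def c(n):
    """Average path length of an unsuccessful BST search over n observations."""
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + 0.5772156649  # H(i) ~ ln(i) + Euler's constant
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def grow_tree(X, height, max_height):
    # Terminate at the height limit or when at most one observation remains
    if height >= max_height or len(X) <= 1:
        return {"size": len(X)}
    q = np.random.randint(X.shape[1])            # random feature column
    lo, hi = X[:, q].min(), X[:, q].max()
    if lo == hi:
        return {"size": len(X)}
    p = np.random.uniform(lo, hi)                # random split in [min, max]
    return {"q": q, "p": p,
            "left": grow_tree(X[X[:, q] < p], height + 1, max_height),
            "right": grow_tree(X[X[:, q] >= p], height + 1, max_height)}

def path_length(x, node, height=0):
    if "size" in node:                           # external node: adjust by c(size)
        return height + c(node["size"])
    child = node["left"] if x[node["q"]] < node["p"] else node["right"]
    return path_length(x, child, height + 1)

def fit_forest(X, n_trees=100, sample_size=256):
    max_height = int(np.ceil(np.log2(sample_size)))  # expected height of subsample tree
    trees = []
    for _ in range(n_trees):
        idx = np.random.choice(len(X), min(sample_size, len(X)), replace=False)
        trees.append(grow_tree(X[idx], 0, max_height))
    return trees

def anomaly_score(X, trees, sample_size=256):
    # s(x, n) = 2^(-E[h(x)] / c(n)); scores near 1 indicate anomalies
    Eh = np.array([np.mean([path_length(x, t) for t in trees]) for x in X])
    return 2.0 ** (-Eh / c(sample_size))
```

For example, a point far outside a tight cluster tends to be isolated after only a split or two, so its average path length is short and its score is high relative to the cluster points.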
Note that we also implement an improved version of this algorithm that can handle noisy labels by making the split point unbalanced, rather than choosing the split value uniformly between the min and max. The implementation can be found in iforest.py.
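One way to picture an unbalanced split is to bias the draw toward the extremes of the feature range so that small groups of outlying values get peeled off sooner. The `unbalanced_split` helper below is purely illustrative (it uses a U-shaped Beta(0.5, 0.5) draw); the actual scheme used in iforest.py may differ.

```python
import numpy as np

def unbalanced_split(lo, hi, rng=np.random):
    """Illustrative biased split: draw from Beta(0.5, 0.5), whose density
    concentrates near 0 and 1, so split points land near lo or hi more
    often than a uniform draw would."""
    r = rng.beta(0.5, 0.5)
    return lo + r * (hi - lo)
```

Compared with a uniform draw, these split points cluster near the ends of the `[lo, hi]` interval, which is the "unbalanced" behavior described above.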