Adding Local Outlier Factor (LOF) algorithms #706
Hello, thank you for your motivation.
I see no problem. In fact, online anomaly detection is in demand, so a contribution would be very welcome. The first step is to see to what extent LOF can be performed online. We don't want semi-batch approaches that aren't truly online. Ideally, the complexity of
I'm sure you'll find the "good" ones by yourself. There aren't that many. If you do find one that we don't have, feel free to add it to the Meta: you can make use of Markdown syntax to insert links. You don't have to copy/paste in plain text.
Overview
I am working on my Master's thesis on outlier detection in streaming data. The only anomaly detection method I've found so far in river is half space trees. Per the documentation, “Half space trees work well when anomalies are spread out but do not work well if anomalies are packed together in windows.”
Problem
I have been reading about various Local Outlier Factor (LOF) implementations for streaming data. 1 and 2 seem fairly promising. Adding an LOF algorithm to improve anomaly detection is already on the river roadmap, so it seems like a desired feature.
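For context, here is a minimal sketch of what LOF computes, written in plain Python. It follows the standard batch definition (k-distance, reachability distance, local reachability density), not the streaming variants from 1 and 2. The function names `euclidean` and `lof_scores` are made up for illustration, and the brute-force neighbour search is only suitable for tiny inputs.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def lof_scores(points, k=2):
    """Batch LOF score for every point (illustrative, O(n^2 log n))."""
    n = len(points)
    k_dist = []      # distance from each point to its k-th nearest neighbour
    neighbors = []   # indices inside the k-distance neighbourhood (ties included)
    for i in range(n):
        dists = sorted(
            (euclidean(points[i], points[j]), j) for j in range(n) if j != i
        )
        kd = dists[k - 1][0]
        k_dist.append(kd)
        neighbors.append([j for d, j in dists if d <= kd])

    def reach_dist(i, j):
        # Reachability distance of point i with respect to point j.
        return max(k_dist[j], euclidean(points[i], points[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        mean_reach = sum(reach_dist(i, j) for j in neighbors[i]) / len(neighbors[i])
        # mean_reach is 0 only when a point has more than k exact duplicates;
        # the density is then infinite and the resulting score may be NaN.
        lrd.append(1.0 / mean_reach if mean_reach > 0 else math.inf)

    # LOF: average neighbour density divided by the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Four points on a unit square plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
```

Scores close to 1 mean a point is about as dense as its neighbours, while scores well above 1 flag outliers. Every score depends on the neighbourhoods of other points, so a new observation can change the scores of points already seen. That dependency is what a truly online variant has to handle without recomputing everything.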
Proposal
Use a few standard streaming benchmark datasets as well as my group's private data to perform some tests.
Methods
For the first set of tests, I plan to include existing streaming methods (e.g. river’s half space trees outlier detection method, pysad, etc.). Then I plan to compare those results to results obtained by running the entire dataset through a non-streaming approach (e.g. pyOD methods, statistical methods, etc.) to see whether the current streaming approaches provide comparable results. My hypothesis is that they will not in certain cases, which leads to the next part.
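The comparison could be set up roughly as follows. This is only a sketch under stated assumptions: `RunningZScore` is a hypothetical stand-in detector (not part of river, pysad or pyOD), `prequential_scores`, `batch_zscores` and `roc_auc` are illustrative helper names, and the data is a toy series made up for the example. The `score_one` / `learn_one` method names mirror the convention used by river's anomaly detectors, although the inputs here are plain floats rather than feature dicts.

```python
import math
import statistics


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (anomaly, normal) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # A tie between an anomaly and a normal point counts as half a win.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


class RunningZScore:
    """Hypothetical streaming detector: |z-score| against a running mean/std."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford's algorithm)

    def score_one(self, x):
        if self.n < 2:
            return 0.0  # not enough history to estimate a spread
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) / std if std > 0 else 0.0

    def learn_one(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)


def prequential_scores(detector, stream):
    """Score each observation before the detector learns from it."""
    scores = []
    for x in stream:
        scores.append(detector.score_one(x))
        detector.learn_one(x)
    return scores


def batch_zscores(values):
    """Non-streaming baseline: |z-score| computed with the whole dataset."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [abs(v - mean) / std for v in values]


# Toy series with two labelled anomalies (made-up data).
stream = [1.0, 1.2, 0.9, 1.1, 1.0, 8.0, 1.05, 0.95, 1.1, 7.5, 1.0]
labels = [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0]

streaming_auc = roc_auc(labels, prequential_scores(RunningZScore(), stream))
batch_auc = roc_auc(labels, batch_zscores(stream))
```

Scoring before learning keeps the streaming side honest, since the detector never sees a point before being asked about it, whereas the batch baseline has access to the full dataset. Using the same ranking metric on both sides makes the two sets of results directly comparable.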
Addition to River
Implement one of the proposed LOF variants referenced in the "problem" section in river and compare it to the current half space tree method or other methods from pySAD.
Question
Before starting this work, I wanted to make sure that adding the LOF streaming method is in line with the goals of the library and would be something useful. Any ideas or suggestions on good benchmark datasets (I'm currently compiling a list of commonly used ones in the literature) or methodology improvements are also welcome.