Adding Local Outlier Factor (LOF) algorithms #706

alexbeattie42 · 2021-09-14T08:40:29Z


Overview

I am working on my Master's thesis on outlier detection in streaming data. The only anomaly detection method I have found so far in river is half space trees. Per the documentation: “Half space trees work well when anomalies are spread out but do not work well if anomalies are packed together in windows.”

Problem

I have been reading about various Local Outlier Factor (LOF) implementations for streaming data. 1 and 2 seem fairly promising. Adding an LOF algorithm to improve anomaly detection is already on the river roadmap, so it seems like a desired feature.
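For readers unfamiliar with LOF, here is a minimal sketch of the classic batch definition (k-distance, reachability distance, local reachability density, then the LOF ratio). It is not one of the streaming variants referenced above and not river code; the function names `euclidean` and `lof_scores` are made up for illustration, ties at the k-th neighbour are simply truncated, and the points are assumed to be distinct enough that the mean reachability distance is never zero.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def lof_scores(points, k):
    """Batch LOF score for every point, using O(n^2) distance computations.

    Hypothetical helper for illustration only. Scores close to 1 mean the
    point is about as dense as its neighbours; larger scores mean it is
    sparser than its neighbours, i.e. more outlying.
    """
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(points)")

    # k nearest neighbours of each point (itself excluded), as (distance, index)
    neighbours = []
    for i in range(n):
        dists = sorted(
            (euclidean(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbours.append(dists[:k])

    # k-distance: distance from each point to its k-th nearest neighbour
    k_dist = [nb[-1][0] for nb in neighbours]

    # Local reachability density: inverse of the mean reachability distance,
    # where reach_dist(i, j) = max(k_dist[j], dist(i, j)).
    # Assumes the mean is positive (no cluster of exact duplicates).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbours[i]]
        lrd.append(1.0 / (sum(reach) / k))

    # LOF: mean density of the neighbours divided by the point's own density
    return [
        sum(lrd[j] for _, j in neighbours[i]) / (k * lrd[i]) for i in range(n)
    ]


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
# The four clustered points score about 1; the isolated point scores much higher.
```

Note that inserting a new point can change the neighbour lists, k-distances, and densities of points that are already stored, which is what makes a truly incremental version harder than this batch form.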

Proposal

Use a few standard streaming benchmark datasets as well as my group's private data to perform some tests.

Methods

For the first set of tests I plan to include existing streaming methods (e.g. river’s half space trees outlier detection method, pysad, etc.). Then I plan to compare those results to results obtained by running the entire dataset through a non-streaming approach (e.g. pyOD methods, statistical methods, etc.) to see whether the current streaming approaches provide comparable results. My hypothesis is that they will not in certain cases, which leads to the next part.

Addition to River

Implement one of the proposed LOF variants referenced in the "Problem" section in river and compare it to the current half space trees method or other methods from pysad.

Question

Before starting this work, I wanted to make sure that adding the LOF streaming method is in line with the goals of the library and would be something useful. Any ideas or suggestions on good benchmark datasets (I'm currently compiling a list of commonly used ones in the literature) or methodology improvements are also welcome.

MaxHalford · 2021-09-16T17:13:48Z

Maintainer

Hello, thank you for your motivation.

> Before starting this work, I wanted to make sure that adding the LOF streaming method is in line with the goals of the library and would be something useful.

I see no problem. In fact, online anomaly detection is in demand, so a contribution would be very welcome.

The first step is to see to what extent LOF can be performed online. We don't want semi-batch approaches that aren't truly online. Ideally, the complexity of learn_one should be O(1).
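To illustrate what an O(1) learn_one looks like in practice, here is a toy sketch. It is not LOF and not part of river: the class name `RunningZScore` is hypothetical, and `score_one` is assumed here as the scoring counterpart to `learn_one`. Each update touches three numbers (Welford's running mean and variance), so time and memory per observation stay constant no matter how long the stream is.

```python
import math


class RunningZScore:
    """Toy online anomaly scorer for a single numeric feature (not LOF).

    Hypothetical class for illustration: it keeps only a count, a running
    mean, and a running sum of squared deviations (Welford's algorithm).
    """

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def learn_one(self, x):
        # O(1) update: no past observations are stored or revisited
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return self

    def score_one(self, x):
        # Absolute z-score under the statistics seen so far; 0 until the
        # sample variance is defined and non-zero
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        if std == 0:
            return 0.0
        return abs(x - self.mean) / std


detector = RunningZScore()
for value in [10.0, 10.2, 9.9, 10.1, 9.8]:
    detector.learn_one(value)

typical = detector.score_one(10.0)  # small score
unusual = detector.score_one(50.0)  # large score
```

A streaming LOF would need to maintain neighbourhood information as well, so the open question is how close its per-observation cost can get to this constant-time ideal.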

> Any ideas or suggestions on good benchmark datasets (I'm currently compiling a list of commonly used ones in the literature) or methodology improvements are also welcome.

I'm sure you'll find the "good" ones by yourself. There aren't that many. If you do find one that we don't have, feel free to add it to the datasets module.

Meta: you can make use of Markdown syntax to insert links. You don't have to copy/paste in plain text.

12 replies

jinxmirror13 Oct 1, 2021

No problem publishing it elsewhere so more users can access it! I ran into some issues trying to publish it with them, but I am reaching out to see what can be done. Thank you for the recommendation @alexbeattie42 and @MaxHalford!

jinxmirror13 Oct 13, 2021

So I didn't manage to get UCI to work, but discovered an internal service at Imperial which did. You should be able to get the current version of the full dataset by running this:

wget https://data.hpc.imperial.ac.uk/resolve/\?doi\=9422\&file\=4\&access\= -O full_BETH_dataset.zip

MaxHalford Oct 13, 2021
Maintainer

That's great, thanks a lot. I'll add the dataset very soon. Out of interest, is that the university you are affiliated with, or can anyone theoretically store datasets there?

jinxmirror13 Oct 13, 2021

You have to be affiliated, but maybe other universities are starting to offer similar services?

MaxHalford Oct 13, 2021
Maintainer

Yes, you're right. It's just something I haven't looked into yet. Thanks again! I'll ping everyone in this discussion once I've added the dataset.

Edit: the dataset is quite complex, and there are different ways to process it. I'll have to take some more time later to add it!
