Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation of LODA (Lightweight On-line Detection of Anomalies) #1342

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

hoanganhngo610
Copy link
Contributor

LODA (Lightweight On-line Detection of Anomalies) is a popular static/incremental anomaly detection algorithm. LODA is implemented in various frameworks for anomaly detection, including PyOD and PySAD.

Unlike various other algorithms, this implementation of LODA relies primarily on numpy, since the function np.histogram plays an important part in the learning phase and calculation of anomaly score. This implementation is also an adaptation of previous implementations of PyOD and PySAD to River.

@hoanganhngo610
Copy link
Contributor Author

hoanganhngo610 commented Oct 20, 2023

@MaxHalford Would you mind having a look at this PR, and also on why River build is failing for 3.11? Thank you so much in advance!

@MaxHalford
Copy link
Member

Hey @hoanganhngo610!

Good job on taking the initiative to implement this.

The main feedback I have is that we should avoid using numpy for new implementations. As you know, we prefer working exclusively with dicts. I know you want to use numpy's histogram: we have sketch.Histogram which may do the trick. Could you take a look at using that instead?

Regarding CI, I'm aware we have an issue. It's on my todo list :)

@hoanganhngo610
Copy link
Contributor Author

Hey @MaxHalford ,

Thank you so much for your response, and also thank you so much for having a look at my PR so promptly.

Regarding the implementation using dictionaries and particularly sketch.Histogram, actually that was the first idea coming to my mind. However, upon comparing the results returned by np.Histogram versus, sketch.Histogram, unfortunately, the results are significantly different.

For example, we can take a look at the example provided in River with sketch.Histogram itself:

from river import sketch
import numpy as np
np.random.seed(42)
values = np.hstack((
    np.random.normal(-3, 1, 1000),
    np.random.normal(3, 1, 1000),
))
hist = sketch.Histogram(max_bins=10)
for x in values:
    hist = hist.update(x)
for bin in hist:
    print(bin)

which will return

[-6.19837, -5.84386]: 4
[-5.52527, -4.29012]: 79
[-4.28605, -2.54958]: 583
[-2.54880, -1.36430]: 281
[-1.35756, 0.37807]: 57
[0.46562, 1.77854]: 110
[1.78993, 2.70345]: 253
[2.70739, 4.37628]: 544
[4.37893, 5.44241]: 85
[5.54414, 6.09777]: 4

However, with numpy, under

np.histogram(values, bins=10)

the result would be

(array([ 12, 205, 464, 277,  44,  35, 234, 446, 257,  26]),
 array([-6.19836796, -4.96875389, -3.73913983, -2.50952576, -1.2799117 ,
        -0.05029763,  1.17931644,  2.4089305 ,  3.63854457,  4.86815863,
         6.0977727 ]))

As such, to ensure that the results are consistent with other implementation in PyOD or PySAD, I decided to continue with the use of numpy instead of dictionary. Hope that this answer is satisfactory and can justify my implementation!

@MaxHalford
Copy link
Member

@hoanganhngo610 I believe the difference comes from the fact the histogram definitions in both cases. The histogram in numpy has equal bin widths, while the one in River is adaptive. The thing is, equal bin widths is not difficult to implement online.

I've actually started #1344 to remove dict2numpy and numpy2dict. They were not actually used anymore. I really believe we should not allow ourselves to use numpy anymore. I get it, it's practical. But as you know there's a price to pay to initialize numpy arrays online. They're made to do computation on batch data, not online data.

@hoanganhngo610
Copy link
Contributor Author

@MaxHalford Thank you for clarifying that! In that case, I would proceed to modify this PR to use dictionaries andsketch.Histogram as discussed.

@MaxHalford
Copy link
Member

Actually what I'm saying is that River's histogram is not based on equi-width bins. If you use it, it might not yield optimal performance. I suggest doing some speed/accuracy benchmarks with River vs. numpy. If it's more accurate with numpy's equi-width bins, then let's implement them with dictionaries, because it's trivial.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants