Estimated run time of LOF? #11

song-william · 2016-07-27T19:37:55Z

The performance evaluation of LOF in the paper shows that the algorithm takes about 2000 seconds to detect outliers in a 10-dimensional dataset with 200000 data points. However, when I run this implementation of LOF, it takes about 2400 seconds on a 7-dimensional dataset with just 1000 points. Why the huge discrepancy in performance?

I'm just getting started with LOF and just want to experiment with it a little. I'm sorry if I'm just misinterpreting the paper.

damjankuznar · 2016-07-28T07:47:35Z

This implementation of LOF is deliberately straightforward and was written without any performance considerations, which is why you are seeing the difference. The paper mentions the use of a materialization database (pre-computed neighbors for each instance in the data set) and an index for kNN queries, neither of which is implemented in pylof. Pylof is also implemented in Python (vs. Java in the paper) and deliberately does not use any third-party library (e.g. numpy) to improve performance.
The reason for this is that I wanted a valid implementation of LOF first and planned to deal with performance issues only if they arose. However, I only worked with small data sets, never ran into performance issues, and so never pursued this.
Nonetheless, I would be motivated to improve this if there is any interest from the community. I would also welcome any contributions.
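
To illustrate the materialization idea mentioned above, here is a minimal pure-Python sketch (not pylof's actual code). It computes each point's k nearest neighbors once, stores them in a table, and derives the LOF scores from that table instead of repeating the neighbor search. The names `knn_table` and `lof_scores` are hypothetical. The sketch assumes Euclidean distance, takes exactly k neighbors (ties at the k-distance are not expanded, unlike the full definition in the paper), and assumes no point has k or more exact duplicates.

```python
import math


def knn_table(points, k):
    """Materialize the k nearest neighbors of every point as (distance, index) pairs."""
    table = []
    for i, p in enumerate(points):
        # Brute-force search; this is the step an index (KD-tree etc.) would replace.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        table.append(dists[:k])
    return table


def lof_scores(points, k):
    """Compute LOF scores from the materialized neighbor table."""
    table = knn_table(points, k)
    # k-distance of each point: distance to its k-th nearest neighbor.
    k_dist = [nbrs[-1][0] for nbrs in table]

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for nbrs in table:
        reach = [max(k_dist[j], d) for d, j in nbrs]
        lrd.append(len(reach) / sum(reach))  # assumes sum(reach) > 0

    # LOF: mean ratio of the neighbors' density to the point's own density.
    return [
        sum(lrd[j] for _, j in nbrs) / (len(nbrs) * lrd[i])
        for i, nbrs in enumerate(table)
    ]


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    for point, score in zip(data, lof_scores(data, k=2)):
        print(point, round(score, 2))
```

The table costs one pass of neighbor searches; every later quantity (k-distance, reachability, density, LOF) is a lookup. The brute-force search is still quadratic in the number of points, so this only removes the repeated work, not the search cost itself.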

damjankuznar · 2016-08-02T07:44:27Z

I made an improved implementation of LOF using NumPy that runs much faster. It is currently in the numpy branch (see a66218f). This branch also has an updated README.md with an added section on performance.

@Mistasong39: Could you please test the new implementation on your data set? If everything is OK, I will make this branch the new master, since the current pure Python implementation is not practical for performance reasons.

song-william · 2016-08-04T22:05:19Z

Thanks for the conversion to numpy, @damjankuznar. It works a lot faster, and now I can input a numpy array directly instead of having to convert it to a list of tuples. My 7-dimensional dataset with 1000 points now runs in just 18 seconds. It's ready for master.

However, the actual dataset I'm trying to build up to is a dimensional dataset with 300,000 points. This would still take about 3 days to run. Is there a way to use sklearn's nearest neighbor implementations, KDTree or BallTree, in this implementation of LOF? A discussion of these algorithms can be found here. These indexing structures should also help improve performance.

Thanks for the rapid response. Let me know what you think.
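
As a rough sketch of what the request above could look like (this is not part of pylof), the neighbor search can be delegated to scikit-learn's `NearestNeighbors` with `algorithm="kd_tree"` or `"ball_tree"`, and the remaining LOF arithmetic done in NumPy. The function name `lof_with_tree` is hypothetical. The sketch assumes Euclidean distance, uses exactly k neighbors per point without expanding ties, and assumes no point has k or more exact duplicates.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def lof_with_tree(X, k, algorithm="kd_tree"):
    """LOF scores with the k-NN search done by a tree index (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    nn = NearestNeighbors(n_neighbors=k, algorithm=algorithm).fit(X)
    # Calling kneighbors() without a query returns neighbors of the fitted
    # points and does not count a point as its own neighbor.
    dist, idx = nn.kneighbors()

    k_dist = dist[:, -1]                    # distance to the k-th neighbor
    reach = np.maximum(dist, k_dist[idx])   # reachability distances
    lrd = 1.0 / reach.mean(axis=1)          # local reachability density
    return lrd[idx].mean(axis=1) / lrd      # LOF score per point


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    print(lof_with_tree(data, k=2).round(2))
```

Passing `algorithm="ball_tree"` switches the index without changing the rest of the code. How much a tree helps depends on the dimensionality and distribution of the data, so it is worth timing both options on a sample before committing to a full run.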

arpit1997 · 2016-10-29T18:06:46Z

Hey, the outlier implementation is awesome, but I want to suggest a few improvements to the code.

The source code should be put in a folder named pylof so that it stands apart from the other, more mundane files.
Maybe you can try packaging it by including a setup.py file, so that it is easier to reuse.
For tests, you can also make a new directory.
If the simulation is slow, then maybe you can try using the CUDA framework (computation on the GPU instead of the CPU). There is a Python wrapper available named PyCUDA.
😸 😸
