Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Estimated run time of LOF? #11

Open
song-william opened this issue Jul 27, 2016 · 4 comments
Open

Estimated run time of LOF? #11

song-william opened this issue Jul 27, 2016 · 4 comments

Comments

@song-william
Copy link

The performance evaluations of LOF in the paper shows that it takes about 2000 seconds for the algorithm to detect outliers in a 10-dimensional dataset with 200000 data points. However, when I run this implementation of LOF it takes about 2400 seconds to run a 7-dimensional dataset with just 1000 points. Why the huge discrepancy in performance?

I'm just getting started with LOF and just want to experiment with it a little. I'm sorry if I'm just misinterpreting the paper.

@damjankuznar
Copy link
Owner

This implementation of LOF is very straightforward and without any performance considerations, which is the reason you are seeing the difference - the paper mentions the use of materialization database (pre-computed neighbors for each instance in the data set) and index for knn queries which is not implemented in pylof. Pylof is also implemented in Python (vs Java in the paper) and deliberately does not use any third party library (e.g. numpy) to improve performance.
The reason for this is that I wanted a valid implementation of LOF first and then deal with performance issues if they would arise. However, I never had performance issues (small data sets) and so I never pursued this.
Nonetheless, I would be motivated in improving this if there is any interest from the community. I would also welcome any contributions.

@damjankuznar
Copy link
Owner

I made an improved implementation of LOF using Numpy which works much faster which is currently in branch numpy (see a66218f). This branch also has an updated README.md with added section on performance.

@Mistasong39: Could you please test the new implementation on your data set? If everything is OK, then I will move this branch to be the new master, since the current pure Python implementation is not practical due to performance reasons.

@song-william
Copy link
Author

song-william commented Aug 4, 2016

Thanks for the conversion to numpy @damjankuznar . It works a lot faster and now I can just input a numpy array directly instead of having to convert it to a list of tuples. My 7-dimensional dataset with 1000 points now runs in just 18 seconds. It's ready for master.

However, the actual dataset I'm trying to build up to is a dimensional dataset with 300,000 points. This would still take about 3 days to run. Is there a way to import sklearn's nearest neighbor implementations of KDtree or BallTree into this implementation of LOF? A discussion of these algorithms can be found here. These indexing structures should also help improve performance.

Thanks for the rapid response. Let me know what you think.

@arpit1997
Copy link

arpit1997 commented Oct 29, 2016

Hey the outliers implementation is awesome but i ant to suggest some things about the code improvement .

  • The source code should be put in a folder named pylof so that is stand aside other mundane things.
  • May be you can try packaging it by including a setup.py file. so that it could be more easy to reuse.
  • for tests you can make a new directory also.
  • If the simulation is slow then may be you can try using CUDA framework (computation using GPU instead of CPU). There is python wrapper available named PyCUDA.
    😸 😸

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants