-
Notifications
You must be signed in to change notification settings - Fork 155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Estimated run time of LOF? #11
Comments
This implementation of LOF is very straightforward and without any performance considerations, which is the reason you are seeing the difference - the paper mentions the use of materialization database (pre-computed neighbors for each instance in the data set) and index for knn queries which is not implemented in pylof. Pylof is also implemented in Python (vs Java in the paper) and deliberately does not use any third party library (e.g. numpy) to improve performance. |
I made an improved implementation of LOF using Numpy which works much faster which is currently in branch @Mistasong39: Could you please test the new implementation on your data set? If everything is OK, then I will move this branch to be the new master, since the current pure Python implementation is not practical due to performance reasons. |
Thanks for the conversion to numpy @damjankuznar . It works a lot faster and now I can just input a numpy array directly instead of having to convert it to a list of tuples. My 7-dimensional dataset with 1000 points now runs in just 18 seconds. It's ready for master. However, the actual dataset I'm trying to build up to is a dimensional dataset with 300,000 points. This would still take about 3 days to run. Is there a way to import sklearn's nearest neighbor implementations of KDtree or BallTree into this implementation of LOF? A discussion of these algorithms can be found here. These indexing structures should also help improve performance. Thanks for the rapid response. Let me know what you think. |
Hey the outliers implementation is awesome but i ant to suggest some things about the code improvement .
|
The performance evaluations of LOF in the paper shows that it takes about 2000 seconds for the algorithm to detect outliers in a 10-dimensional dataset with 200000 data points. However, when I run this implementation of LOF it takes about 2400 seconds to run a 7-dimensional dataset with just 1000 points. Why the huge discrepancy in performance?
I'm just getting started with LOF and just want to experiment with it a little. I'm sorry if I'm just misinterpreting the paper.
The text was updated successfully, but these errors were encountered: