-
Notifications
You must be signed in to change notification settings - Fork 415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance considerations #57
Comments
@NinjaTuna , I just released v0.8, where some of slowness of k-prototypes has been alleviated (#45). What version are you basing your benchmark on? |
I am running 0.8, installed using pip in Python 3. To be complete: we are talking about a dataset of 100k records, with 14 scalar features and 6 categorical ones. I am happy to further provide whatever information is helpful. |
The k-prototypes implementation in this package is heavily based on the (pretty optimized) k-modes implementation. My focus was initially more focused on readability than optimization, so I'm sure there is room for improvement on the k-means side of k-prototypes. The scikit-learn version of k-means is likely highly optimized. For example, I see that the squared norms of data points are pre-computed. I have no intentions currently to implement further speed optimizations for k-prototypes, but PRs are of course welcome. Theoretically, I see no reason why k-prototypes would be much slower than k-means. There might be some minor overhead because you have to do both k-modes and k-means updates, but of course you only do the former on the categorical and the latter on the numerical features so the computational complexity wouldn't change. |
Good Morning, thank you for your work. Thank you |
This is not a parallel implementation, and doing so would require lots of work, unfortunately. With 2M rows, you'll be in a good position to profile the code, finding bottlenecks and report back here. :) Finally, just an idea, but perhaps you can convert your problem to an all-numeric one, so you can use scikit-learn's k-means. |
I'm working on an implementation using Cython, similar to how sklearn achieves its performance. Currently we're at about 100x speedup for K-prototypes, which makes it more usable for bigger datasets, but not yet quite as fast sklearn's K-means. Parallel execution is also a possibility, but my priority right now. @nicodv, if you're interested, we can discuss incorporating my changes into this repository. I have to note, though, that moving stuff to Cython has not been quite friendly to your original codebase. K-modes is not implemented as of yet. |
@rphes , that is awesome! Would love to see what you did. I understand it might require a painful rewrite of the original code base, but let's have a look and we'll see. I've once reached out to the scikit-learn guys. Because the API of k-prototypes class is different than other cluster algorithms (the |
Great! That's good to hear. You can check out my fork. It is still very much a work in progress. The repo you link seems like a good starting point, but I think we can modify the API or convince the people at sklearn so that it might be included after all. The need for a categorical parameter seems a technicality which can be solved or worked around. |
Hi, sry for the delay to answer. |
Hey @rphes that speedup certainly sounds promising, may I ask for some doc about how exactly to use your fork to test it out (how to install it as a global package and/or import it after a cython build)? |
Hi @KristofMorva, nice to see you're interested. First up, a big disclaimer, as I'm pretty sure I broke at least KModes and the original API of this package for my own intentions. The fork is not production-ready at-all, but last time I checked (which is a while ago), it worked (meaning K-Prototypes worked).
mkdir kmodes_test && cd kmodes_test
python3 -m venv env
source env/bin/activate
git clone [email protected]:rphes/kmodes && cd kmodes
pip install -e . The You can now import the package using import cluster
# or perhaps better:
from cluster import KPrototypes (I think I renamed it) Check it out and let me know if it worked! |
@rphes thank you very much! I think it would be worth to integrate the contents of the two repositories, so the efforts can be more focused. |
Also, I have stumbled upon this new paper, which might help in performance (I have only read the abstract): http://www.mdpi.com/2073-8994/9/4/58/pdf |
@nicodv Is there any parameter we can somehow fasten the K-Prototype Clustering. Like I am trying get clusters from 3L data points. And it is taking a long time to find. Please guide. Thanks |
I am genuinely curious why the k-prototypes implementation takes 2 hours to complete for a certain dataset, whereas clustering using sklearn's k-means only takes about a minute. I've scrolled through Huang's paper, which states that k-prototypes is an extension of k-means to function properly in the presence of categorical data.
Is sklearn's k-means so well-optimized, is the computational complexity of Huang's algorithm higher than he lets on, or is there something else at play? I'd love a discussion on this.
The text was updated successfully, but these errors were encountered: