-
Notifications
You must be signed in to change notification settings - Fork 415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parallel implementation #76
Comments
Great, thanks for your contribution, @rphes ! I'll try to find some time to go over the code in detail, but it looks good at first glance. FYI, beyond using your own data set, you could play around with the benchmark script in the A PR with parallel implementations for both algorithms is very welcome. I suggest the default is to set |
@rphes , I've merged the PR. As part of this ticket, could you please could update the examples and readme to showcase this new feature? |
Great! Will do |
Hee, het valt me nu pas op dat je aan mijn oude universiteit studeert! 😄 |
Haha, 'vo! Een maandje nog en dan ben ik klaar! Ik had ook wel zo'n vermoeden door je naam en de techjob in de states. |
This was merged. Thanks for the neat contribution, @rphes ! |
Hello again,
The K-Prototypes algorithm continues to intrigue me, so I played around with it some more. I did the Cython implementation before, as discussed in #57, but the changes required for that were so far-reaching I can imagine merging the back into this repository is challenging.
In the meantime, I implemented both K-Modes and K-Prototypes in a parallel fashion, basically copying the strategy applied in scikit-learn, namely doing multiple runs with different initializations in parallel using joblib.
Minimal code changes are required to accomplish this, so it might make a better initial candidate to improve performance. Most changes are related to the way the algorithms deal with random variables, as this becomes a little more difficult to keep reproducible when adding threading into the mix.
I did some testing on a dataset that I unfortunately cannot share, the results of which are in this gist:
https://gist.github.com/rphes/48569eb0c929d33deef18c9de0d96aa8
Interesting observations are that K-Prototypes benefits from multithreading whereas K-Modes really does not. I think this might be due to the very low complexity of the K-Modes algorithm, making it memory-bound. Threading therefore only introduces additional overhead due to the forking process and resource contention between threads.
K-Prototypes performs some actual computations, so it is less affected by this, but still we see performance increase level off at 4 cores, presumable for the same reason.
Now, if you'd like to incorporate this stuff, I'm more than happy to submit a PR, with a parallel version of K-Prototypes, or both algorithms with added tests and documentation. The only caveat is that users need to consider whether their use-case is actually going to see performance increase when throwing more cores at it, as my example shows.
The API is completely backwards compatible, so that should not be an issue.
Check out the code here:
https://github.com/rphes/kmodes/tree/parallel
and let me know what you think!
The text was updated successfully, but these errors were encountered: