
Parallel implementation #76

Closed
rphes opened this issue Jun 2, 2018 · 6 comments
@rphes
Contributor

rphes commented Jun 2, 2018

Hello again,

The K-Prototypes algorithm continues to intrigue me, so I played around with it some more. I did the Cython implementation before, as discussed in #57, but the changes required for that were so far-reaching that I can imagine merging them back into this repository would be challenging.

In the meantime, I implemented both K-Modes and K-Prototypes in a parallel fashion, basically copying the strategy applied in scikit-learn, namely doing multiple runs with different initializations in parallel using joblib.
Minimal code changes are required to accomplish this, so it might make a better initial candidate to improve performance. Most changes are related to the way the algorithms deal with random variables, as this becomes a little more difficult to keep reproducible when adding threading into the mix.
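For reference, the strategy described above can be sketched roughly as follows. This is a minimal illustration of the scikit-learn approach using joblib, not the actual kmodes code; `_single_run`, `fit_parallel`, and the placeholder clustering logic are made up for the sketch. The key point is drawing one seed per initialization up front, which keeps results reproducible regardless of scheduling:

```python
import numpy as np
from joblib import Parallel, delayed


def _single_run(data, n_clusters, seed):
    """One clustering run with its own seed (hypothetical stand-in
    for a single k-modes initialization + iteration loop)."""
    rng = np.random.RandomState(seed)
    # ... initialize centroids from rng, iterate until convergence ...
    labels = rng.randint(n_clusters, size=len(data))  # placeholder result
    cost = float(rng.uniform())                       # placeholder cost
    return labels, cost


def fit_parallel(data, n_clusters, n_init=10, n_jobs=2, random_state=0):
    # Draw one seed per run up front, so results are reproducible
    # no matter which worker finishes first.
    seeds = np.random.RandomState(random_state).randint(
        np.iinfo(np.int32).max, size=n_init)
    results = Parallel(n_jobs=n_jobs)(
        delayed(_single_run)(data, n_clusters, seed) for seed in seeds)
    # Keep the run with the lowest cost, as scikit-learn's KMeans does.
    return min(results, key=lambda r: r[1])
```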

I did some testing on a dataset that I unfortunately cannot share, the results of which are in this gist:
https://gist.github.com/rphes/48569eb0c929d33deef18c9de0d96aa8

An interesting observation is that K-Prototypes benefits from multithreading whereas K-Modes really does not. I think this might be due to the very low computational complexity of the K-Modes algorithm, which makes it memory-bound; threading then only introduces additional overhead from the forking process and resource contention between threads.
K-Prototypes performs some actual computation, so it is less affected by this, but we still see the performance increase level off at 4 cores, presumably for the same reason.

Now, if you'd like to incorporate this, I'm more than happy to submit a PR with a parallel version of K-Prototypes, or of both algorithms, with added tests and documentation. The only caveat is that users need to consider whether their use case will actually see a performance increase from throwing more cores at it, as my example shows.
The API is completely backwards compatible, so that should not be an issue.

Check out the code here:
https://github.com/rphes/kmodes/tree/parallel
and let me know what you think!

@nicodv
Owner

nicodv commented Jun 11, 2018

Great, thanks for your contribution, @rphes ! I'll try to find some time to go over the code in detail, but it looks good at first glance.

FYI, beyond using your own data set, you could play around with the benchmark script in the examples folder. I'm sure you can find scenarios that show performance improvements on K-Modes too.

A PR with parallel implementations of both algorithms is very welcome. I suggest the default be n_jobs=1, just as sklearn does.

@rphes rphes mentioned this issue Jul 19, 2018
@nicodv
Owner

nicodv commented Jul 19, 2018

@rphes , I've merged the PR.

As part of this ticket, could you please update the examples and readme to showcase this new feature?

@rphes
Contributor Author

rphes commented Jul 19, 2018

Great! Will do

@nicodv
Owner

nicodv commented Jul 19, 2018

Hey, I only just noticed that you're studying at my old university! 😄

@rphes
Contributor Author

rphes commented Jul 19, 2018

Haha, cheers! One more month and then I'm done! I did have a suspicion, given your name and the tech job in the States.

@nicodv
Owner

nicodv commented Jul 24, 2018

This was merged. Thanks for the neat contribution, @rphes !

@nicodv nicodv closed this as completed Jul 24, 2018