Parallel implementation #76

rphes · 2018-06-02T13:19:28Z

Hello again,

The K-Prototypes algorithm continues to intrigue me, so I played around with it some more. I did the Cython implementation before, as discussed in #57, but the changes required for that were so far-reaching that I can imagine merging them back into this repository would be challenging.

In the meantime, I implemented both K-Modes and K-Prototypes in a parallel fashion, basically copying the strategy applied in scikit-learn, namely doing multiple runs with different initializations in parallel using joblib.
Minimal code changes are required to accomplish this, so it might make a better initial candidate to improve performance. Most changes are related to the way the algorithms deal with random variables, as this becomes a little more difficult to keep reproducible when adding threading into the mix.
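
To illustrate the idea, here is a minimal sketch of the strategy, not the code from the branch linked in this issue. It uses a toy pure-Python K-Modes run and the standard library's `concurrent.futures` in place of joblib so that it stays self-contained; the function names (`kmodes_single_run`, `kmodes_parallel`) and the seeding scheme are assumptions made for illustration. The point it shows is that each run gets its own seed derived from one master seed, so the result does not depend on how many workers are used.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def kmodes_single_run(data, n_clusters, seed, max_iter=20):
    """One toy K-Modes run that owns a private random generator."""
    rng = random.Random(seed)  # no shared global RNG state between workers
    centroids = rng.sample(data, n_clusters)
    labels = None
    for _ in range(max_iter):
        # Assign every record to the nearest centroid (Hamming distance).
        new_labels = [
            min(range(n_clusters), key=lambda k: hamming(row, centroids[k]))
            for row in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the run has converged
        labels = new_labels
        # Update each centroid to the per-attribute mode of its members.
        for k in range(n_clusters):
            members = [row for row, lab in zip(data, labels) if lab == k]
            if members:
                centroids[k] = tuple(
                    Counter(col).most_common(1)[0][0] for col in zip(*members)
                )
    cost = sum(hamming(row, centroids[lab]) for row, lab in zip(data, labels))
    return cost, labels, centroids


def kmodes_parallel(data, n_clusters, n_init=4, n_jobs=1, random_state=0):
    """Run n_init independent initializations and keep the cheapest one."""
    # Draw all seeds up front from one master generator, so the outcome is
    # identical for any n_jobs (hypothetical scheme, chosen for illustration).
    master = random.Random(random_state)
    seeds = [master.randrange(2**32) for _ in range(n_init)]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        results = list(
            pool.map(lambda s: kmodes_single_run(data, n_clusters, s), seeds)
        )
    return min(results, key=lambda r: r[0])
```

Threads are used here only to keep the sketch short; for CPU-bound pure-Python work they give no real speed-up, which is why a process-based backend such as joblib's is the more realistic choice.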

I did some testing on a dataset that I unfortunately cannot share, the results of which are in this gist:
https://gist.github.com/rphes/48569eb0c929d33deef18c9de0d96aa8

Interesting observations are that K-Prototypes benefits from multithreading whereas K-Modes really does not. I think this might be due to the very low complexity of the K-Modes algorithm, making it memory-bound. Threading therefore only introduces additional overhead due to the forking process and resource contention between threads.
K-Prototypes performs some actual computations, so it is less affected by this, but we still see the performance gain level off at 4 cores, presumably for the same reason.
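
A rough way to check this on your own data is to time the same fit for several worker counts and compare against the single-worker baseline. The helper below is only a sketch: `benchmark_n_jobs` and its `run` callback are hypothetical names, and `run(n_jobs)` is assumed to perform one complete fit with the given number of workers.

```python
import time


def benchmark_n_jobs(run, worker_counts=(1, 2, 4, 8), repeats=3):
    """Return {n_jobs: (best_seconds, speedup_vs_first_count)}."""
    timings = {}
    for n_jobs in worker_counts:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run(n_jobs)  # user-supplied: one full fit with n_jobs workers
            best = min(best, time.perf_counter() - start)
        timings[n_jobs] = best  # keep the best of several noisy repeats
    baseline = timings[worker_counts[0]]
    return {
        n: (t, baseline / t if t > 0 else float("inf"))
        for n, t in timings.items()
    }
```

A speed-up that stays close to 1.0, or that stops growing beyond a certain worker count, suggests the workload is dominated by overhead or memory bandwidth rather than computation.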

Now, if you'd like to incorporate this stuff, I'm more than happy to submit a PR, either with a parallel version of K-Prototypes only, or with both algorithms plus added tests and documentation. The only caveat is that users need to consider whether their use case is actually going to see a performance increase when throwing more cores at it, as my example shows.
The API is completely backwards compatible, so that should not be an issue.

Check out the code here:
https://github.com/rphes/kmodes/tree/parallel
and let me know what you think!

nicodv · 2018-06-11T17:13:03Z

Great, thanks for your contribution, @rphes ! I'll try to find some time to go over the code in detail, but it looks good at first glance.

FYI, beyond using your own data set, you could play around with the benchmark script in the examples folder. I'm sure you can find scenarios that show performance improvements on K-Modes too.

A PR with parallel implementations for both algorithms is very welcome. I suggest the default is to set n_jobs=1, just like sklearn does.

nicodv · 2018-07-19T17:13:18Z

@rphes , I've merged the PR.

As part of this ticket, could you please update the examples and readme to showcase this new feature?

rphes · 2018-07-19T17:15:34Z

Great! Will do

nicodv · 2018-07-19T17:20:31Z

Hey, I only just noticed that you're studying at my old university! 😄

rphes · 2018-07-19T17:32:59Z

Haha, nice! Just one more month and then I'm done! I already had a hunch as well, given your name and the tech job in the States.

nicodv · 2018-07-24T22:27:58Z

This was merged. Thanks for the neat contribution, @rphes !

nicodv added the enhancement label Jun 11, 2018

rphes mentioned this issue Jul 19, 2018

Parallel #83

Merged

nicodv closed this as completed Jul 24, 2018
