Determining the optimal number of clusters #46

eugeniahrho · 2017-06-10T20:07:21Z

Hi, I've been using kmodes (https://www.rdocumentation.org/packages/klaR/versions/0.6-12/topics/kmodes) from klaR, an R package, to cluster my data set. I wanted to try kmodes in Python to see if I get similar results. However, I don't see how I can determine the optimal number of clusters in the Python version of kmodes.

In the klaR package, I can use the $withindiff component to get the within-cluster simple-matching distance for each cluster. This allows me to calculate the sum of error for k = 2, 3, 4, etc. and select the optimal number of clusters based on the largest difference in the sum of error between successive values of k.

In the kmodes for python, how do you determine the optimal k?

nicodv · 2017-06-16T03:41:02Z

Simply by running the clustering for multiple k values, as there currently is no wrapper that does this for you automatically.

It would be nice to combine this with the silhouette plot mentioned here

PRs are welcomed. :)
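To make "running the clustering for multiple k values" concrete, here is a minimal sketch of the selection rule described in the original question: pick the k with the largest drop in total cost relative to the previous k. The helper names `largest_cost_drop` and `kmodes_costs` are hypothetical and not part of kmodes; the only library feature assumed is the `cost_` attribute of a fitted `KModes` model. The example costs are made up for illustration.

```python
def largest_cost_drop(costs):
    """Return the k whose cost improved most over the previous k.

    `costs` maps k -> total within-cluster cost for that k.
    """
    ks = sorted(costs)
    if len(ks) < 2:
        raise ValueError("need costs for at least two values of k")
    # Cost reduction obtained when moving from the previous k to this k.
    drops = {k: costs[prev] - costs[k] for prev, k in zip(ks, ks[1:])}
    return max(drops, key=drops.get)


def kmodes_costs(X, ks, **kwargs):
    """Fit KModes once per k and collect cost_ (requires the kmodes package)."""
    # Imported lazily so largest_cost_drop stays dependency-free.
    from kmodes.kmodes import KModes
    return {k: KModes(n_clusters=k, **kwargs).fit(X).cost_ for k in ks}


# Made-up costs, for illustration only.
example_costs = {2: 900.0, 3: 500.0, 4: 460.0, 5: 440.0}
best_k = largest_cost_drop(example_costs)
```

The largest-drop rule is a crude stand-in for reading an elbow plot by eye, so it is probably worth plotting the costs as well before settling on a value.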

dexdimas · 2018-05-23T18:00:30Z

And how do you determine the optimal k for the k-prototypes?

I am working on clustering data with mixed categorical and numerical attributes. When I stumbled across your k-prototypes implementation, I wanted to use it for my case. However, I'm a bit confused about how to evaluate the result of the k-prototypes algorithm (e.g. how to determine the optimal k).

Since it was mentioned that a silhouette plot would do the trick, I've been thinking of replacing the Euclidean distance with the k-prototypes cost function to determine the intra- and inter-cluster distances in the silhouette analysis.

Do you think that would work?
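One way to try this idea is sketched below, under stated assumptions: the pairwise dissimilarity is taken to be the squared Euclidean distance on the numerical attributes plus gamma times the number of mismatching categorical attributes, and the silhouette is computed from the resulting distance matrix. Both function names are hypothetical, the code is pure Python for clarity rather than speed, and whether squared distances are appropriate inside a silhouette is a judgment call.

```python
def mixed_dissimilarity(num_a, cat_a, num_b, cat_b, gamma):
    """Squared Euclidean distance on the numerical parts plus gamma times
    the number of mismatching categorical attributes."""
    numeric = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    categorical = sum(x != y for x, y in zip(cat_a, cat_b))
    return numeric + gamma * categorical


def silhouette_from_distances(dist, labels):
    """Mean silhouette coefficient from a full pairwise distance matrix."""
    n = len(labels)
    clusters = set(labels)
    if not 2 <= len(clusters) <= n - 1:
        raise ValueError("need between 2 and n - 1 clusters")
    scores = []
    for i in range(n):
        # Mean distance from point i to each cluster, excluding i itself.
        mean_to = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                mean_to[c] = sum(dist[i][j] for j in members) / len(members)
        own = labels[i]
        if own not in mean_to:  # singleton cluster: silhouette taken as 0
            scores.append(0.0)
            continue
        a = mean_to[own]
        b = min(v for c, v in mean_to.items() if c != own)
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / n


# Toy data, for illustration only: one numerical and one categorical column.
num = [[0.0], [0.1], [5.0], [5.1]]
cat = [["a"], ["a"], ["b"], ["b"]]
gamma = 0.5
dist = [[mixed_dissimilarity(num[i], cat[i], num[j], cat[j], gamma)
         for j in range(4)] for i in range(4)]
good = silhouette_from_distances(dist, [0, 0, 1, 1])
bad = silhouette_from_distances(dist, [0, 1, 0, 1])
```

If scikit-learn is available, the same matrix can instead be passed to `sklearn.metrics.silhouette_score(dist, labels, metric="precomputed")`. Either way the matrix is quadratic in the number of rows, so on large data this would have to be run on a sample.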

doyager · 2019-03-10T18:41:42Z

@dexdimas

Hi @dexdimas, @nicodv, all,

I am also working with k-prototypes and trying to find the optimal k. Could you please share your experience or approach for finding the optimal k with k-prototypes? It would be great if you could share some code and links.

Any suggestions for plotting very high-dimensional data? I am working with 56 features: 35 categorical columns (3 of them have about 10,000 categories, the others about 10-12 each), 11 numerical columns and 10 binary columns, with a data size of 80 million records.

PS: I am using health care data and trying to find patterns and outliers, in particular outliers that do not fit in with the normal clusters.

Thank you in advance, any help is appreciated.

supreetkt · 2019-03-21T16:59:52Z

Hi @nicodv,

I'm working on an implementation of silhouette score, which uses dissimilarity (between each element of the array) as a distance metric and gives the optimal number of clusters, k. What other metric would you consider as a good basis for silhouette score calculation?

PabloVergara · 2019-05-08T15:15:33Z

I use the silhouette for the numerical variables and keep using the cost for all variables,
with a small change here in kprototypes.py

and this piece of code in the implementation:

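import time
from sklearn import metrics
from kmodes.kprototypes import KPrototypes

# `data` is assumed to be a pandas DataFrame: columns 0-8 numerical, column 9 categorical.
lista = []
for nc in range(20, 23):
    start = time.time()
    kp = KPrototypes(n_clusters=nc, init='Cao', n_init=22, verbose=1, random_state=4, n_jobs=8)
    clusters = kp.fit_predict(data.values, categorical=[9])
    end = time.time()
    # Silhouette uses the numerical columns only; cost_ covers all columns.
    # kp.best only exists with the kprototypes.py change mentioned above.
    lista.append([nc, "Silhouette Coefficient: %0.3f" % metrics.silhouette_score(data.iloc[:, 0:9], kp.labels_), 'cost: %0.3f' % kp.cost_,
                  'time (s): %0.3f' % (end - start), 'best run: %0.3f' % (list(kp.best.keys())[0] + 1)])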

With this you can get a partial result.

matiasscorsetti · 2019-10-05T20:15:13Z

hello,

How do I calculate the silhouette score for k-prototypes if I have one silhouette score for the categorical data (Hamming) and one for the numerical data (Euclidean)?
Should I take a weighted average of the two coefficients according to the gamma value?

How would this weighted average be calculated?

It could be done this way:

( silhouette_category * kp.gamma ) + ( silhouette_numeric * (1 - kp.gamma ) )

thanks

arnaud-nt2i · 2021-03-12T16:34:03Z

@matiasscorsetti
gamma does not lie in [0, 1] (it is not a proportion); it can take any value in [0, +inf[

From reading the R implementation of "silhouette_kproto" line 1134 : Rdocumentation
(gamma is called lambda there)

It seems to me they weight the two silhouette values as follows:
( silhouette_category * gamma ) + ( silhouette_numeric )

but I may be wrong...

Any idea, @nicodv?
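A small numeric check of the point about gamma's range, as a sketch only. The first function is the formula proposed earlier in the thread; the second is a hypothetical normalised variant (not taken from kmodes or from the R package) whose weights gamma / (1 + gamma) and 1 / (1 + gamma) sum to 1 for any gamma >= 0. The input values are made up.

```python
def weighted_as_proportion(s_cat, s_num, gamma):
    # Formula proposed earlier in the thread; it is only a proper
    # weighted average when 0 <= gamma <= 1.
    return s_cat * gamma + s_num * (1 - gamma)


def normalised_weighting(s_cat, s_num, gamma):
    # Hypothetical alternative: the two weights always sum to 1.
    return (gamma * s_cat + s_num) / (1 + gamma)


# Made-up silhouette values with gamma > 1.
out_of_range = weighted_as_proportion(0.8, -0.2, 3.0)
in_range = normalised_weighting(0.8, -0.2, 3.0)
```

With gamma = 3 the first formula gives roughly 2.8, which is not a valid silhouette value, while the normalised variant stays within [-1, 1]. Neither is necessarily equivalent to a silhouette computed on the combined distance, so treat both as heuristics.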

nicodv added the enhancement label Jun 16, 2017
