Determining the optimal number of clusters #46
Comments
You can do this simply by running the clustering for multiple values of k; there is currently no wrapper that does this for you automatically. It would be nice to combine this with a silhouette plot. PRs are welcome. :)
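A minimal sketch of what "running the clustering for multiple k values" could look like. The names `make_kmodes` and `costs_for_k_range` are hypothetical helpers, not part of the package. The sketch assumes the `kmodes` package's `KModes` estimator, with its `n_clusters`, `init`, `n_init` and `random_state` parameters and the `cost_` attribute set after fitting.

```python
def make_kmodes(k):
    """Build a KModes estimator for k clusters (assumes kmodes is installed)."""
    # Imported lazily so costs_for_k_range itself has no hard dependency.
    from kmodes.kmodes import KModes
    return KModes(n_clusters=k, init="Huang", n_init=5, random_state=0)


def costs_for_k_range(X, k_values, make_model=make_kmodes):
    """Fit one model per k and return a dict {k: cost}.

    Any estimator that offers fit(X) and a cost_ attribute can be used,
    so a k-prototypes factory could be swapped in via make_model.
    """
    costs = {}
    for k in k_values:
        model = make_model(k)
        model.fit(X)
        costs[k] = model.cost_  # total dissimilarity of points to their cluster modes
    return costs
```

With the package installed, `costs_for_k_range(X, range(2, 9))` returns one cost per k, which can then be plotted against k to look for an elbow.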
And how do you determine the optimal k for k-prototypes? I am clustering data with mixed categorical and numerical attributes, and when I came across your k-prototypes implementation I wanted to use it for my case. However, I'm unsure how to evaluate the result of the k-prototypes algorithm (e.g. how to determine the optimal k). Since a silhouette plot was mentioned as an option, I've been thinking of replacing the Euclidean distance with the k-prototypes cost function to compute the intra- and inter-cluster distances in the silhouette analysis. Do you think that would work?
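One way the idea in this comment could be tried out, as a sketch only: compute the silhouette from a k-prototypes-style dissimilarity (squared Euclidean distance on the numerical attributes plus `gamma` times the number of categorical mismatches). The function names and the default `gamma=1.0` are assumptions for illustration. Whether the resulting score is a sound criterion for choosing k is the open question raised here, not something this code settles.

```python
def mixed_dissim(a_num, a_cat, b_num, b_cat, gamma):
    """Squared Euclidean distance on the numerical part plus
    gamma times the number of categorical mismatches."""
    num = sum((x - y) ** 2 for x, y in zip(a_num, b_num))
    cat = sum(x != y for x, y in zip(a_cat, b_cat))
    return num + gamma * cat


def silhouette_mixed(X_num, X_cat, labels, gamma=1.0):
    """Mean silhouette width using mixed_dissim as the pairwise distance."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Average dissimilarity from point i to the given members, excluding i.
        others = [j for j in members if j != i]
        return sum(
            mixed_dissim(X_num[i], X_cat[i], X_num[j], X_cat[j], gamma)
            for j in others
        ) / len(others)

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(mean_dist(i, m) for other, m in clusters.items() if other != lab)
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(labels)
```

This loops over all pairs of points, so it is only practical for small samples. For larger inputs, the same pairwise dissimilarities could be stored in a matrix and passed to scikit-learn's `silhouette_score` with `metric="precomputed"`.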
I am also working with k-prototypes and trying to find the optimal k value. Could you please share your experience or approach for finding the optimal k when using k-prototypes? It would be great if you could share some code and links. Any suggestions for plotting very high-dimensional data? I am working with 56 features: 35 categorical columns (3 columns have about 10,000 categories and all the others have about 10-12 categories), 11 numerical columns and 10 binary columns, with a data size of 80 million records. PS: I am using health care data and am trying to find patterns and outliers, in particular outliers that do not fit in with the normal clusters. Thank you in advance, any help is appreciated.
Hi @nicodv, I'm working on an implementation of the silhouette score that uses the dissimilarity between each pair of elements of the array as the distance metric and gives the optimal number of clusters, k. What other metric would you consider a good basis for the silhouette score calculation?
Hello, how do I calculate the silhouette score in k-prototypes if I have a silhouette score for the categorical data (Hamming) and a silhouette score for the numerical data (Euclidean)? How would this weighted average be calculated? It could be done this way:
Thanks
@matiasscorsetti From reading the R implementation of "silhouette_kproto" (line 1134, RDocumentation), it seems to me they weight both silhouette values as follows: but I may be wrong... Any idea, @nicodv?
Hi, I've been using kmodes (https://www.rdocumentation.org/packages/klaR/versions/0.6-12/topics/kmodes) from klaR, an R package, to cluster my data set. I wanted to try kmodes in Python to see if I get similar results. However, I don't see how I can determine the optimal number of clusters in the Python version of kmodes.
In the klaR package, I can use the $withindiff component to get the within-cluster simple-matching distance for each cluster. This allows me to calculate the sum of error for k = 2, 3, 4, etc. and select the optimal number of clusters based on the largest difference in the sum of error between consecutive runs with increasing k.
In kmodes for Python, how do you determine the optimal k?
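The selection rule described in this post (pick the k at which the sum of error drops the most compared with the previous k) can be sketched in a few lines. The function name `k_with_largest_drop` is hypothetical, and the `costs` dictionary is assumed to hold one total within-cluster dissimilarity per k, for example the fitted model's `cost_` value in the Python package.

```python
def k_with_largest_drop(costs):
    """Return the k whose cost fell the most relative to the previous k.

    costs maps k -> total within-cluster dissimilarity for that k.
    """
    ks = sorted(costs)
    if len(ks) < 2:
        raise ValueError("need costs for at least two k values")
    # Decrease in cost when moving from each k to the next larger k.
    drops = {k: costs[prev] - costs[k] for prev, k in zip(ks, ks[1:])}
    return max(drops, key=drops.get)
```

The smallest k in `costs` can never be selected, because it has no predecessor to compare against, so the range of k values should start one below the smallest candidate of interest.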