cluster selection based on stability is really sensitive to the selection of fall_off_size #43

Open
chongxi opened this issue Jul 7, 2016 · 1 comment

Comments

@chongxi

chongxi commented Jul 7, 2016

Thanks for providing this amazing library. I learned a lot from your implementation. Of all the clustering schemes I've tried, this is one of the best so far.

I spent a while studying your Cython implementation and the ideas behind it. The minimum spanning tree is remarkably informative. In practical use, however, I keep feeling there is a problem with the automatic cluster selection: it is just too sensitive to the choice of fall_off_size (min_sample). Besides, when two obviously distinct clusters are connected by a very few noisy points in between, they are likely to be merged into one. I understand the mutual reachability distance is meant to address that, but automatic selection based on stability seems to bring the quality down.
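To make the bridging concern concrete, here is a minimal standard-library sketch of the mutual reachability idea mentioned above. It is not the library's Cython code. The toy data, the helper names `core_distances` and `mutual_reachability`, and the convention that the core distance is the distance to the k-th nearest neighbour excluding the point itself are all assumptions made for illustration.

```python
import math

def core_distances(points, k):
    """Distance from each point to its k-th nearest neighbour (itself excluded).

    Assumed convention for this sketch; a real implementation may count
    the point itself among its neighbours.
    """
    cores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[k - 1])
    return cores

def mutual_reachability(points, k, a, b):
    """max(core(a), core(b), d(a, b)) for the points at indices a and b."""
    cores = core_distances(points, k)
    return max(cores[a], cores[b], math.dist(points[a], points[b]))

# Hypothetical toy data: two tight blobs joined by a sparse chain of noise.
left = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05)]
bridge = [(0.6, 0.05), (1.1, 0.05), (1.6, 0.05), (2.1, 0.05), (2.6, 0.05)]
right = [(3.0, 0.0), (3.1, 0.0), (3.0, 0.1), (3.1, 0.1), (3.05, 0.05)]
points = left + bridge + right

BLOB_PAIR = (0, 1)    # two neighbours inside the left blob
BRIDGE_PAIR = (6, 7)  # two neighbours on the noisy bridge

for k in (1, 3):
    blob_edge = mutual_reachability(points, k, *BLOB_PAIR)
    bridge_edge = mutual_reachability(points, k, *BRIDGE_PAIR)
    print(f"k={k}: blob edge {blob_edge:.3f}, bridge edge {bridge_edge:.3f}")
```

On this toy data the blob edge stays at about 0.1 for both values of k, while the bridge edge grows from about 0.5 to about 1.0. The bridge is pushed away, but how far depends directly on k, which is one way to see why the result can be sensitive to that parameter.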

Before the automatic selection, I think everything is perfect: you have a minimum spanning tree to cut, and a single linkage tree for any migration, split, or merge. I do feel there is room to improve, or even replace, the condensed tree for automatic cluster selection.
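As a rough illustration of "a minimum spanning tree to cut", the sketch below builds an MST with Prim's algorithm and labels the components left after dropping long edges. It uses plain Euclidean distance rather than mutual reachability, and the function names, toy points, and thresholds are hypothetical; it is not how the library itself selects clusters.

```python
import math

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph.

    Returns a list of (weight, i, j) edges, one per point after the first.
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge linking each point to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest point that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges

def cut_tree(n, edges, threshold):
    """Keep MST edges with weight <= threshold and label connected components."""
    root = list(range(n))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    for weight, i, j in edges:
        if weight <= threshold:
            root[find(i)] = find(j)
    ids = {}
    # Labels are assigned in order of first appearance: 0, 1, 2, ...
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]

# Hypothetical toy data: two small blobs with two noise points in between.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # left blob
          (1.0, 0.0), (2.0, 0.0),               # bridge noise
          (3.0, 0.0), (3.1, 0.0), (3.0, 0.1)]   # right blob

edges = minimum_spanning_tree(points)
tight = cut_tree(len(points), edges, threshold=0.5)
loose = cut_tree(len(points), edges, threshold=1.5)
print(tight)
print(loose)
```

With the tighter cut the two blobs stay separate and each bridge point becomes its own singleton; with the looser cut everything collapses into one component. A manual cut like this trades the automatic selection for an explicit distance threshold.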

@lmcinnes
Collaborator

lmcinnes commented Jul 8, 2016

Thanks for the compliments on the implementation. I admit that the automatic cluster selection may not be to everyone's taste, but it is a good default for a large number of cases. Since single_linkage_tree_ and condensed_tree_ are both exposed as attributes of the model after fitting, I feel that those who wish to do something else are able to do so.

On the other hand, if the question is one of sensitivity to the min_samples parameter (rather than min_cluster_size), I may have some answers there. I have been working on a different algorithm that essentially operates over all (or potentially just many) min_samples values and computes a total stability over the combined epsilon and min_samples space. This requires some significant rethinking of how to interpret the algorithm, and I've been drawing heavily from persistent homology (and more accurately persistent homotopy) theory to get something workable. There are still a number of details to hammer out and some work to be done to ensure the resulting algorithm really does return useful clusterings, but I believe it has significant promise.
