txc extremely low with MDS and still low with tsne #127
Comments
Hi, glad that you are finding the package useful! Yes, I think it's fair to interpret the mapping with the highest txc score as being the most faithful representation of the true tree-to-tree distances. This doesn't necessarily mean that it will also do the best job of capturing all of the clustering structure, as it seems you have found. Clustering structure has many aspects (volume of clusters, their spatial relationships, distance between clusters, homogeneity of point density within clusters...), and different mappings may portray some of these aspects better than others. If you are calculating the silhouette coefficient from the original tree-to-tree distances, it should be a true reflection of the distinctness of the clusters. If you are calculating it from the mapped distances, it will show how much of the clustering structure survived the mapping process, which is likely to be much less than the true structure.
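To make concrete what a score of this kind measures, here is a minimal sketch of the standard rank-based trustworthiness and continuity measures. It is written in Python with NumPy purely for illustration and is not the package's R code. It assumes that "txc" is the product of the two measures (as the name "trustworthiness x continuity" suggests); the function names and the neighbourhood size `k` are hypothetical choices, and the normalisation used is only valid for `k < n / 2`.

```python
import numpy as np


def neighbour_ranks(dist):
    """ranks[i, j] = position of j among i's neighbours (1 = nearest, self = 0)."""
    d = np.array(dist, dtype=float)
    np.fill_diagonal(d, -np.inf)  # guarantee each point ranks itself first
    order = np.argsort(d, axis=1, kind="stable")
    n = d.shape[0]
    ranks = np.empty((n, n), dtype=int)
    ranks[np.arange(n)[:, None], order] = np.arange(n)[None, :]
    return ranks


def trustworthiness(original, mapped, k):
    """Penalise points that are among the k nearest in the mapping but not originally."""
    r_orig = neighbour_ranks(original)
    r_map = neighbour_ranks(mapped)
    n = r_orig.shape[0]
    intruders = (r_map >= 1) & (r_map <= k) & (r_orig > k)
    penalty = (r_orig[intruders] - k).sum()
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))


def continuity(original, mapped, k):
    """Penalise original k-nearest neighbours that the mapping pushed apart."""
    return trustworthiness(mapped, original, k)


def txc(original, mapped, k):
    """Assumed definition: product of trustworthiness and continuity."""
    return trustworthiness(original, mapped, k) * continuity(original, mapped, k)
```

Both arguments are square distance matrices over the same points: `original` would hold the tree-to-tree distances and `mapped` the pairwise Euclidean distances between the 2D coordinates. A mapping that preserves every neighbourhood scores 1, and the score drops as neighbourhoods are scrambled.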
It is on the original tree-to-tree distances (Clustering Information Distance). So that would suggest that my clusters are in fact not distinct? That doesn't make a lot of sense to me.
The next thing I'd look at would be the silhouette score of individual points, using cluster::silhouette.
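For readers unfamiliar with per-point silhouette widths, the following is a small sketch of the standard definition, computed from a precomputed distance matrix and a vector of known cluster labels. It is written in Python with NumPy for illustration only and is not a substitute for `cluster::silhouette`; the function name `silhouette_widths` is hypothetical, and at least two clusters are assumed.

```python
import numpy as np


def silhouette_widths(dist, labels):
    """Per-point silhouette widths s(i) = (b - a) / max(a, b).

    a: mean distance from point i to the other members of its own cluster.
    b: smallest mean distance from point i to the members of any other cluster.
    """
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    widths = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton cluster: width left at 0 by convention
        # dist[i, i] is 0, so dividing by n_own - 1 excludes the point itself
        a = dist[i, own].sum() / (n_own - 1)
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        widths[i] = (b - a) / denom if denom > 0 else 0.0
    return widths
```

Widths near 1 mark points sitting firmly inside their cluster, widths near 0 mark points on a boundary, and negative widths mark points that are on average closer to another cluster than to their own. Looking at the widths per cluster can show whether a low overall coefficient comes from a few overlapping clusters or from all of them.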
First, thanks for this extremely useful package and detailed explanations on tree metrics and visualizations in treespace as well as assessments of those visualizations.
I have a question about the relationship between assessment metrics like the silhouette coefficient and trustworthiness x continuity score and the properties of the trees.
I have a set of 220 trees (well, actually more like 11,000, but computing the distances was very slow so I just subsampled 220 of them) that I know fall into 11 different clusters. I wanted to visualize them in 2D tree space to get an understanding of how the clusters of trees differ, so I computed the Clustering Information Distance, based on the recommendation outlined in the tree space analysis vignette. I then plotted the trees with PCoA (with `cmdscale`) as well as a tSNE (with `Rtsne`), and additionally a UMAP (with `uwot`) just to see. The PCoA looked quite good, the tSNE looked interesting, and the UMAP looked rather similar to the tSNE when I modified the spread parameter.
However, the txc score for both the PCoA and the tSNE is very low, well below 0.9, although the txc score for the tSNE is somewhat higher. Additionally, I tried the silhouette coefficient calculations outlined in the vignette, and got silhouette coefficients below 0.15. I am not really trying to cluster my trees since I have the clusters a priori, but I thought the coefficient would be higher to show that there is meaningful structure.
Could you help me understand what could be possible reasons for this? I am fairly certain the clusters have distinct features separating them. Each tree has ~1200 tips, so I was wondering if larger trees could result in lower scores due to the exploding number of possible topologies. I am happy to send you either the phylogenies or the distance matrix as well if it would be helpful.
PCoA and txc (colors are different for different known clusters)
tSNE and txc: