From 63f6779baccbf58e40caf10193a82af46b149a19 Mon Sep 17 00:00:00 2001
From: John Huddleston
Date: Mon, 19 Aug 2024 14:58:32 -0700
Subject: [PATCH] Clarify discussion of "recombinant" samples

---
 manuscript/cartography.tex            | 3 ++-
 manuscript/cartography_supplement.tex | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/manuscript/cartography.tex b/manuscript/cartography.tex
index 19e55fbe..96d6a49b 100644
--- a/manuscript/cartography.tex
+++ b/manuscript/cartography.tex
@@ -419,7 +419,8 @@ \subsection{SARS-CoV-2 clusters recapitulate broad genetic groups corresponding
 
 To test the optimal cluster parameters identified above, we applied embedding methods to late SARS-CoV-2 data and compared clusters from these embeddings to the corresponding Nextstrain clades and Pango lineages.
 Compared to the 17 Nextstrain clades defined in this time period (Supplementary Fig.~S\ref{S_Fig_sarscov2_late_embeddings_by_Nextstrain_clade}), the closest clusters were from t-SNE (normalized VI=0.09 with 66 clusters) and UMAP (normalized VI=0.09 with 13 clusters, Fig.~\ref{fig:sars-cov-2-2022-2023-clusters-vs-Nextstrain-clade} and Supplementary Table~S\ref{S_Table_optimal_cluster_parameters}).
-We attributed t-SNE's additional clusters to recombinant lineages that were genetically distinct but which received a generic ``recombinant'' label in Nextstrain's clade definitions instead of a unique clade name.
+We attributed t-SNE's additional clusters to recombinant lineages that were genetically distinct but which received a generic ``recombinant'' label in Nextstrain's clade definitions instead of a unique clade name (Supplementary Fig.~S\ref{S_Fig_sarscov2_late_embeddings_by_Nextstrain_clade}).
+Although we did not consider these non-monophyletic recombinant samples when calculating VI distances between clusters and Nextstrain clades, these samples appear in each embedding where they could form their own distinct clusters.
 Only t-SNE, UMAP, and genetic distance clusters were fully monophyletic (Supplementary Table~S\ref{S_Table_monophyletic_clusters}).
 Genetic distance, PCA, and t-SNE clusters were best supported by cluster-specific mutations with 16 of 17 clusters (94\%), 6 of 7 clusters (86\%), and 51 of 66 clusters (77\%), respectively (Supplementary Table~S\ref{S_Table_mutations_per_cluster}).
 Clusters from t-SNE had the lowest average within-group distances (Supplementary Fig.~S\ref{S_Fig_sarscov2_within_between_group_distances}).
diff --git a/manuscript/cartography_supplement.tex b/manuscript/cartography_supplement.tex
index 45ecc950..167f4255 100644
--- a/manuscript/cartography_supplement.tex
+++ b/manuscript/cartography_supplement.tex
@@ -204,6 +204,7 @@ \section*{Supplementary data}
  \includegraphics[width=0.9\columnwidth]{figures/sarscov2-test-embeddings-by-Nextstrain_clade-clade.png}
  \caption{{\bf Phylogeny of late (2022--2023) SARS-CoV-2 sequences plotted by number of nucleotide substitutions from the most recent common ancestor on the x-axis (top) and low-dimensional embeddings of the same sequences by PCA (middle left), MDS (middle right), t-SNE (bottom left), and UMAP (bottom right).}
  Tips in the tree and embeddings are colored by their Nextstrain clade assignment.
+ Tips that could not be assigned to a predefined Nextstrain clade due to recombination were colored as ``recombinant''.
  Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods).
  Line thickness in the embeddings scales by the square root of the number of leaves descending from a given node in the phylogeny.
  Clade labels in the tree and embeddings highlight larger clades.