Merge pull request #124 from blab/refine-figure-aesthetics
Refine figure aesthetics
huddlej authored Aug 19, 2024
2 parents 04b9eab + a2afb79 commit afd1574
Showing 72 changed files with 880 additions and 307 deletions.
279 changes: 127 additions & 152 deletions ha-na-nextstrain/2022-02-23-seasonal-flu-ha-na-reassortment.ipynb

Large diffs are not rendered by default.

25 changes: 16 additions & 9 deletions manuscript/cartography.tex
@@ -238,7 +238,9 @@ \subsection{Embedding clusters recapitulate phylogenetic clades for seasonal inf
Tips in the tree and embeddings are colored by their Nextstrain clade assignment.
Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods).
Line colors represent the clade membership of the most ancestral node in the pair of nodes connected by the segment.
Line thickness scales by the square root of the number of leaves descending from a given node in the phylogeny.
Line thickness in the embeddings scales by the square root of the number of leaves descending from a given node in the phylogeny.
Clade labels appear in the tree at the earliest ancestral node of the tree for each clade.
Clade labels appear in each embedding at the average position on the x and y axis for sequences in a given clade.
}
\label{fig:seasonal-influenza-h3n2-ha-embeddings}
\end{figure}
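The caption above describes three plotting rules: internal nodes sit at the mean position of their immediate descendants, line thickness scales with the square root of the number of descending leaves, and clade labels sit at the average tip position per clade. The following is a minimal sketch of those rules, not the authors' plotting code. The function names, the `children` dictionary layout, and the `base_width` parameter are hypothetical, and the caption does not say which node of a segment supplies the leaf count.

```python
import math
from statistics import mean


def place_internal_nodes(children, tip_positions, root):
    """Place each internal node at the mean (x, y) of its immediate children.

    children: dict mapping an internal node name to a list of child names
              (hypothetical layout chosen for this sketch).
    tip_positions: dict mapping each tip name to its (x, y) embedding position.
    Returns (positions, leaf_counts) covering every node reachable from root.
    """
    positions = dict(tip_positions)
    leaf_counts = {tip: 1 for tip in tip_positions}

    # Build a preorder list iteratively; its reverse visits children first.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    for node in reversed(order):
        kids = children.get(node, [])
        if not kids:
            continue  # tips already have positions and a leaf count of 1
        positions[node] = (
            mean(positions[k][0] for k in kids),
            mean(positions[k][1] for k in kids),
        )
        leaf_counts[node] = sum(leaf_counts[k] for k in kids)
    return positions, leaf_counts


def line_width(n_leaves, base_width=0.5):
    """Scale line thickness by the square root of the descending leaf count."""
    return base_width * math.sqrt(n_leaves)


def clade_label_positions(tip_positions, clade_by_tip):
    """Put each clade label at the mean x and mean y of that clade's tips."""
    grouped = {}
    for tip, xy in tip_positions.items():
        grouped.setdefault(clade_by_tip[tip], []).append(xy)
    return {
        clade: (mean(x for x, _ in pts), mean(y for _, y in pts))
        for clade, pts in grouped.items()
    }
```

Note that the root position is the mean of its direct children only, so a child subtree with many tips carries the same weight as a single tip attached directly to the root.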
@@ -277,7 +279,7 @@ \subsection{Embedding clusters recapitulate phylogenetic clades for seasonal inf
\caption{{\bf Phylogenetic trees (left) and embeddings (right) of early (2016--2018) influenza H3N2 HA sequences colored by HDBSCAN cluster.}
Normalized VI values per embedding reflect the distance between clusters and known genetic groups (Nextstrain clades).
Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods).
Line thickness scales by the square root of the number of leaves descending from a given node in the phylogeny.}
Line thickness in the embeddings scales by the square root of the number of leaves descending from a given node in the phylogeny.}
\label{fig:seasonal-influenza-h3n2-ha-2016-2018-clusters}
\end{figure}
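The captions in this diff report normalized VI values as a distance between HDBSCAN clusters and known genetic groups. As a rough illustration, the sketch below computes the variation of information between two labelings and divides it by log(n), which is the general upper bound for n samples. The normalization the manuscript actually uses is not shown in this diff, so the divisor and the function name `normalized_vi` are assumptions.

```python
import math
from collections import Counter


def normalized_vi(labels_a, labels_b):
    """Variation of information between two labelings, divided by log(n).

    Returns 0.0 when the partitions are identical up to relabeling.
    Dividing by log(n) is an assumption made for this sketch.
    """
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("labelings must have the same length")
    if n < 2:
        return 0.0

    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    # VI = -sum_ab p_ab * [log(p_ab / p_a) + log(p_ab / p_b)]
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        vi -= p_ab * (math.log(p_ab / p_a) + math.log(p_ab / p_b))
    return vi / math.log(n)
```

Under this convention, lower values mean the clusters agree more closely with the reference groups, which matches how the captions describe the metric as a distance.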

@@ -299,7 +301,7 @@ \subsection{Embedding clusters recapitulate phylogenetic clades for seasonal inf
\caption{{\bf Phylogenetic trees (left) and embeddings (right) of late (2018--2020) H3N2 HA sequences colored by HDBSCAN cluster.}
Normalized VI values per embedding reflect the distance between clusters and known genetic groups (Nextstrain clades).
Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods).
Line thickness scales by the square root of the number of leaves descending from a given node in the phylogeny.}
Line thickness in the embeddings scales by the square root of the number of leaves descending from a given node in the phylogeny.}
\label{fig:seasonal-influenza-h3n2-ha-2018-2020-clusters}
\end{figure}

@@ -329,17 +331,21 @@ \subsection{Joint embeddings of hemagglutinin and neuraminidase genomes identify
Clusters from genetic distances improved the most with the addition of NA, from a normalized VI of 0.2 to 0.11 (Supplementary Table~\ref{S_Table_optimal_cluster_parameters}).
Embeddings with both genes produced more clusters for all methods than the HA-only embeddings with 3 additional clusters in PCA (Supplementary Fig.~S\ref{S_Fig_flu_ha_na_pca_embeddings}), 9 in MDS (Supplementary Fig.~S\ref{S_Fig_flu_ha_na_mds_embeddings}), 2 in t-SNE (Supplementary Fig.~S\ref{S_Fig_flu_ha_na_tsne_embeddings} and Supplementary Fig.~S\ref{S_Fig_flu_ha_na_tsne_mcc_counts}), 1 in UMAP (Supplementary Fig.~S\ref{S_Fig_flu_ha_na_umap_embeddings}), and 16 in genetic distance clusters (Supplementary Table~\ref{S_Table_optimal_cluster_parameters}).
All embeddings of HA/NA alignments produced distinct clusters for the known reassortment event within clade A2 \citep{Potter2019} as represented by MCCs 14 and 11.
Other larger events like those represented by MCCs 9 and 12 mapped far apart in all HA/NA embeddings except PCA.
Other pairs of larger reassortment events that occurred in the same part of the HA tree like MCCs 9 and 12 or MCCs 5 and 10 mapped farther apart in all HA/NA embeddings than in HA-only embeddings (Supplementary Fig.~S\ref{S_Fig_full_ha_na_embeddings}).
We noted that some of the additional clusters in HA/NA embeddings likely also reflected genetic diversity in NA that was independent of reassortment between HA and NA.
These results suggest that a single embedding of multiple gene segments could identify biologically meaningful clusters within and between all genes.

\begin{figure}[!h]
\includegraphics[width=0.9\columnwidth]{figures/flu-2016-2018-ha-na-embeddings-by-mcc.png}
\caption{{\bf Phylogeny of early (2016--2018) influenza H3N2 HA sequences plotted by nucleotide substitutions per site on the x-axis (top) and low-dimensional embeddings of the same HA sequences concatenated with matching NA sequences by PCA (middle left), MDS (middle right), t-SNE (bottom left), and UMAP (bottom right).}
Tips in the tree and embeddings are colored by their TreeKnit Maximally Compatible Clades (MCCs) label which represents putative HA/NA reassortment groups.
Tips from MCCs with fewer than 10 sequences are colored as ``unassigned''.
The first normalized VI values per embedding reflect the distance between HA/NA clusters and known genetic groups (MCCs).
VI values in parentheses reflect the distance between HA-only clusters and known genetic groups.
``A2'' and ``A2/re'' labels indicate a known reassortment event \citep{Potter2019}.
MCC labels appear in the tree and each embedding for larger pairs of reassortment events.
MCC 9 represents two Nextstrain clades, so its labels appear twice in the tree.
MCCs 14 and 11 represent a previously published reassortment event within Nextstrain clade A2 \citep{Potter2019}.
Labels for MCC 14 represent the subset of its sequences from clade A2.
}
\label{fig:seasonal-influenza-h3n2-ha-na-2016-2018-embeddings}
\end{figure}
@@ -365,7 +371,8 @@ \subsection{SARS-CoV-2 clusters recapitulate broad genetic groups corresponding
\caption{{\bf Phylogeny of early (2020--2022) SARS-CoV-2 sequences plotted by number of nucleotide substitutions from the most recent common ancestor on the x-axis (top) and low-dimensional embeddings of the same sequences by PCA (middle left), MDS (middle right), t-SNE (bottom left), and UMAP (bottom right).}
Tips in the tree and embeddings are colored by their Nextstrain clade assignment.
Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods).
Line thickness scales by the square root of the number of leaves descending from a given node in the phylogeny.
Line thickness in the embeddings scales by the square root of the number of leaves descending from a given node in the phylogeny.
Clade labels in the tree and embeddings highlight larger clades.
}
\label{fig:sars-cov-2-early-embeddings-by-Nextstrain-clade}
\end{figure}
@@ -405,13 +412,13 @@ \subsection{SARS-CoV-2 clusters recapitulate broad genetic groups corresponding
\caption{{\bf Phylogenetic trees (left) and embeddings (right) of early (2020--2022) SARS-CoV-2 sequences colored by HDBSCAN cluster.}
Normalized VI values per embedding reflect the distance between clusters and known genetic groups (Nextstrain clades).
Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods).
Line thickness scales by the square root of the number of leaves descending from a given node in the phylogeny.
Line thickness in the embeddings scales by the square root of the number of leaves descending from a given node in the phylogeny.
}
\label{fig:sars-cov-2-2020-2022-clusters-vs-Nextstrain-clade}
\end{figure}

To test the optimal cluster parameters identified above, we applied embedding methods to late SARS-CoV-2 data and compared clusters from these embeddings to the corresponding Nextstrain clades and Pango lineages.
Compared to the 18 Nextstrain clades defined in this time period, the closest clusters were from t-SNE (normalized VI=0.09 with 66 clusters) and UMAP (normalized VI=0.09 with 13 clusters, Fig.~\ref{fig:sars-cov-2-2022-2023-clusters-vs-Nextstrain-clade} and Supplementary Table~S\ref{S_Table_optimal_cluster_parameters}).
Compared to the 17 Nextstrain clades defined in this time period (Supplementary Fig.~S\ref{S_Fig_sarscov2_late_embeddings_by_Nextstrain_clade}), the closest clusters were from t-SNE (normalized VI=0.09 with 66 clusters) and UMAP (normalized VI=0.09 with 13 clusters, Fig.~\ref{fig:sars-cov-2-2022-2023-clusters-vs-Nextstrain-clade} and Supplementary Table~S\ref{S_Table_optimal_cluster_parameters}).
We attributed t-SNE's additional clusters to recombinant lineages that were genetically distinct but which received a generic ``recombinant'' label in Nextstrain's clade definitions instead of a unique clade name.
Only t-SNE, UMAP, and genetic distance clusters were fully monophyletic (Supplementary Table~S\ref{S_Table_monophyletic_clusters}).
Genetic distance, PCA, and t-SNE clusters were best supported by cluster-specific mutations with 16 of 17 clusters (94\%), 6 of 7 clusters (86\%), and 51 of 66 clusters (77\%), respectively (Supplementary Table~S\ref{S_Table_mutations_per_cluster}).
@@ -428,7 +435,7 @@ \subsection{SARS-CoV-2 clusters recapitulate broad genetic groups corresponding
\label{fig:sars-cov-2-2022-2023-clusters-vs-Nextstrain-clade}
\end{figure}

All methods produced less accurate representations of the 137 Pango lineages (Supplementary Fig.~S\ref{S_Fig_sarscov2_late_embeddings_by_cluster_vs_Nextclade_pango} and Supplementary Table~\ref{S_Table_optimal_cluster_parameters}).
All methods produced less accurate representations of the 137 Pango lineages (Supplementary Figs.~S\ref{S_Fig_sarscov2_late_embeddings_by_Pango} and S\ref{S_Fig_sarscov2_late_embeddings_by_cluster_vs_Nextclade_pango} and Supplementary Table~\ref{S_Table_optimal_cluster_parameters}).
However, t-SNE clusters were nearly as accurate with a normalized VI of 0.14, suggesting that t-SNE's numerous additional clusters likely did represent many of the recombinant Pango lineages in the dataset that all received a ``recombinant'' Nextstrain clade label.
Of the 80 recombinant Pango lineages that also had a t-SNE cluster, 79 (99\%) mapped to a single t-SNE cluster (Supplementary Fig.~S\ref{S_Fig_sarscov2_late_embeddings_tsne_recombinant_counts}).
Of the 52 t-SNE clusters with recombinant Pango lineages, 43 (83\%) mapped to a single Pango lineage.
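The two counts above (lineages that fall in a single cluster, and clusters that contain a single lineage) can be tallied from per-sequence assignments. The sketch below shows one way to do that. The function name, the record layout, and the choice to skip sequences with the noise label are assumptions for illustration, and the toy labels in the usage note are made up.

```python
from collections import defaultdict


def one_to_one_summary(records, noise_label=-1):
    """Summarize how lineages and clusters map onto each other.

    records: iterable of (lineage, cluster) pairs, one pair per sequence.
    noise_label: cluster label for unclustered sequences, which are skipped
                 (HDBSCAN uses -1; skipping them here is an assumption).
    """
    clusters_by_lineage = defaultdict(set)
    lineages_by_cluster = defaultdict(set)
    for lineage, cluster in records:
        if cluster == noise_label:
            continue
        clusters_by_lineage[lineage].add(cluster)
        lineages_by_cluster[cluster].add(lineage)

    return {
        "lineages_with_cluster": len(clusters_by_lineage),
        "lineages_in_one_cluster": sum(
            len(clusters) == 1 for clusters in clusters_by_lineage.values()
        ),
        "clusters_with_lineage": len(lineages_by_cluster),
        "clusters_with_one_lineage": sum(
            len(lineages) == 1 for lineages in lineages_by_cluster.values()
        ),
    }
```

For example, with the toy records `[("L1", 0), ("L1", 0), ("L2", 1), ("L2", 2), ("L3", 1), ("L4", -1)]`, three lineages have a cluster and two of them map to a single cluster, while three clusters contain a lineage and two of them contain exactly one.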
48 changes: 41 additions & 7 deletions manuscript/cartography_supplement.tex
@@ -86,7 +86,9 @@ \section*{Supplementary data}
\caption{{\bf MDS embeddings for early (2016--2018) influenza H3N2 HA sequences showing all three components.}
Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods).
Line colors represent the clade membership of the most ancestral node in the pair of nodes connected by the segment.
Line thickness scales by the square root of the number of leaves descending from a given node in the phylogeny.}\label{S_Fig_early_flu_mds_embeddings}
Line thickness in the embeddings scales by the square root of the number of leaves descending from a given node in the phylogeny.
Clade labels appear in the tree at the earliest ancestral node of the tree for each clade.
Clade labels appear in each embedding at the average position on the x and y axis for sequences in a given clade.}\label{S_Fig_early_flu_mds_embeddings}
\end{figure}

\begin{figure}[!h]
@@ -102,15 +104,19 @@ \section*{Supplementary data}
Tips in the tree and embeddings are colored by their Nextstrain clade assignment.
Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods).
Line colors represent the clade membership of the most ancestral node in the pair of nodes connected by the segment.
Line thickness scales by the square root of the number of leaves descending from a given node in the phylogeny.}\label{S_Fig_late_flu_embeddings_by_clade}
Line thickness in the embeddings scales by the square root of the number of leaves descending from a given node in the phylogeny.
Clade labels appear in the tree at the earliest ancestral node of the tree for each clade.
Clade labels appear in each embedding at the average position on the x and y axis for sequences in a given clade.}\label{S_Fig_late_flu_embeddings_by_clade}
\end{figure}

\begin{figure}[!h]
\includegraphics[width=\columnwidth]{figures/flu-2018-2020-mds-by-clade.png}
\caption{{\bf MDS embeddings for late (2018--2020) influenza H3N2 HA sequences showing all three components.}
Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods).
Line colors represent the clade membership of the most ancestral node in the pair of nodes connected by the segment.
Line thickness scales by the square root of the number of leaves descending from a given node in the phylogeny.}\label{S_Fig_late_flu_mds_embeddings}
Line thickness in the embeddings scales by the square root of the number of leaves descending from a given node in the phylogeny.
Clade labels appear in the tree at the earliest ancestral node of the tree for each clade.
Clade labels appear in each embedding at the average position on the x and y axis for sequences in a given clade.}\label{S_Fig_late_flu_mds_embeddings}
\end{figure}

\begin{figure}[!h]
@@ -124,7 +130,11 @@ \section*{Supplementary data}
\begin{figure}[!h]
\includegraphics[width=0.75\columnwidth]{figures/flu-2016-2018-ha-na-all-embeddings-by-mcc.png}
\caption{{\bf Embeddings of influenza H3N2 HA-only (left) and combined HA/NA (right) sequences showing the effects of additional NA genetic information on the placement of reassortment events detected by TreeKnit (MCCs).}
Normalized VI values quantify the degree to which the combination of HA and NA sequences in an embedding reduces the distance of embedding clusters to TreeKnit reassortment groups represented by MCCs.}\label{S_Fig_full_ha_na_embeddings}
Sequences from MCCs with fewer than 10 sequences are colored as ``unassigned''.
Normalized VI values quantify the degree to which the combination of HA and NA sequences in an embedding reduces the distance of embedding clusters to TreeKnit reassortment groups represented by MCCs.
MCC labels for larger pairs of reassortment events appear in each embedding at the average position on the x and y axis for sequences in a given MCC.
MCCs 14 and 11 represent a previously published reassortment event within Nextstrain clade A2 \citep{Potter2019}.
Labels for MCC 14 represent the sequences from clade A2.}\label{S_Fig_full_ha_na_embeddings}
\end{figure}

\begin{figure}[!h]
@@ -160,15 +170,17 @@ \section*{Supplementary data}
\includegraphics[width=\columnwidth]{figures/sarscov2-mds-by-Nextstrain_clade-clade.png}
\caption{{\bf MDS embeddings for early SARS-CoV-2 sequences showing all three components.}
Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods).
Line thickness scales by the square root of the number of leaves descending from a given node in the phylogeny.}\label{S_Fig_sarscov2_early_mds}
Line thickness in the embeddings scales by the square root of the number of leaves descending from a given node in the phylogeny.
Clade labels in the tree and embeddings highlight larger clades.}\label{S_Fig_sarscov2_early_mds}
\end{figure}

\begin{figure}[!h]
\includegraphics[width=\columnwidth]{figures/sarscov2-embeddings-by-Nextclade_pango_collapsed-clade.png}
\caption{{\bf Phylogeny of early (2020--2022) SARS-CoV-2 sequences plotted by number of nucleotide substitutions from the most recent common ancestor on the x-axis (top) and low-dimensional embeddings of the same sequences by PCA (middle left), MDS (middle right), t-SNE (bottom left), and UMAP (bottom right).}
Tips in the tree and embeddings are colored by their Pango lineage assignment.
Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods).
Line thickness scales by the square root of the number of leaves descending from a given node in the phylogeny.
Line thickness in the embeddings scales by the square root of the number of leaves descending from a given node in the phylogeny.
Clade labels in the tree and embeddings highlight larger Pango lineages.
}\label{S_Fig_sarscov2_early_embeddings_by_Nextclade_pango}
\end{figure}

@@ -184,10 +196,32 @@ \section*{Supplementary data}
\caption{{\bf Phylogenetic trees (left) and embeddings (right) of early (2020--2022) SARS-CoV-2 sequences colored by HDBSCAN cluster.}
Normalized VI values per embedding reflect the distance between clusters and known genetic groups (Pango lineages).
Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods).
Line thickness scales by the square root of the number of leaves descending from a given node in the phylogeny.
Line thickness in the embeddings scales by the square root of the number of leaves descending from a given node in the phylogeny.
}\label{S_Fig_sarscov2_early_embeddings_by_cluster_vs_Nextclade_pango}
\end{figure}

\begin{figure}[!h]
\includegraphics[width=0.9\columnwidth]{figures/sarscov2-test-embeddings-by-Nextstrain_clade-clade.png}
\caption{{\bf Phylogeny of late (2022--2023) SARS-CoV-2 sequences plotted by number of nucleotide substitutions from the most recent common ancestor on the x-axis (top) and low-dimensional embeddings of the same sequences by PCA (middle left), MDS (middle right), t-SNE (bottom left), and UMAP (bottom right).}
Tips in the tree and embeddings are colored by their Nextstrain clade assignment.
Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods).
Line thickness in the embeddings scales by the square root of the number of leaves descending from a given node in the phylogeny.
Clade labels in the tree and embeddings highlight larger clades.
}
\label{S_Fig_sarscov2_late_embeddings_by_Nextstrain_clade}
\end{figure}

\begin{figure}[!h]
\includegraphics[width=0.9\columnwidth]{figures/sarscov2-test-embeddings-by-Nextclade_pango_collapsed-clade.png}
\caption{{\bf Phylogeny of late (2022--2023) SARS-CoV-2 sequences plotted by number of nucleotide substitutions from the most recent common ancestor on the x-axis (top) and low-dimensional embeddings of the same sequences by PCA (middle left), MDS (middle right), t-SNE (bottom left), and UMAP (bottom right).}
Tips in the tree and embeddings are colored by their Pango lineage assignment.
Line segments in each embedding reflect phylogenetic relationships with internal node positions calculated from the mean positions of their immediate descendants in each dimension (see Methods).
Line thickness in the embeddings scales by the square root of the number of leaves descending from a given node in the phylogeny.
Clade labels in the tree and embeddings highlight larger Pango lineages.
}
\label{S_Fig_sarscov2_late_embeddings_by_Pango}
\end{figure}

\begin{figure}[!h]
\includegraphics[width=\columnwidth]{figures/sarscov2-test-replication-of-cluster-accuracy.png}
\caption{{\bf Replication of cluster accuracy per embedding method for late (2022--2023) SARS-CoV-2 sequences across different sampling densities (total sequences sampled) and sampling schemes including A) even geographic and temporal sampling and B) random sampling.}

Binary file modified manuscript/figures/flu-2016-2018-ha-embeddings-by-clade.png
Binary file modified manuscript/figures/flu-2016-2018-ha-embeddings-by-cluster.png
Binary file modified manuscript/figures/flu-2016-2018-ha-na-embeddings-by-mcc.png
2 changes: 1 addition & 1 deletion manuscript/figures/flu-2016-2018-ha-na-mds-by-cluster.html

Large diffs are not rendered by default.

Binary file modified manuscript/figures/flu-2016-2018-ha-na-mds-by-cluster.png
2 changes: 1 addition & 1 deletion manuscript/figures/flu-2016-2018-ha-na-pca-by-cluster.html

Large diffs are not rendered by default.

Binary file modified manuscript/figures/flu-2016-2018-ha-na-pca-by-cluster.png
Binary file modified manuscript/figures/flu-2016-2018-ha-na-tsne-by-cluster.png
Binary file modified manuscript/figures/flu-2016-2018-ha-na-tsne-mcc-counts.png
Binary file modified manuscript/figures/flu-2016-2018-ha-na-umap-by-cluster.png
2 changes: 1 addition & 1 deletion manuscript/figures/flu-2016-2018-mds-by-clade.html

Large diffs are not rendered by default.

Binary file modified manuscript/figures/flu-2016-2018-mds-by-clade.png
Binary file modified manuscript/figures/flu-2018-2020-ha-embeddings-by-clade.png
Binary file modified manuscript/figures/flu-2018-2020-ha-embeddings-by-cluster.png
2 changes: 1 addition & 1 deletion manuscript/figures/flu-2018-2020-mds-by-clade.html

Large diffs are not rendered by default.

Binary file modified manuscript/figures/flu-2018-2020-mds-by-clade.png