From 8502a442f777b9292244398015a30e20b0f041ed Mon Sep 17 00:00:00 2001
From: John Huddleston <huddlej@gmail.com>
Date: Mon, 12 Aug 2024 14:43:46 -0700
Subject: [PATCH] Clarify the source of outliers

---
 manuscript/cartography.tex | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/manuscript/cartography.tex b/manuscript/cartography.tex
index c6ecfde3..267942ac 100644
--- a/manuscript/cartography.tex
+++ b/manuscript/cartography.tex
@@ -596,12 +596,13 @@ \subsection{Selection of natural virus population data}
 
 For analyses that focused only on H3N2 HA data, we defined the early dataset between January 2016 and January 2018 and the late dataset between January 2018 to January 2020.
 These datasets reflected two years of recent H3N2 evolution up to the time when the SARS-CoV-2 pandemic disrupted seasonal influenza circulation.
-For both early and late datasets, we evenly sampled 25 sequences per country, year, and month, excluding known outliers.
+For both early and late datasets, we evenly sampled 25 sequences per country, year, and month.
+We excluded outliers, which we defined as sequences that were labeled as environmental samples, contained over 100 gap characters within the HA sequence, or were flagged by TreeTime \citep{Sagulenko2018} for having a phylogenetic divergence that exceeded four times the interquartile interval of residuals from a root-to-tip regression for all sequences in the same tree.
 With this sampling scheme, we selected 1,523 HA sequences for the early dataset and 1,073 for the late dataset.
 For analyses that combined H3N2 HA and NA data, we defined a single dataset between January 2016 and January 2018, keeping 1,607 samples for which both HA and NA have been sequenced.
 
 For SARS-CoV-2 data, we defined the early dataset between January 1, 2020 and January 1, 2022 and the late dataset between January 1, 2022 and November 3, 2023.
-For the early dataset, we evenly sampled 1,736 SARS-CoV-2 genomes by geographic region, year, and month, excluding known outliers.
+For the early dataset, we evenly sampled 1,736 SARS-CoV-2 genomes by geographic region, year, and month, excluding known outliers that had been previously identified by the Nextstrain team during weekly phylogenetic surveillance since January 2020 (\url{https://github.com/nextstrain/ncov/blob/master/defaults/exclude.txt}).
 For the late dataset, we used the same even sampling by space and time to select 1,309 representative genomes.
 In addition to these genomes, we identified all recombinant lineages in the official Pango designations as of November 3, 2023 (\url{https://github.com/cov-lineages/pango-designation/raw/1bf4123/pango_designation/alias_key.json}) for which the recombinant lineage and both of its parental lineages had at least 10 genome records each.
 We sampled at most 10 genomes per lineage for all distinct recombinant and parental lineages for a total of 1,157.