Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use gene reference files to generate E gene trees #48

Merged
merged 8 commits into from
May 13, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion phylogenetic/Snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ wildcard_constraints:

rule all:
input:
auspice_json = expand("auspice/dengue_{serotype}_{gene}.json", serotype=serotypes, gene='genome'),
auspice_json = expand("auspice/dengue_{serotype}_{gene}.json", serotype=serotypes, gene=genes),

include: "rules/prepare_sequences.smk"
include: "rules/prepare_sequences_E.smk"
Expand Down
7 changes: 6 additions & 1 deletion phylogenetic/config/color_orderings.tsv
Original file line number Diff line number Diff line change
Expand Up @@ -235,15 +235,20 @@ recency New

################

ncbi_serotype denv1
ncbi_serotype denv2
ncbi_serotype denv3
ncbi_serotype denv4

nextclade_subtype DENV1/I
nextclade_subtype DENV1/II
nextclade_subtype DENV1/III
nextclade_subtype DENV1/IV
nextclade_subtype DENV1/V
nextclade_subtype DENV2/AM
nextclade_subtype DENV2/AA
nextclade_subtype DENV2/AI
nextclade_subtype DENV2/AII
nextclade_subtype DENV2/AM
nextclade_subtype DENV2/C
nextclade_subtype DENV2/S
nextclade_subtype DENV3/I
Expand Down
14 changes: 8 additions & 6 deletions phylogenetic/config/config_dengue.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,9 @@ display_strain_field: "strain"
filter:
exclude: "config/exclude.txt"
group_by: "year region"
min_length: 5000
min_length:
genome: 5000
E: 1000
sequences_per_group:
all: '10'
denv1: '36'
Expand All @@ -15,11 +17,11 @@ filter:
traits:
sampling_bias_correction: '3'
traits_columns:
all: 'region nextclade_subtype'
denv1: 'country region nextclade_subtype'
denv2: 'country region nextclade_subtype'
denv3: 'country region nextclade_subtype'
denv4: 'country region nextclade_subtype'
all: 'region ncbi_serotype nextclade_subtype'
denv1: 'country region ncbi_serotype nextclade_subtype'
denv2: 'country region ncbi_serotype nextclade_subtype'
denv3: 'country region ncbi_serotype nextclade_subtype'
denv4: 'country region ncbi_serotype nextclade_subtype'

clades:
clade_definitions:
Expand Down
10 changes: 5 additions & 5 deletions phylogenetic/rules/annotate_phylogeny.smk
Original file line number Diff line number Diff line change
Expand Up @@ -44,7 +44,7 @@ rule translate:
input:
tree = "results/{gene}/tree_{serotype}.nwk",
node_data = "results/{gene}/nt-muts_{serotype}.json",
reference = "config/reference_{serotype}_genome.gb"
reference = lambda wildcard: "config/reference_{serotype}_genome.gb" if wildcard.gene in ['genome'] else "results/config/reference_{serotype}_{gene}.gb"
output:
node_data = "results/{gene}/aa-muts_{serotype}.json"
shell:
Expand Down Expand Up @@ -85,12 +85,12 @@ rule traits:
rule clades:
"""Annotating serotypes / genotypes"""
input:
tree = "results/{gene}/tree_{serotype}.nwk",
nt_muts = "results/{gene}/nt-muts_{serotype}.json",
aa_muts = "results/{gene}/aa-muts_{serotype}.json",
tree = "results/genome/tree_{serotype}.nwk",
nt_muts = "results/genome/nt-muts_{serotype}.json",
aa_muts = "results/genome/aa-muts_{serotype}.json",
clade_defs = lambda wildcards: config['clades']['clade_definitions'][wildcards.serotype],
output:
clades = "results/{gene}/clades_{serotype}.json"
clades = "results/genome/clades_{serotype}.json"
shell:
"""
augur clades \
Expand Down
12 changes: 8 additions & 4 deletions phylogenetic/rules/export.smk
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,7 @@ rule prepare_auspice_config:
output:
auspice_config="results/config/{gene}/auspice_config_{serotype}.json",
params:
replace_clade_key="clade_membership",
replace_clade_key=lambda wildcard: r"clade_membership" if wildcard.gene in ['genome'] else r"nextclade_subtype",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the clade_membership is automatically used to create the clade branch label in augur export, this change means the E gene build will not have the automatic clade branch label.

It's possible to create custom branch labels since nextstrain/augur#728, but this uses augur clades to create the labels and the workflow is skipping augur clades for the E gene build 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the nextclade_subtype field is being added through augur traits, maybe augur traits should be updated to support adding branch labels as well...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know! I'm currently planning to explore serotype and genotype-defining E mutations (which would create clade_membership for E gene builds) in a future PR.

But if that doesn't work out, I'll explore "create custom branch labels ..." route you've referenced.

replace_clade_title=lambda wildcard: r"Serotype" if wildcard.serotype in ['all'] else r"DENV genotype",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To mirror the title NCBI serotype added below, maybe the default clade_membership title should be updated to Nextstrain serotype?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for flagging! I'm considering renaming NCBI serotype toSerotype (NCBI) to better match naming convention used in measle's Genotype (NCBI) title. Then, I agree that the default title for clade_membership should be renamed from Serotype to Serotype (Nextstrain).

To recap:

  • NCBI serotype -> Serotype (NCBI): indicating that denv1-4 assignment is based on NCBI GenBank record annotation
  • Serotype -> Serotype (Nextstrain): indicating that denv1-4 assignment is based on augur clades call using full-genome-level-serotype-defining amino acid mutations
  • Nextclade genotype -> Genotype (Nextclade): indicating genotype level assignment within the serotype (e.g. DENV1/S) based on Nextclade call

This naming adjustment leaves space for a potential Genotype (NCBI) if we develop a script for parsing genotype annotations from the GenBank data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, actually I think I'll merge this pull request as it is, and handle the renaming mentioned above in a later PR. That way, we can address it along with fixing this issue: #41.

run:
data = {
Expand Down Expand Up @@ -81,6 +81,11 @@ rule prepare_auspice_config:
"key": "nextclade_subtype",
"title": "Nextclade genotype",
"type": "categorical"
},
{
"key": "ncbi_serotype",
"title": "NCBI serotype",
"type": "categorical"
}
],
"geo_resolutions": [
Expand All @@ -89,8 +94,7 @@ rule prepare_auspice_config:
],
"display_defaults": {
"map_triplicate": True,
"color_by": params.replace_clade_key,
"distance_measure": "div"
"color_by": params.replace_clade_key
},
"filters": [
"country",
Expand All @@ -113,7 +117,7 @@ rule export:
metadata = "data/metadata_{serotype}.tsv",
branch_lengths = "results/{gene}/branch-lengths_{serotype}.json",
traits = "results/{gene}/traits_{serotype}.json",
clades = "results/{gene}/clades_{serotype}.json",
clades = lambda wildcard: "results/{gene}/clades_{serotype}.json" if wildcard.gene in ['genome'] else [],
nt_muts = "results/{gene}/nt-muts_{serotype}.json",
aa_muts = "results/{gene}/aa-muts_{serotype}.json",
auspice_config = "results/config/{gene}/auspice_config_{serotype}.json",
Expand Down
6 changes: 3 additions & 3 deletions phylogenetic/rules/prepare_sequences.smk
Original file line number Diff line number Diff line change
Expand Up @@ -52,15 +52,15 @@ rule filter:
- excluding strains with missing region, country or date metadata
"""
input:
sequences = "data/sequences_{serotype}.fasta",
sequences = lambda wildcard: "data/sequences_{serotype}.fasta" if wildcard.gene in ['genome'] else "results/{gene}/sequences_{serotype}.fasta",
metadata = "data/metadata_{serotype}.tsv",
exclude = config["filter"]["exclude"],
output:
sequences = "results/{gene}/filtered_{serotype}.fasta"
params:
group_by = config['filter']['group_by'],
sequences_per_group = lambda wildcards: config['filter']['sequences_per_group'][wildcards.serotype],
min_length = config['filter']['min_length'],
min_length = lambda wildcard: config['filter']['min_length'][wildcard.gene],
strain_id = config.get("strain_id_field", "strain"),
shell:
"""
Expand All @@ -83,7 +83,7 @@ rule align:
"""
input:
sequences = "results/{gene}/filtered_{serotype}.fasta",
reference = "config/reference_{serotype}_genome.gb"
reference = lambda wildcard: "config/reference_{serotype}_genome.gb" if wildcard.gene in ['genome'] else "results/config/reference_{serotype}_{gene}.gb"
output:
alignment = "results/{gene}/aligned_{serotype}.fasta"
shell:
Expand Down