Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nextclade assignment #16

Merged
merged 11 commits into from
Feb 24, 2024
Merged

Nextclade assignment #16

merged 11 commits into from
Feb 24, 2024

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Dec 5, 2023

Description of proposed changes

This PR is a continuation of breaking out the dengue ingest changes PR#6 into more focused and manageable pull requests. Following the merge of copy ingest, the focus of this PR is to split the dengue sequences by serotype (e.g. sequences_denv1.fasta to sequences_denv4.fasta).

Previously for dengue, we had been fetching each serotype separately in this code, leading to redundant fetching and processing of each sequence. Additionally, this method posed the risk of missing sequences in individual serotype builds if NCBI did not annotate the record with the lineage ID.

Therefore, as an alternative, this PR proposes to leverage Nextclade assignment to categorize the records into the major dengue serotypes, and subsequent subtypes.

The primary scope of this PR includes:

  • Adding a Nextclade dataset (v2)
  • Splitting dengue sequences into denv1 to denv4 based on assignment
  • Adding Nextclade subtype assignments to the metadata files

Related issue(s)

Checklist

  • Checks pass
  • Manual check
git clone https://github.com/nextstrain/dengue.git
cd dengue
git checkout nextclade_assignment
nextstrain build ingest results/metadata_denv4.tsv

@j23414 j23414 force-pushed the nextclade_assignment branch 2 times, most recently from 9ae8383 to e638db2 Compare December 6, 2023 21:13
@j23414 j23414 requested a review from a team December 12, 2023 22:59
@j23414 j23414 marked this pull request as ready for review December 12, 2023 22:59
@j23414 j23414 force-pushed the nextclade_assignment branch from e638db2 to 3321925 Compare December 12, 2023 23:11
@j23414 j23414 force-pushed the nextclade_assignment branch from ceae8c5 to 5205fd0 Compare January 2, 2024 21:21
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't done an in-depth review yet but wanted to ask about the Nextclade dataset in general:

  1. Should we start with V3 datasets instead of V2 so we don't have worry about migrating them later?
  2. Should the dengue datasets be added to https://github.com/nextstrain/nextclade_data and then pulled down through the usual nextclade commands?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point.

I'll run a test conversion to V3 to see if any errors occur, similar to https://github.com/j23414/test_convert_v3 . Happy to pair-code on this, but no pressure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a quick attempt:

python convert_v2_to_v3.py \
  --input-dir ~/github/nextstrain/dengue_branches/nextclade_assignment/nextclade_data/denv1 \
  --output-dir dengue_v3/denv1

Wonder what's causing "valueFriendly" errors (nextclade_v3/dengue/denv1)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonder what's causing "valueFriendly" errors

Do you see errors when running Nextclade v3 via the CLI instead of through the test web UI? I suspect the dataset-url param has not been updated to support V3 datasets (yet??). Actually, they probably will not support the dataset-url param going forward considering they are asking people to directly submit datasets to the nextclade_data repo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Different error, I went and converted the "dengue/all"

 nextclade3 run \
   --input-dataset ../nextclade_data/all \  #<====== This is created with the convert_v2_to_v3.py script
  -j 4 \
  --output-tsv data/nextclade_results/nextclade_all.tsv \
  --min-length 1000 \
  --silent \
  data/sequences.fasta

And hitting:

Error: 
   0: When preprocessing Nextclade graph
   1: When retrieving nuc mutations from reference tree node NC_002640.1
   2: Encountered a mutation (T447C) in reference tree branch attributes, for which the origin state of the mutation is inconsistent with the state at the parental branch. Mutations origin state is 'T', but tree (inferred from the reference sequence as no prior mutations were observed at this position) has state 'C'. This is likely an inconsistency between reference tree and reference sequence in the Nextclade dataset. Reference sequence should either correspond to the root of the reference tree or the root of the reference tree needs to account for difference between the tree and reference sequence. Please check that your reference tree is consistent with your reference sequence.

Location:
   packages_rs/nextclade/src/tree/tree_preprocess.rs:145

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message makes me think the reference tree is somehow violating one of the requirements:

The tree must be rooted at the sample that matches the reference (root) sequence.

You mentioned separately that

dengue (multiple spillovers, long diversity, there is no root strain for all 4 major spillovers) is probably problematic for nextclade v3.

Maybe @corneliusroemer or @ivan-aksamentov can comment on what's the best way to validate this dataset and whether there needs to be some adjustments specifically for Dengue.

Copy link
Member

@ivan-aksamentov ivan-aksamentov Jan 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the troubles! Dataset creation is still a bit rough at edges. I'll mostly provide technical help and let scientists to help with the science.

Wonder what's causing "valueFriendly" errors

The reason for this error is simply that v3 is not released and so we've been breaking things in the dataset formats lately without telling anyone :). You used the latest conversion script from master branch of data repo, but also a random old version of Nextclade Web from a couple of months ago (the "UI2" prototype, https://nextclade-git-ui-2-nextstrain.vercel.app/), and the dataset format is simply incompatible between those versions.

We are pretty much done with the changes and we plan to make the v3 release soon. You can use https://master.clades.nextstrain.org for testing latest Web features in the meantime - this is a snapshot of master branch. And I released 3.0.0-alpha.2 of the CLI yesterday. Likely the last one.

The master deployment of Nextclade Web should show the same error as the recent Nextclade CLI:
https://master.clades.nextstrain.org/?dataset-url=https://github.com/j23414/test_convert_v3/tree/main/dengue_v3/denv1

Until the release, please also refer to the "latest" version of docs, sourced from master branch (where there's v3), rather than "stable" version sourced from release branch (where there's still v2):
https://docs.nextstrain.org/projects/nextclade/en/latest/

Copy link
Member

@ivan-aksamentov ivan-aksamentov Jan 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2: Encountered a mutation (T447C) in reference tree branch attributes, for which the origin state of the mutation is inconsistent with the state at the parental branch [...]

This is an error from the new tree placement algo. It now has some simple tree building capabilities (it can migrate new nodes deeper into the tree when appropriate, rather than just appending them as leaves). Here the new algo tried to calculate mutations resulting from adjusting node placement, but cannot make sense of the mutations on the ref tree, and cannot proceed.

As stated in the error and as Jover mentioned, it is likely due to a mismatch between the ref sequence fasta and the sequence for root node in the tree JSON. Augur usually provides consistent trees. However, if you then feeding a different ref sequence to Nextclade, the root and ref don't match - the evolution breaks (this is my simplified, perhaps limited, understanding as an engineer). A workaround is to add mutations between ref seq and root node seq to the root node JSON. I think Richard and Cornelius have some scripts for that in their dataset workflow repos. We need some scientific expertise there.

Although it worked in v2, it was likely silently producing incorrect results. Nextclade has no sufficient information to catch the inconsistency between the ref and the root, so the only thing we do is we warn users in the docs and shift responsibility to handle this correctly onto them. In v3 however, this can sometimes manifest if the mutations are inconsistent and it breaks the algorithm. Thus the error.

We did not have many custom datasets, and most datasets were created by Richard and Cornelius, so this was not a critical issue in v2. In v3 we actively encourage people to create datasets, so I am wondering what we can do for dataset authors to catch this earlier, during dataset creation, rather than during Nextclade run. Perhaps we can have a script in the data repo to check the ref tree JSON? Any ideas/help is very welcome here.

If you absolutely have no time to deal with this now, you can disable tree builder in CLI with --without-greedy-tree-builder flag. This will fall back to the v2 placement algo. And you can revisit this later. But, as I mentioned, it might produce results that are not 100% correct, so this is not recommended in general. It is also not currently possible to disable tree builder in the Web version.

Copy link
Member

@ivan-aksamentov ivan-aksamentov Jan 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

directly submit datasets to the nextclade_data repo.

You are very welcome to submit!

will not support the dataset-url param going forward

Nextclade v3 will retain support of the dataset-url and dataset-server as the ways to fetch external datasets! It is important for 3rd party integrations (other apps linking to Nextclade). Also handy for dataset testing, just as you are doing it - it is an intended use-case :).

We also extended the format of accepted URLs - you can use the fancy shortcuts for URLs on GitHub - they are a bit easier to write by hand :) Although in this case it does not make it much shorter
https://master.clades.nextstrain.org/?dataset-url=gh:j23414/test_convert_v3@main@dengue_v3/denv1

You can also omit the branch if it's the default branch:
https://master.clades.nextstrain.org/?dataset-url=gh:j23414/test_convert_v3/dengue_v3/denv1

P.S. Don't hesitate to ping me on Slack or on GitHub for any questions!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few points/questions reading through this thread:

  1. I'd probably not do the v2 -> v3 conversion using the script but rather use an existing v2 dataset with v2 nextclade and make a new v3 compliant dataset in a v3 dataset workflow.
  2. I don't know much about Dengue, it may make sense to have one dataset for each serotype, and maybe one "all serotypes" dataset in addition. I would think it should be possible to align different serotypes against each other but the result might contain a lot of indels and mutations making it hard to interpret in web UI
  3. There's a new v3 subcommand split that should be helpful here to split by serotypes. But unfortunately it requires creating a minimizer index for which the tooling is a bit hidden at the moment. For now it might still be easiest to make a single "all-serotypes" dataset and use that to split into subtypes.
  4. The inconsistency error you get because we require the genotype on the root branch to be the same as the reference. If the tree root is X and the reference is Y, you need to put all mutations from Y to X on the first branch leading to the tree root. Augur ancestral does this automatically now if you give it the reference as --root-sequence (the terminology here is confusing, root-sequence should really be called reference-sequence), see here for how the mpox Nextclade workflow does it: https://github.com/nextstrain/mpox/blob/master/nextclade/Snakefile#L343C1-L370C12 If this is a struggle, you can start by having the root and reference be the same.

@j23414 j23414 force-pushed the nextclade_assignment branch 2 times, most recently from 3425884 to 9b9a10d Compare January 9, 2024 21:49
@corneliusroemer
Copy link
Member

Looking at the PR, given that you already have working v2 datasets, I don't think it would be unreasonable to continue using nextclade2 for now (you can add the 2 to future-proof).

As the purpose of this PR is to solely split into serotypes, there's no need for v3 features. Transitioning from 2 -> 3 can happen once this is merged to reduce complexity.

@j23414
Copy link
Contributor Author

j23414 commented Feb 14, 2024

Rebased after switching to relying on NCBI virus-tax-id to split serotypes.

The subsequent nextclade subtype classification looks much cleaner.

Screenshot 2024-02-14 at 11 27 22 AM

Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excited to see the dengue subtype assignments through Nextclade!

I'm less experienced on Nextclade datasets themselves, so I focused on the workflow changes. We can discuss the Nextclade dataset specifics in #25.

Comment on lines 12 to 13
min_length=1000, # E gene length is approximately 1400
min_seed_cover=0.1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these flags still need to be provided even through they are defined within the Nextclade datasets under alignmentParams?

If we need to keep them or if this is a means for users to override the default params, I recommend moving these to configs that can be easily changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct that these flags are equivalent to the alignmentParams and therefore do not need to be provided. Nonetheless, I'll move these params to the config file (a9b4d01) allowing us to refine them further in accordance with the present diversity in denegue sequences (for genome length and E gene length).

Subsequently, we can update the pathogen.json file once a definitive value is established.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, that's good to know! So once they've been settled in pathogen.json then we can remove them from the config and the workflow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly :D

ingest/config/optional.yaml Outdated Show resolved Hide resolved
Comment on lines 43 to 104
"""
echo "genbank_accession,nextclade_subtype,nextclade_type" \
| tr ',' '\t' \
> results/nextclade_subtype.tsv

tsv-select -H -f "seqName,clade" {input.nextclade_denv1} \
| awk 'NR>1 {{print $0"\tDENV1"}}' \
>> results/nextclade_subtype.tsv
tsv-select -H -f "seqName,clade" {input.nextclade_denv2} \
| awk 'NR>1 {{print $0"\tDENV2"}}' \
>> results/nextclade_subtype.tsv
tsv-select -H -f "seqName,clade" {input.nextclade_denv3} \
| awk 'NR>1 {{print $0"\tDENV3"}}' \
>> results/nextclade_subtype.tsv
tsv-select -H -f "seqName,clade" {input.nextclade_denv4} \
| awk 'NR>1 {{print $0"\tDENV4"}}' \
>> results/nextclade_subtype.tsv

tsv-join -H \
--filter-file results/nextclade_subtype.tsv \
--key-fields genbank_accession \
--append-fields 'nextclade_subtype,nextclade_type' \
--write-all ? \
{input.metadata} \
> {output.metadata_all}

tsv-filter -H --str-eq ncbi_serotype:denv1 {output.metadata_all} > {output.metadata_denv1}
tsv-filter -H --str-eq ncbi_serotype:denv2 {output.metadata_all} > {output.metadata_denv2}
tsv-filter -H --str-eq ncbi_serotype:denv3 {output.metadata_all} > {output.metadata_denv3}
tsv-filter -H --str-eq ncbi_serotype:denv4 {output.metadata_all} > {output.metadata_denv4}
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@j23414 and I spoke about this online because I was confused about what was going on here.

Background that was discussed during our chat: The metadata_all.tsv has all dengue metadata, including sequences that are not assigned serotypes. The metadata must then be joined with each individual serotype outputs from Nextclade to get the sequence subtype. The metadata + subtype TSV is then split back out into individual serotypes to make them available via data.nextstrain.org.

This rule can be simplified by splitting into 3 different rules.

  1. Aggregates all serotype Nextclade outputs to create a single nextclade_subtype.tsv, i.e. lines 44 - 59.
  2. Joins metadata with nextclade_subtype.tsv, i.e. lines 61 - 67
  3. Splits out the all metadata + nextclade file into individual serotype files, i.e. lines 69 - 72

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for chatting! Refactored and (hopefully) clarified here: 11a5cbc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if it was possible to deduplicate some of the code in the three rules and ended up pushing up
831aa0d. Feel free to rework it!

Copy link
Contributor Author

@j23414 j23414 Feb 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for refactoring and deduplicating the code! Works as expected and looks more streamlined now :D

ingest/workflow/snakemake_rules/nextclade.smk Outdated Show resolved Hide resolved
@j23414 j23414 force-pushed the nextclade_assignment branch 2 times, most recently from 6c4facd to 37c800f Compare February 16, 2024 19:31
@j23414 j23414 force-pushed the nextclade_assignment branch 2 times, most recently from 3049169 to fbc972a Compare February 21, 2024 21:46
@j23414
Copy link
Contributor Author

j23414 commented Feb 22, 2024

After chatting with people, I'm going to fixup the commits and merge since this branch fixes DENV2/AII (aka DENV2/IV) assignment: #28 (comment)

Restore nextclade assignment rules to identify dengue subtypes

The final subtype are added to the metadata_all.tsv file and subsequent
metadata_denv1.tsv to metadata_denv4.tsv files.
j23414 and others added 9 commits February 23, 2024 11:59
Set final targets to be the pair of metadata_denvX.tsv and sequences_denvX.fasta files.

The insertion, alignment and translation files had been produced by nextclade
in other pathogen repos, however, we are not using them in dengue yet. If we need them
later on, we can add in alignment and translation files (split by serotype) back in
at that time.
Since we moved to using the ncbi dataset's 'virus_tax_id' field instead.
Only run nextclade on sequences at least 1000 nt long, this would avoid
misclassification of short sequences. The dengue E gene has an approximate
length of 1400.
While the alignment parameters can be set in the pathogen.json file, specifying
the nextclade alignment parameters in the config file makes it easier to fine tune
based on current sequence diversity being analyzed. This will allow us to
then later propagate them to the pathogen.json file for Nextclade Web.
…arate rules for clarity

This rule can be simplified by splitting into 3 different rules.

1. Aggregates all serotype Nextclade outputs to create a single nextclade_subtype.tsv mapping file
2. Joins metadata with nextclade_subtype.tsv into the final metadata_all.tsv file
3. Splits out the all metadata + nextclade file into individual serotype metadata.tsv files

The reasoning is that the metadata_all.tsv has all dengue metadata, including
sequences that are not assigned serotypes. Sequences without serotype are not listed
in any of the subtype results.

Therefore the nextclade_subtypes are added as a new field in metadata_all.tsv
and then subsequently split back out into individual serotypes to make them
available via data.nextstrain.org.

The serotypes still need to be hardcoded because we only have Nextclade
datasets for these specific serotypes. Use these hardcoded serotypes
as wildcard constraints in the Nextclade rules to deduplicate the
shell commands and parallelize the final `split_metadata_by_serotype`
rule.

co-authored-by: Jover Lee <[email protected]>
This is currently an empty file to indicate the top level pathogen repo.
The inclusion of this file allows the nextstrain build invocation to work
from any directory regardless of runtime.

This is necessary so that rules in "ingest" can access the the external
directory "nextclade_data".
@j23414 j23414 force-pushed the nextclade_assignment branch from fbc972a to b5dc665 Compare February 23, 2024 20:09
@j23414 j23414 merged commit 36a0bb1 into main Feb 24, 2024
32 checks passed
@j23414 j23414 deleted the nextclade_assignment branch February 24, 2024 00:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants