-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nextclade assignment #16
Conversation
9ae8383
to
e638db2
Compare
e638db2
to
3321925
Compare
ceae8c5
to
5205fd0
Compare
nextclade_data/README.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't done an in-depth review yet but wanted to ask about the Nextclade dataset in general:
- Should we start with V3 datasets instead of V2 so we don't have worry about migrating them later?
- Should the dengue datasets be added to https://github.com/nextstrain/nextclade_data and then pulled down through the usual nextclade commands?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point.
I'll run a test conversion to V3 to see if any errors occur, similar to https://github.com/j23414/test_convert_v3 . Happy to pair-code on this, but no pressure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made a quick attempt:
python convert_v2_to_v3.py \
--input-dir ~/github/nextstrain/dengue_branches/nextclade_assignment/nextclade_data/denv1 \
--output-dir dengue_v3/denv1
Wonder what's causing "valueFriendly" errors (nextclade_v3/dengue/denv1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wonder what's causing "valueFriendly" errors
Do you see errors when running Nextclade v3 via the CLI instead of through the test web UI? I suspect the dataset-url
param has not been updated to support V3 datasets (yet??). Actually, they probably will not support the dataset-url
param going forward considering they are asking people to directly submit datasets to the nextclade_data repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Different error, I went and converted the "dengue/all"
nextclade3 run \
--input-dataset ../nextclade_data/all \ #<====== This is created with the convert_v2_to_v3.py script
-j 4 \
--output-tsv data/nextclade_results/nextclade_all.tsv \
--min-length 1000 \
--silent \
data/sequences.fasta
And hitting:
Error:
0: When preprocessing Nextclade graph
1: When retrieving nuc mutations from reference tree node NC_002640.1
2: Encountered a mutation (T447C) in reference tree branch attributes, for which the origin state of the mutation is inconsistent with the state at the parental branch. Mutations origin state is 'T', but tree (inferred from the reference sequence as no prior mutations were observed at this position) has state 'C'. This is likely an inconsistency between reference tree and reference sequence in the Nextclade dataset. Reference sequence should either correspond to the root of the reference tree or the root of the reference tree needs to account for difference between the tree and reference sequence. Please check that your reference tree is consistent with your reference sequence.
Location:
packages_rs/nextclade/src/tree/tree_preprocess.rs:145
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message makes me think the reference tree is somehow violating one of the requirements:
The tree must be rooted at the sample that matches the reference (root) sequence.
You mentioned separately that
dengue (multiple spillovers, long diversity, there is no root strain for all 4 major spillovers) is probably problematic for nextclade v3.
Maybe @corneliusroemer or @ivan-aksamentov can comment on what's the best way to validate this dataset and whether there needs to be some adjustments specifically for Dengue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the troubles! Dataset creation is still a bit rough at edges. I'll mostly provide technical help and let scientists to help with the science.
Wonder what's causing "valueFriendly" errors
The reason for this error is simply that v3 is not released and so we've been breaking things in the dataset formats lately without telling anyone :). You used the latest conversion script from master branch of data repo, but also a random old version of Nextclade Web from a couple of months ago (the "UI2" prototype, https://nextclade-git-ui-2-nextstrain.vercel.app/), and the dataset format is simply incompatible between those versions.
We are pretty much done with the changes and we plan to make the v3 release soon. You can use https://master.clades.nextstrain.org for testing latest Web features in the meantime - this is a snapshot of master branch. And I released 3.0.0-alpha.2 of the CLI yesterday. Likely the last one.
The master deployment of Nextclade Web should show the same error as the recent Nextclade CLI:
https://master.clades.nextstrain.org/?dataset-url=https://github.com/j23414/test_convert_v3/tree/main/dengue_v3/denv1
Until the release, please also refer to the "latest" version of docs, sourced from master branch (where there's v3), rather than "stable" version sourced from release branch (where there's still v2):
https://docs.nextstrain.org/projects/nextclade/en/latest/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2: Encountered a mutation (T447C) in reference tree branch attributes, for which the origin state of the mutation is inconsistent with the state at the parental branch [...]
This is an error from the new tree placement algo. It now has some simple tree building capabilities (it can migrate new nodes deeper into the tree when appropriate, rather than just appending them as leaves). Here the new algo tried to calculate mutations resulting from adjusting node placement, but cannot make sense of the mutations on the ref tree, and cannot proceed.
As stated in the error and as Jover mentioned, it is likely due to a mismatch between the ref sequence fasta and the sequence for root node in the tree JSON. Augur usually provides consistent trees. However, if you then feeding a different ref sequence to Nextclade, the root and ref don't match - the evolution breaks (this is my simplified, perhaps limited, understanding as an engineer). A workaround is to add mutations between ref seq and root node seq to the root node JSON. I think Richard and Cornelius have some scripts for that in their dataset workflow repos. We need some scientific expertise there.
Although it worked in v2, it was likely silently producing incorrect results. Nextclade has no sufficient information to catch the inconsistency between the ref and the root, so the only thing we do is we warn users in the docs and shift responsibility to handle this correctly onto them. In v3 however, this can sometimes manifest if the mutations are inconsistent and it breaks the algorithm. Thus the error.
We did not have many custom datasets, and most datasets were created by Richard and Cornelius, so this was not a critical issue in v2. In v3 we actively encourage people to create datasets, so I am wondering what we can do for dataset authors to catch this earlier, during dataset creation, rather than during Nextclade run. Perhaps we can have a script in the data repo to check the ref tree JSON? Any ideas/help is very welcome here.
If you absolutely have no time to deal with this now, you can disable tree builder in CLI with --without-greedy-tree-builder
flag. This will fall back to the v2 placement algo. And you can revisit this later. But, as I mentioned, it might produce results that are not 100% correct, so this is not recommended in general. It is also not currently possible to disable tree builder in the Web version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
directly submit datasets to the nextclade_data repo.
You are very welcome to submit!
will not support the dataset-url param going forward
Nextclade v3 will retain support of the dataset-url
and dataset-server
as the ways to fetch external datasets! It is important for 3rd party integrations (other apps linking to Nextclade). Also handy for dataset testing, just as you are doing it - it is an intended use-case :).
We also extended the format of accepted URLs - you can use the fancy shortcuts for URLs on GitHub - they are a bit easier to write by hand :) Although in this case it does not make it much shorter
https://master.clades.nextstrain.org/?dataset-url=gh:j23414/test_convert_v3@main@dengue_v3/denv1
You can also omit the branch if it's the default branch:
https://master.clades.nextstrain.org/?dataset-url=gh:j23414/test_convert_v3/dengue_v3/denv1
P.S. Don't hesitate to ping me on Slack or on GitHub for any questions!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few points/questions reading through this thread:
- I'd probably not do the v2 -> v3 conversion using the script but rather use an existing v2 dataset with v2 nextclade and make a new v3 compliant dataset in a v3 dataset workflow.
- I don't know much about Dengue, it may make sense to have one dataset for each serotype, and maybe one "all serotypes" dataset in addition. I would think it should be possible to align different serotypes against each other but the result might contain a lot of indels and mutations making it hard to interpret in web UI
- There's a new v3 subcommand
split
that should be helpful here to split by serotypes. But unfortunately it requires creating a minimizer index for which the tooling is a bit hidden at the moment. For now it might still be easiest to make a single "all-serotypes" dataset and use that to split into subtypes. - The inconsistency error you get because we require the genotype on the root branch to be the same as the reference. If the tree root is X and the reference is Y, you need to put all mutations from Y to X on the first branch leading to the tree root. Augur ancestral does this automatically now if you give it the reference as
--root-sequence
(the terminology here is confusing,root-sequence
should really be calledreference-sequence
), see here for how the mpox Nextclade workflow does it: https://github.com/nextstrain/mpox/blob/master/nextclade/Snakefile#L343C1-L370C12 If this is a struggle, you can start by having the root and reference be the same.
3425884
to
9b9a10d
Compare
Looking at the PR, given that you already have working v2 datasets, I don't think it would be unreasonable to continue using As the purpose of this PR is to solely split into serotypes, there's no need for v3 features. Transitioning from 2 -> 3 can happen once this is merged to reduce complexity. |
1ee5ac5
to
2dce01b
Compare
2dce01b
to
de11274
Compare
35ee5d3
to
e4e78b0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excited to see the dengue subtype assignments through Nextclade!
I'm less experienced on Nextclade datasets themselves, so I focused on the workflow changes. We can discuss the Nextclade dataset specifics in #25.
min_length=1000, # E gene length is approximately 1400 | ||
min_seed_cover=0.1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these flags still need to be provided even through they are defined within the Nextclade datasets under alignmentParams
?
If we need to keep them or if this is a means for users to override the default params, I recommend moving these to configs that can be easily changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct that these flags are equivalent to the alignmentParams
and therefore do not need to be provided. Nonetheless, I'll move these params to the config file (a9b4d01) allowing us to refine them further in accordance with the present diversity in denegue sequences (for genome length and E gene length).
Subsequently, we can update the pathogen.json
file once a definitive value is established.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, that's good to know! So once they've been settled in pathogen.json
then we can remove them from the config and the workflow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, exactly :D
""" | ||
echo "genbank_accession,nextclade_subtype,nextclade_type" \ | ||
| tr ',' '\t' \ | ||
> results/nextclade_subtype.tsv | ||
|
||
tsv-select -H -f "seqName,clade" {input.nextclade_denv1} \ | ||
| awk 'NR>1 {{print $0"\tDENV1"}}' \ | ||
>> results/nextclade_subtype.tsv | ||
tsv-select -H -f "seqName,clade" {input.nextclade_denv2} \ | ||
| awk 'NR>1 {{print $0"\tDENV2"}}' \ | ||
>> results/nextclade_subtype.tsv | ||
tsv-select -H -f "seqName,clade" {input.nextclade_denv3} \ | ||
| awk 'NR>1 {{print $0"\tDENV3"}}' \ | ||
>> results/nextclade_subtype.tsv | ||
tsv-select -H -f "seqName,clade" {input.nextclade_denv4} \ | ||
| awk 'NR>1 {{print $0"\tDENV4"}}' \ | ||
>> results/nextclade_subtype.tsv | ||
|
||
tsv-join -H \ | ||
--filter-file results/nextclade_subtype.tsv \ | ||
--key-fields genbank_accession \ | ||
--append-fields 'nextclade_subtype,nextclade_type' \ | ||
--write-all ? \ | ||
{input.metadata} \ | ||
> {output.metadata_all} | ||
|
||
tsv-filter -H --str-eq ncbi_serotype:denv1 {output.metadata_all} > {output.metadata_denv1} | ||
tsv-filter -H --str-eq ncbi_serotype:denv2 {output.metadata_all} > {output.metadata_denv2} | ||
tsv-filter -H --str-eq ncbi_serotype:denv3 {output.metadata_all} > {output.metadata_denv3} | ||
tsv-filter -H --str-eq ncbi_serotype:denv4 {output.metadata_all} > {output.metadata_denv4} | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@j23414 and I spoke about this online because I was confused about what was going on here.
Background that was discussed during our chat: The metadata_all.tsv
has all dengue metadata, including sequences that are not assigned serotypes. The metadata must then be joined with each individual serotype outputs from Nextclade to get the sequence subtype. The metadata + subtype TSV is then split back out into individual serotypes to make them available via data.nextstrain.org.
This rule can be simplified by splitting into 3 different rules.
- Aggregates all serotype Nextclade outputs to create a single nextclade_subtype.tsv, i.e. lines 44 - 59.
- Joins metadata with nextclade_subtype.tsv, i.e. lines 61 - 67
- Splits out the all metadata + nextclade file into individual serotype files, i.e. lines 69 - 72
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for chatting! Refactored and (hopefully) clarified here: 11a5cbc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering if it was possible to deduplicate some of the code in the three rules and ended up pushing up
831aa0d. Feel free to rework it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for refactoring and deduplicating the code! Works as expected and looks more streamlined now :D
6c4facd
to
37c800f
Compare
3049169
to
fbc972a
Compare
After chatting with people, I'm going to fixup the commits and merge since this branch fixes |
Restore nextclade assignment rules to identify dengue subtypes The final subtype are added to the metadata_all.tsv file and subsequent metadata_denv1.tsv to metadata_denv4.tsv files.
Set final targets to be the pair of metadata_denvX.tsv and sequences_denvX.fasta files. The insertion, alignment and translation files had been produced by nextclade in other pathogen repos, however, we are not using them in dengue yet. If we need them later on, we can add in alignment and translation files (split by serotype) back in at that time.
Since we moved to using the ncbi dataset's 'virus_tax_id' field instead.
Only run nextclade on sequences at least 1000 nt long, this would avoid misclassification of short sequences. The dengue E gene has an approximate length of 1400.
While the alignment parameters can be set in the pathogen.json file, specifying the nextclade alignment parameters in the config file makes it easier to fine tune based on current sequence diversity being analyzed. This will allow us to then later propagate them to the pathogen.json file for Nextclade Web.
…arate rules for clarity This rule can be simplified by splitting into 3 different rules. 1. Aggregates all serotype Nextclade outputs to create a single nextclade_subtype.tsv mapping file 2. Joins metadata with nextclade_subtype.tsv into the final metadata_all.tsv file 3. Splits out the all metadata + nextclade file into individual serotype metadata.tsv files The reasoning is that the metadata_all.tsv has all dengue metadata, including sequences that are not assigned serotypes. Sequences without serotype are not listed in any of the subtype results. Therefore the nextclade_subtypes are added as a new field in metadata_all.tsv and then subsequently split back out into individual serotypes to make them available via data.nextstrain.org. The serotypes still need to be hardcoded because we only have Nextclade datasets for these specific serotypes. Use these hardcoded serotypes as wildcard constraints in the Nextclade rules to deduplicate the shell commands and parallelize the final `split_metadata_by_serotype` rule. co-authored-by: Jover Lee <[email protected]>
This is currently an empty file to indicate the top level pathogen repo. The inclusion of this file allows the nextstrain build invocation to work from any directory regardless of runtime. This is necessary so that rules in "ingest" can access the the external directory "nextclade_data".
fbc972a
to
b5dc665
Compare
Description of proposed changes
This PR is a continuation of breaking out the dengue ingest changes PR#6 into more focused and manageable pull requests. Following the merge of copy ingest, the focus of this PR is to split the dengue sequences by serotype (e.g.
sequences_denv1.fasta
tosequences_denv4.fasta
).Previously for dengue, we had been fetching each serotype separately in this code, leading to redundant fetching and processing of each sequence. Additionally, this method posed the risk of missing sequences in individual serotype builds if NCBI did not annotate the record with the lineage ID.
Therefore, as an alternative, this PR proposes to leverage Nextclade assignment to categorize the records into the major dengue serotypes, and subsequent subtypes.
The primary scope of this PR includes:
Related issue(s)
Checklist