Include variations from processed data in nextstrain #10
Comments
I'll start working on the scripts to fetch variation data from Nextstrain. If you want, @gtauriello, I can create a new branch so everyone can see and review the code.
That's great. Thank you. Yes, please do this in a new branch or start a pull request early so people can comment on your code.
Hi @tomasMasson, I'm interested in the branch you will create, since I was also working on parsing the variations from Nextstrain. I got a result, but my code is very basic and could be more pythonic, so I'd really like to see your code. Also, some of the mutations I found look very strange to me, such as N3833K. I retrieved about 50 like this, so I'm asking here in case somebody knows what is going on with that large number.
@D-Barradas not sure what you mean by strange mutations. Do you mean because 3833 is a large number? ORF1a (aka 'Replicase polyprotein 1a' or P0DTC1 or R1A_SARS2) is indeed a 4405 AA long polyprotein (which is cut into smaller pieces), so that is not too surprising. Also, please don't map mutations to ORF1a but to the longer ORF1ab (aka 'Replicase polyprotein 1ab' or P0DTD1 or R1AB_SARS2) whenever possible, as described in the README of this repo. There is a small part (nsp11) at the end of ORF1a where this is ambiguous, though, due to a ribosomal frameshift (see here for details). There you can either map genome-level variations to both ORF1a and ORF1ab or just keep ignoring the ORF1a part, since I am not aware of any relevant role of nsp11.
@gtauriello thanks for answering my question. It was indeed about the number, since I was thinking in terms of the smaller pieces (around 400 AA). Another question: Nextstrain reports ORF1a and ORF1b as separate entities. Should we also ignore ORF1b just to be safe?
By ignoring I just meant the part of ORF1a which differs from ORF1ab. Just to be clear: for the naming used here with ORF1a and ORF1b, we should keep all those variations and map both to ORF1ab (P0DTD1). I suppose one needs to be careful with mutations at genome position 13468, as they can affect 2 amino acids, but I have no idea how Nextstrain handles that. It seems that Nextstrain already maps the mutations into protein-sequence space, so with an appropriate offset you should be able to easily map ORF1b to ORF1ab. But please do add some sanity checks to make sure that the sequences match (i.e. if you map "K2160E" from ORF1b onto P0DTD1, we expect a 'K' at that position...).
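The offset mapping and sanity check described above could be sketched roughly as follows. This is only a minimal illustration, not code from this repo: the function name `map_mutation`, the mutation-string pattern and the toy reference sequence are hypothetical. The actual offset is not given in this thread; the value 4401 mentioned in the comment is an assumption (it follows if Nextstrain's ORF1b starts at genome position 13468) and should be confirmed by the check itself against the real P0DTD1 sequence.

```python
import re

# Mutations are assumed to look like "K2160E": reference residue,
# 1-based position, alternative residue ('*' = stop, '-' = deletion).
MUTATION_RE = re.compile(r"^([A-Z*])(\d+)([A-Z*-])$")

# Assumption, not stated in this thread: ORF1b position + 4401 = ORF1ab position.
ORF1B_TO_ORF1AB_OFFSET = 4401


def map_mutation(mutation, offset, reference):
    """Shift a mutation by `offset` and verify the reference residue.

    `reference` is the target protein sequence (e.g. P0DTD1) as a string.
    Raises ValueError if the string cannot be parsed, the shifted position
    is out of range, or the residue in `reference` does not match.
    """
    match = MUTATION_RE.match(mutation)
    if match is None:
        raise ValueError(f"cannot parse mutation {mutation!r}")
    ref_aa, position, alt_aa = match.group(1), int(match.group(2)), match.group(3)

    new_position = position + offset
    if not 1 <= new_position <= len(reference):
        raise ValueError(f"{mutation}: position {new_position} is outside the reference")

    # Sanity check: the residue we expect must be present in the target sequence.
    found = reference[new_position - 1]
    if found != ref_aa:
        raise ValueError(
            f"{mutation}: expected {ref_aa} at {new_position}, found {found}"
        )
    return f"{ref_aa}{new_position}{alt_aa}"


# Toy example with a made-up sequence and offset (not real SARS-CoV-2 data).
toy_reference = "MESLVPGFNEK"
print(map_mutation("F5L", 3, toy_reference))  # F8L
```

With the real data one would call `map_mutation(m, ORF1B_TO_ORF1AB_OFFSET, p0dtd1_sequence)` for ORF1b mutations and use an offset of 0 for ORF1a ones, logging any mutation that fails the check instead of silently dropping it.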
A possible follow-up for this could use data from the China National Center for Bioinformation, as done in this related resource from UC Riverside: https://coronavirus3d.org/index.html
Two more comments on the above:
I'll take a look at both points.
It looks like the Nextstrain team is releasing the full dataset (12397 genomes) on their visualization page, see nextstrain/ncov#364 (comment), with the raw data living at http://data.nextstrain.org/ncov_global.json. However, I could count only 3123 GISAID genomes (pass the JSON data through a
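For reference, one way to count the genomes in such a file is to walk the tree and count the tips. The sketch below assumes the JSON follows the nested layout where each node is a dict with an optional `children` list and the root sits under a top-level `tree` key; that layout is an assumption and should be checked against the actual file. The helper names and the toy tree are made up for illustration.

```python
def iter_nodes(root):
    """Yield every node of a nested tree without recursion."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        # Internal nodes are assumed to list their descendants under "children".
        stack.extend(node.get("children", []))


def count_tips(root):
    """Count nodes without children, i.e. the sampled genomes."""
    return sum(1 for node in iter_nodes(root) if not node.get("children"))


# Toy tree standing in for the downloaded data (strain names are invented).
toy_tree = {
    "name": "NODE_0",
    "children": [
        {"name": "strain_A"},
        {
            "name": "NODE_1",
            "children": [{"name": "strain_B"}, {"name": "strain_C"}],
        },
    ],
}
print(count_tips(toy_tree))  # 3
```

For the real file one would load it with `json.load` and call `count_tips(data["tree"])`, again assuming that top-level key exists.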
Goal is to have a structure-mapped version of the variations displayed in nextstrain.
We envision the following required steps: