-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Scripts to retrieve data from Nextstrain #24
base: master
Are you sure you want to change the base?
Conversation
@tomasMasson Thank you very munch. My code is brute force compared with yours. I think from ["node_attrs"] we should keep ['genbank_accession'], ['gisaid_epi_isl'] and ['author'] , and of course the id of the Children and the mutations |
@all-contributors please add @tomasMasson for content, code |
I've put up a pull request to add @tomasMasson! 🎉 |
@all-contributors please add @D-Barradas for content |
I've put up a pull request to add @D-Barradas! 🎉 |
First refactoring of the code. Output is now a .csv file with the following headers (Protein, Mutation, Isolate, Author and GISAID). Omitted the genbank field because not all the samples have it (and we have the UniProtKD AC). @gtauriello |
Hey, sweet parsing! That example comes from the following line in the csv: However, that location in P0DTD1 is a C. That seems to occur 274 out of 1220 times. puzzled... thats the hacky code I'm running btw: from utils import uniprot
csv_data = open("nextstrain_data.csv", 'r').readlines()[1:]
acs = set()
for line in csv_data:
acs.add(line.split(',')[0])
sequences = dict()
for ac in acs:
sequences[ac] = uniprot.seq_from_ac(ac)
for line in csv_data:
ac = line.split(',')[0]
mutation = line.split(',')[1]
orig = mutation[0]
num = int(mutation[1:len(mutation)-1])
if sequences[ac][num-1] != orig:
print(line.strip() + " uniprot says: " + sequences[ac][num-1]) |
Maybe i am messing out with the gene product names. I understand that ORF1b refers to poliprotein 1ab (P0DTD1), but I might be wrong. Below is attached the node with the conflict (that's all the data available at Nextstrain: |
It's good to double-check with the sequences (I had mentioned that in the issue #10 too). So according to their documentaion they use the GenBank entry MN908947 as reference. According to our own documentation, the MN908947 sequence is identical to the NCBI reference sequence NC_045512. But probably we need to double-check that this is actually the case and that the sequences match the UniProt sequences... |
So my guess would be that to map ORF1b to P0DTD1 you need an offset which should correspond to the length of the ORF1a stretch (4401 AA (== (13469-266)/3) looking at their ORF1a start/end positions). So would a '+ 4401' work for the mapping? (it does for the K2160 case...) |
4401 seems to be a fantastic number, we're on the right track here... hacking this offset for all identifiers originating from ORF1b reduces the output of the code above to:
|
Hmmm. Maybe we need to check with our friends at nextstrain for that... I noticed a pattern where all those errors seem swapped (i.e. the result of the mutation seems to match the reference sequence). So I checked the "nuc" entry of the mutation (i.e. what changed in the genome) and compared with the reference genome. And alas I saw the same swapped data. So there seems something off with the input data... I will open an issue on their github... In the meantime I suppose we can just continue and mark the non-fitting entries somehow in the annotation? We could check if they fit the "reversed" assumption (e.g. wrong "L4715P" should be "P4715L") so that we can track if there are any other inconsistencies.... |
Thanks @gtauriello for reporting to the nextstrain people! So to conclude, the mutations they report are "following the branches" of the phylogenetic tree they constructed. As far as I understand we can not only have those back mutations but also mutations x->y where none of x and y match the reference uniprot sequence. As we have a hard time to include and display phylogeny, we need to define what we want to show in the end. The observed mutation events? That's the information we currently have in the csv. But I'm not sure how relevant this information is without the phylogeny. All possible amino acids we ever observed at a certain position? That would give some idea of variation. Perform a separate annotation for each sequenced virus? Rational here is that in the current data we have the mutation relative to its ancestor but not all mutations it picked up during evolution. Other ideas? |
Indeed. We definitely need to collect mutations on the same position. I would propose to parametrize the code so that one can have multiple annotations:
While it might be interesting to do item 1 above while grouping the annotations by some criteria like country of origin (or a list of countries, e.g. "Europe"), I feel like the philogeny data structure of nextstrain would not be the way to go and we should get in touch with the hackathon topic "pangenome" (and/or "philogeny"). An MSA with all strains seems like the better starting point there... |
Thank you for clarifying the issue. I'm in the Phylogeny channel, and we are working in the the sequence retrieval (~ 320 genomes deposited in NCBI) and subsequent alignment. We could use this MSA to extract positions containing mutations. |
Sorry if my reorga of linking issues to PRs triggered too many emails... @tomasMasson will the philogeny people only look at those 320? I mean nextstrain has 10x more genomes. Would be a waste to ignore that data. I know that GISAID is not the nicest resource to work with but maybe one can bypass it in some reasonable ways at least to make analyses available. I mean nextstrain seems to be able to do just that. Did you guys look at the the work done at the UCSC Genome Browser? For the purpose of this issue/PR I would say that we do what we can with the nextstrain data and then move on to other data sources... |
The script collects all amino acid one letter codes observed at a certain location and prepares them for upload to the SWISS-MODEL annotation system.
The previous implementation only annotated the distinct amino acids observed at a certain location.
I directly pushed some commits into the master branch from @tomasMasson that implement pretty much the two items described by @gtauriello above. |
It's only a draft script, but we can use it to start discussing ideas and make some refactoring. We can also add the code from Didier. @gtauriello @D-Barradas