Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Scripts to retrieve data from Nextstrain #24

Open
wants to merge 8 commits into
base: master
Choose a base branch
from
Open

[WIP] Scripts to retrieve data from Nextstrain #24

wants to merge 8 commits into from

Conversation

tomasMasson
Copy link

It's only a draft script, but we can use it to start discussing ideas and make some refactoring. We can also add the code from Didier. @gtauriello @D-Barradas

@D-Barradas
Copy link

@tomasMasson Thank you very munch. My code is brute force compared with yours. I think from ["node_attrs"] we should keep ['genbank_accession'], ['gisaid_epi_isl'] and ['author'] , and of course the id of the Children and the mutations

@gtauriello
Copy link
Contributor

@all-contributors please add @tomasMasson for content, code

@allcontributors
Copy link
Contributor

@gtauriello

I've put up a pull request to add @tomasMasson! 🎉

@gtauriello
Copy link
Contributor

@all-contributors please add @D-Barradas for content

@allcontributors
Copy link
Contributor

@gtauriello

I've put up a pull request to add @D-Barradas! 🎉

@tomasMasson
Copy link
Author

First refactoring of the code. Output is now a .csv file with the following headers (Protein, Mutation, Isolate, Author and GISAID). Omitted the genbank field because not all the samples have it (and we have the UniProtKD AC). @gtauriello

@schdaude
Copy link
Contributor

schdaude commented Apr 8, 2020

Hey, sweet parsing!
Does anybody know what reference sequences nextstrain uses? I ran the parsing script locally and checked the consistency with the underlying uniprot sequences. For a mutation of the form K2160E I assume that the reference one letter code at position 2160 is K and in that bug it has been mutated to E.

That example comes from the following line in the csv:
P0DTD1,K2160E,Hangzhou/ZJU-07/2020,Yao et al,EPI_ISL_416425

However, that location in P0DTD1 is a C. That seems to occur 274 out of 1220 times.

puzzled...

thats the hacky code I'm running btw:

from utils import uniprot

csv_data = open("nextstrain_data.csv", 'r').readlines()[1:]

acs = set()
for line in csv_data:
  acs.add(line.split(',')[0])

sequences = dict()
for ac in acs:
  sequences[ac] = uniprot.seq_from_ac(ac)

for line in csv_data:
  ac = line.split(',')[0]
  mutation = line.split(',')[1]
  orig = mutation[0]
  num = int(mutation[1:len(mutation)-1])
  if sequences[ac][num-1] != orig:
    print(line.strip() + "  uniprot says: " + sequences[ac][num-1])

@tomasMasson
Copy link
Author

Maybe i am messing out with the gene product names. I understand that ORF1b refers to poliprotein 1ab (P0DTD1), but I might be wrong. Below is attached the node with the conflict (that's all the data available at Nextstrain:
"branch_attrs": {
"labels": {
"aa": "ORF1b: K2160E"
},
"mutations": {
"ORF1b": [
"K2160E"
],
"nuc": [
"T1405C",
"G9802T",
"A19945G",
"C25267T",
"C27615G"
]
}
},
"name": "Hangzhou/ZJU-07/2020",
"node_attrs": {
"age": {
"value": "0"
},
"author": {
"author": "Yao et al",
"value": "Yao et al"
},
"clade_membership": {
"value": "unassigned"
},
"country": {
"confidence": {
"China": 1.0
},
"entropy": -1.000088900581841e-12,
"value": "China"
},
"div": 4.999414825011929,
"division": {
"value": "Zhejiang"
},
"gisaid_epi_isl": {
"value": "EPI_ISL_416425"
},
"host": {
"value": "Human"
},
"location": {
"value": "Hangzhou"
},
"num_date": {
"confidence": [
2020.0915300546449,
2020.0915300546449
],
"value": 2020.0915300546449
},
"originating_lab": {
"value": "State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, National Clinical Research Center for Infectious Diseases, First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China 310003"
},
"recency": {
"value": "One month ago"
},
"region": {
"value": "Asia"
},
"sex": {
"value": "Male"
},
"submitting_lab": {
"value": "State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, National Clinical Research Center for Infectious Diseases, First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China 310003"
},
"url": "https://www.gisaid.org"
}
}

@gtauriello
Copy link
Contributor

It's good to double-check with the sequences (I had mentioned that in the issue #10 too).

So according to their documentaion they use the GenBank entry MN908947 as reference. According to our own documentation, the MN908947 sequence is identical to the NCBI reference sequence NC_045512. But probably we need to double-check that this is actually the case and that the sequences match the UniProt sequences...

@gtauriello
Copy link
Contributor

gtauriello commented Apr 8, 2020

So my guess would be that to map ORF1b to P0DTD1 you need an offset which should correspond to the length of the ORF1a stretch (4401 AA (== (13469-266)/3) looking at their ORF1a start/end positions). So would a '+ 4401' work for the mapping? (it does for the K2160 case...)

@schdaude
Copy link
Contributor

schdaude commented Apr 8, 2020

4401 seems to be a fantastic number, we're on the right track here... hacking this offset for all identifiers originating from ORF1b reduces the output of the code above to:
P0DTD1,C5865Y,USA/NY-NYUMC40/2020,Maria Aguero-Rosenfeld et al,EPI_ISL_419701 uniprot says: Y
P0DTC8,S84L,USA/NY-NYUMC40/2020,Maria Aguero-Rosenfeld et al,EPI_ISL_419701 uniprot says: L
P0DTD1,V1997A,Netherlands/NoordBrabant_29/2020,Nieuwenhuijse et al,EPI_ISL_414538 uniprot says: A
P0DTD1,F3606L,Nanchang/JX176/2020,Li et al,EPI_ISL_421261 uniprot says: L
P0DTD1,S765P,USA/NY-NYUMC1/2020,Chen et al,EPI_ISL_414639 uniprot says: P
P0DTD1,V739I,Belgium/DV-0324117/2020,Joan Marti-Carreras et al,EPI_ISL_420374 uniprot says: I
P0DTD1,F3606L,Belgium/ULG-7019/2020,Keith et al,EPI_ISL_417025 uniprot says: L
P0DTC2,P943S,Belgium/BJ-030767/2020,Joan Marti-Carreras et al,EPI_ISL_420323 uniprot says: S
P0DTD1,*4379Y,Portugal/PT0090/2020,Guiomar et al et al,EPI_ISL_421493 uniprot says: Y
P0DTD1,T2187N,Netherlands/NA_24/2020,Nieuwenhuijse et al,EPI_ISL_415481 uniprot says: N
P0DTD1,L4715P,Belgium/SB-030990/2020,Joan Marti-Carreras et al,EPI_ISL_420346 uniprot says: P
P0DTC2,G614D,Australia/VIC123/2020,Seemann et al,EPI_ISL_419731 uniprot says: D
P0DTC5,M175T,Belgium/BC-03016/2020,Vanmechelen et al,EPI_ISL_415157 uniprot says: T
P0DTD1,I265T,Senegal/610/2020,Dia et al,EPI_ISL_420075 uniprot says: T
P0DTC3,H57Q,NanChang/JX216/2020,Li jian Xiong et al,EPI_ISL_417420 uniprot says: Q
P0DTC2,G614D,NanChang/JX216/2020,Li jian Xiong et al,EPI_ISL_417420 uniprot says: D
P0DTC3,H57Q,Belgium/BCM-0324160/2020,Joan Marti-Carreras et al,EPI_ISL_420417 uniprot says: Q
P0DTD1,L4715P,USA/NY-NYUMC22/2020,Maria Aguero-Rosenfeld et al,EPI_ISL_418968 uniprot says: P



@gtauriello
Copy link
Contributor

Hmmm. Maybe we need to check with our friends at nextstrain for that...

I noticed a pattern where all those errors seem swapped (i.e. the result of the mutation seems to match the reference sequence). So I checked the "nuc" entry of the mutation (i.e. what changed in the genome) and compared with the reference genome. And alas I saw the same swapped data. So there seems something off with the input data...

I will open an issue on their github...

In the meantime I suppose we can just continue and mark the non-fitting entries somehow in the annotation? We could check if they fit the "reversed" assumption (e.g. wrong "L4715P" should be "P4715L") so that we can track if there are any other inconsistencies....

@schdaude
Copy link
Contributor

schdaude commented Apr 8, 2020

Thanks @gtauriello for reporting to the nextstrain people! So to conclude, the mutations they report are "following the branches" of the phylogenetic tree they constructed. As far as I understand we can not only have those back mutations but also mutations x->y where none of x and y match the reference uniprot sequence. As we have a hard time to include and display phylogeny, we need to define what we want to show in the end.

The observed mutation events? That's the information we currently have in the csv. But I'm not sure how relevant this information is without the phylogeny.

All possible amino acids we ever observed at a certain position? That would give some idea of variation.

Perform a separate annotation for each sequenced virus? Rational here is that in the current data we have the mutation relative to its ancestor but not all mutations it picked up during evolution.

Other ideas?

@gtauriello
Copy link
Contributor

Indeed. We definitely need to collect mutations on the same position. I would propose to parametrize the code so that one can have multiple annotations:

  1. Color by the number of distinct amino acids observed at that position (and list them in the annotation)
  2. Color by the number of mutation events observed at that position (and list the events in the annotation)

While it might be interesting to do item 1 above while grouping the annotations by some criteria like country of origin (or a list of countries, e.g. "Europe"), I feel like the philogeny data structure of nextstrain would not be the way to go and we should get in touch with the hackathon topic "pangenome" (and/or "philogeny"). An MSA with all strains seems like the better starting point there...

@tomasMasson
Copy link
Author

Thank you for clarifying the issue. I'm in the Phylogeny channel, and we are working in the the sequence retrieval (~ 320 genomes deposited in NCBI) and subsequent alignment. We could use this MSA to extract positions containing mutations.

@gtauriello gtauriello changed the title Initial script to retrieve data from Nextstrain Initial script to retrieve data from Nextstrain (for #10) Apr 9, 2020
@gtauriello gtauriello changed the title Initial script to retrieve data from Nextstrain (for #10) Initial script to retrieve data from Nextstrain Apr 9, 2020
@gtauriello gtauriello linked an issue Apr 9, 2020 that may be closed by this pull request
@gtauriello gtauriello changed the title Initial script to retrieve data from Nextstrain Scripts to retrieve data from Nextstrain Apr 9, 2020
@gtauriello gtauriello changed the title Scripts to retrieve data from Nextstrain [WIP] Scripts to retrieve data from Nextstrain Apr 9, 2020
@gtauriello
Copy link
Contributor

Sorry if my reorga of linking issues to PRs triggered too many emails...

@tomasMasson will the philogeny people only look at those 320? I mean nextstrain has 10x more genomes. Would be a waste to ignore that data. I know that GISAID is not the nicest resource to work with but maybe one can bypass it in some reasonable ways at least to make analyses available. I mean nextstrain seems to be able to do just that. Did you guys look at the the work done at the UCSC Genome Browser?

For the purpose of this issue/PR I would say that we do what we can with the nextstrain data and then move on to other data sources...

The script collects all amino acid one letter codes observed at a certain
location and prepares them for upload to the SWISS-MODEL annotation system.
The previous implementation only annotated the distinct amino acids observed
at a certain location.
@schdaude
Copy link
Contributor

I directly pushed some commits into the master branch from @tomasMasson that implement pretty much the two items described by @gtauriello above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Include variations from processed data in nextstrain
4 participants