[WIP] Scripts to retrieve data from Nextstrain #24

tomasMasson · 2020-04-07T05:16:34Z

It's only a draft script, but we can use it to start discussing ideas and make some refactoring. We can also add the code from Didier. @gtauriello @D-Barradas

D-Barradas · 2020-04-07T07:44:22Z

@tomasMasson Thank you very much. My code is brute force compared with yours. I think from ["node_attrs"] we should keep ['genbank_accession'], ['gisaid_epi_isl'] and ['author'], and of course the id of the children and the mutations.

gtauriello · 2020-04-07T13:51:48Z

@all-contributors please add @tomasMasson for content, code

allcontributors · 2020-04-07T13:51:59Z

@gtauriello

I've put up a pull request to add @tomasMasson! 🎉

gtauriello · 2020-04-07T13:52:48Z

@all-contributors please add @D-Barradas for content

allcontributors · 2020-04-07T13:52:56Z

@gtauriello

I've put up a pull request to add @D-Barradas! 🎉

tomasMasson · 2020-04-07T20:18:06Z

First refactoring of the code. The output is now a .csv file with the following headers: Protein, Mutation, Isolate, Author and GISAID. I omitted the genbank field because not all the samples have it (and we have the UniProtKB AC). @gtauriello
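A minimal sketch of how such a table could be built from the Nextstrain tree JSON, assuming each node carries `branch_attrs`, `node_attrs` and an optional `children` list as in the node excerpt quoted later in this thread. The function names and the gene-to-accession map are hypothetical and only illustrate the idea; they are not the code from this PR.

```python
import csv
import io

# Hypothetical gene -> UniProtKB AC map; extend it for the other gene products.
GENE_TO_AC = {"ORF1b": "P0DTD1", "S": "P0DTC2"}


def collect_rows(node, gene_to_ac=GENE_TO_AC):
    """Walk a Nextstrain-style tree and yield one row per amino acid mutation."""
    attrs = node.get("node_attrs", {})
    mutations = node.get("branch_attrs", {}).get("mutations", {})
    for gene, changes in mutations.items():
        if gene == "nuc" or gene not in gene_to_ac:
            continue  # skip nucleotide changes and unmapped genes
        for change in changes:
            yield {
                "Protein": gene_to_ac[gene],
                "Mutation": change,
                "Isolate": node.get("name", ""),
                "Author": attrs.get("author", {}).get("value", ""),
                "GISAID": attrs.get("gisaid_epi_isl", {}).get("value", ""),
            }
    # Recurse into the children so every branch of the tree is visited.
    for child in node.get("children", []):
        yield from collect_rows(child, gene_to_ac)


def rows_to_csv(rows):
    """Serialise the rows to CSV text with the five headers listed above."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["Protein", "Mutation", "Isolate", "Author", "GISAID"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Passing the map as a parameter keeps the traversal independent of whichever gene names end up being mapped.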

schdaude · 2020-04-08T16:15:10Z

Hey, sweet parsing!
Does anybody know what reference sequences nextstrain uses? I ran the parsing script locally and checked the consistency with the underlying uniprot sequences. For a mutation of the form K2160E I assume that the reference one letter code at position 2160 is K and in that bug it has been mutated to E.

That example comes from the following line in the csv:
P0DTD1,K2160E,Hangzhou/ZJU-07/2020,Yao et al,EPI_ISL_416425

However, that location in P0DTD1 is a C. That seems to occur 274 out of 1220 times.

puzzled...

That's the hacky code I'm running, btw:

from utils import uniprot

# Read the CSV produced by the parsing script, skipping the header line.
with open("nextstrain_data.csv") as handle:
    csv_data = handle.readlines()[1:]

# Fetch each UniProt sequence only once.
acs = {line.split(',')[0] for line in csv_data}
sequences = {ac: uniprot.seq_from_ac(ac) for ac in acs}

# Report mutations whose original residue disagrees with the UniProt sequence.
for line in csv_data:
    ac, mutation = line.split(',')[:2]
    orig = mutation[0]
    num = int(mutation[1:-1])
    if sequences[ac][num - 1] != orig:
        print(line.strip() + "  uniprot says: " + sequences[ac][num - 1])

tomasMasson · 2020-04-08T16:59:45Z

Maybe I am mixing up the gene product names. I understand that ORF1b refers to polyprotein 1ab (P0DTD1), but I might be wrong. Below is the node with the conflict (that's all the data available at Nextstrain):
  "branch_attrs": {
    "labels": {
      "aa": "ORF1b: K2160E"
    },
    "mutations": {
      "ORF1b": [
        "K2160E"
      ],
      "nuc": [
        "T1405C",
        "G9802T",
        "A19945G",
        "C25267T",
        "C27615G"
      ]
    }
  },
  "name": "Hangzhou/ZJU-07/2020",
  "node_attrs": {
    "age": {
      "value": "0"
    },
    "author": {
      "author": "Yao et al",
      "value": "Yao et al"
    },
    "clade_membership": {
      "value": "unassigned"
    },
    "country": {
      "confidence": {
        "China": 1.0
      },
      "entropy": -1.000088900581841e-12,
      "value": "China"
    },
    "div": 4.999414825011929,
    "division": {
      "value": "Zhejiang"
    },
    "gisaid_epi_isl": {
      "value": "EPI_ISL_416425"
    },
    "host": {
      "value": "Human"
    },
    "location": {
      "value": "Hangzhou"
    },
    "num_date": {
      "confidence": [
        2020.0915300546449,
        2020.0915300546449
      ],
      "value": 2020.0915300546449
    },
    "originating_lab": {
      "value": "State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, National Clinical Research Center for Infectious Diseases, First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China 310003"
    },
    "recency": {
      "value": "One month ago"
    },
    "region": {
      "value": "Asia"
    },
    "sex": {
      "value": "Male"
    },
    "submitting_lab": {
      "value": "State Key Laboratory for Diagnosis and Treatment of Infectious Diseases, National Clinical Research Center for Infectious Diseases, First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China 310003"
    },
    "url": "https://www.gisaid.org"
  }
}

gtauriello · 2020-04-08T17:01:23Z

It's good to double-check with the sequences (I had mentioned that in the issue #10 too).

So according to their documentation they use the GenBank entry MN908947 as reference. According to our own documentation, the MN908947 sequence is identical to the NCBI reference sequence NC_045512. But we probably need to double-check that this is actually the case and that the sequences match the UniProt sequences...

gtauriello · 2020-04-08T17:08:58Z

So my guess would be that to map ORF1b to P0DTD1 you need an offset which should correspond to the length of the ORF1a stretch (4401 AA (== (13469-266)/3) looking at their ORF1a start/end positions). So would a '+ 4401' work for the mapping? (it does for the K2160 case...)
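The offset arithmetic above can be sketched as follows. The ORF1a coordinates are the ones quoted in this comment; the function name and the regular expression are illustrative assumptions, not code from this PR.

```python
import re

# ORF1a start/end on the reference genome, as quoted in the comment above.
ORF1A_START = 266
ORF1A_END = 13469
ORF1B_OFFSET = (ORF1A_END - ORF1A_START) // 3  # 13203 // 3 == 4401

# One-letter code (or '*' / '-'), position, one-letter code (or '*' / '-').
MUTATION_RE = re.compile(r"^([A-Z*-])(\d+)([A-Z*-])$")


def shift_mutation(mutation, offset=ORF1B_OFFSET):
    """Shift an ORF1b mutation such as 'K2160E' into polyprotein 1ab numbering."""
    match = MUTATION_RE.match(mutation)
    if match is None:
        raise ValueError(f"unrecognised mutation string: {mutation!r}")
    orig, pos, new = match.groups()
    return f"{orig}{int(pos) + offset}{new}"


print(shift_mutation("K2160E"))  # prints K6561E
```

Only mutations reported for ORF1b would need the shift; mutations from other genes keep their original numbering.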

schdaude · 2020-04-08T17:24:35Z

4401 seems to be a fantastic number, we're on the right track here... hacking this offset for all identifiers originating from ORF1b reduces the output of the code above to:
P0DTD1,C5865Y,USA/NY-NYUMC40/2020,Maria Aguero-Rosenfeld et al,EPI_ISL_419701 uniprot says: Y
P0DTC8,S84L,USA/NY-NYUMC40/2020,Maria Aguero-Rosenfeld et al,EPI_ISL_419701 uniprot says: L
P0DTD1,V1997A,Netherlands/NoordBrabant_29/2020,Nieuwenhuijse et al,EPI_ISL_414538 uniprot says: A
P0DTD1,F3606L,Nanchang/JX176/2020,Li et al,EPI_ISL_421261 uniprot says: L
P0DTD1,S765P,USA/NY-NYUMC1/2020,Chen et al,EPI_ISL_414639 uniprot says: P
P0DTD1,V739I,Belgium/DV-0324117/2020,Joan Marti-Carreras et al,EPI_ISL_420374 uniprot says: I
P0DTD1,F3606L,Belgium/ULG-7019/2020,Keith et al,EPI_ISL_417025 uniprot says: L
P0DTC2,P943S,Belgium/BJ-030767/2020,Joan Marti-Carreras et al,EPI_ISL_420323 uniprot says: S
P0DTD1,*4379Y,Portugal/PT0090/2020,Guiomar et al et al,EPI_ISL_421493 uniprot says: Y
P0DTD1,T2187N,Netherlands/NA_24/2020,Nieuwenhuijse et al,EPI_ISL_415481 uniprot says: N
P0DTD1,L4715P,Belgium/SB-030990/2020,Joan Marti-Carreras et al,EPI_ISL_420346 uniprot says: P
P0DTC2,G614D,Australia/VIC123/2020,Seemann et al,EPI_ISL_419731 uniprot says: D
P0DTC5,M175T,Belgium/BC-03016/2020,Vanmechelen et al,EPI_ISL_415157 uniprot says: T
P0DTD1,I265T,Senegal/610/2020,Dia et al,EPI_ISL_420075 uniprot says: T
P0DTC3,H57Q,NanChang/JX216/2020,Li jian Xiong et al,EPI_ISL_417420 uniprot says: Q
P0DTC2,G614D,NanChang/JX216/2020,Li jian Xiong et al,EPI_ISL_417420 uniprot says: D
P0DTC3,H57Q,Belgium/BCM-0324160/2020,Joan Marti-Carreras et al,EPI_ISL_420417 uniprot says: Q
P0DTD1,L4715P,USA/NY-NYUMC22/2020,Maria Aguero-Rosenfeld et al,EPI_ISL_418968 uniprot says: P

gtauriello · 2020-04-08T18:37:18Z

Hmmm. Maybe we need to check with our friends at nextstrain for that...

I noticed a pattern where all those errors seem swapped (i.e. the result of the mutation seems to match the reference sequence). So I checked the "nuc" entry of the mutation (i.e. what changed in the genome) and compared it with the reference genome. And alas, I saw the same swapped data. So there seems to be something off with the input data...

I will open an issue on their github...

In the meantime I suppose we can just continue and mark the non-fitting entries somehow in the annotation? We could check if they fit the "reversed" assumption (e.g. wrong "L4715P" should be "P4715L") so that we can track if there are any other inconsistencies....
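A small sketch of the "reversed" check suggested above, assuming the reference protein sequence is available as a plain string. The function name and the three labels are hypothetical.

```python
def classify_mutation(mutation, reference):
    """Classify a mutation such as 'L4715P' against a reference sequence.

    Returns 'match' if the reference carries the original residue,
    'reversed' if it carries the mutated residue instead, and
    'mismatch' otherwise (including positions outside the sequence).
    """
    orig, new = mutation[0], mutation[-1]
    pos = int(mutation[1:-1])  # 1-based position
    if not 1 <= pos <= len(reference):
        return "mismatch"
    ref_aa = reference[pos - 1]
    if ref_aa == orig:
        return "match"
    if ref_aa == new:
        return "reversed"
    return "mismatch"
```

Entries labelled 'mismatch' would be the ones worth tracking separately, since they fit neither the forward nor the reversed assumption.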

schdaude · 2020-04-08T21:41:53Z

Thanks @gtauriello for reporting to the nextstrain people! So to conclude, the mutations they report are "following the branches" of the phylogenetic tree they constructed. As far as I understand, we can have not only those back mutations but also mutations x->y where neither x nor y matches the reference uniprot sequence. As we have a hard time including and displaying phylogeny, we need to define what we want to show in the end.

The observed mutation events? That's the information we currently have in the csv. But I'm not sure how relevant this information is without the phylogeny.

All possible amino acids we ever observed at a certain position? That would give some idea of variation.

Perform a separate annotation for each sequenced virus? The rationale here is that in the current data we have the mutation relative to its ancestor but not all mutations it picked up during evolution.
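For the per-virus option, one way to recover everything an isolate picked up is to accumulate the branch mutations along its path from the root. The sketch below assumes the same nested `children` / `branch_attrs` layout as the node excerpt earlier in this thread; the function name is hypothetical.

```python
def mutations_per_isolate(node, inherited=()):
    """Map each leaf name to the amino acid changes on its path from the root."""
    branch = node.get("branch_attrs", {}).get("mutations", {})
    own = [
        (gene, change)
        for gene, changes in branch.items()
        if gene != "nuc"  # keep amino acid changes only
        for change in changes
    ]
    path = tuple(inherited) + tuple(own)
    children = node.get("children", [])
    if not children:
        # A leaf is a sequenced isolate: report everything seen on the way down.
        return {node.get("name", ""): list(path)}
    result = {}
    for child in children:
        result.update(mutations_per_isolate(child, path))
    return result
```

This keeps every event on the path, so a back mutation shows up as two entries; collapsing them into the final residue per position would be a separate step.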

Other ideas?

gtauriello · 2020-04-08T22:41:49Z

Indeed. We definitely need to collect mutations on the same position. I would propose to parametrize the code so that one can have multiple annotations:

1. Color by the number of distinct amino acids observed at that position (and list them in the annotation)
2. Color by the number of mutation events observed at that position (and list the events in the annotation)

While it might be interesting to do item 1 above while grouping the annotations by some criterion like country of origin (or a list of countries, e.g. "Europe"), I feel like the phylogeny data structure of nextstrain would not be the way to go and we should get in touch with the hackathon topic "pangenome" (and/or "phylogeny"). An MSA with all strains seems like the better starting point there...
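The two annotation modes proposed above could be derived from the same pass over the CSV rows. This is only a sketch: the function name is hypothetical, and counting both the original and the mutated residue as "observed" is an assumption.

```python
from collections import defaultdict


def summarise_positions(rows):
    """Group mutations by (accession, position).

    rows: iterable of (accession, mutation) pairs, e.g. ("P0DTD1", "L4715P").
    Returns two dicts keyed by (accession, position): the set of amino acids
    observed there and the list of mutation events reported there.
    """
    residues = defaultdict(set)
    events = defaultdict(list)
    for accession, mutation in rows:
        key = (accession, int(mutation[1:-1]))
        residues[key].update((mutation[0], mutation[-1]))
        events[key].append(mutation)
    return dict(residues), dict(events)
```

The color value for the first mode would then be `len(residues[key])` and for the second `len(events[key])`, with the set or list itself going into the annotation text.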

tomasMasson · 2020-04-08T23:26:45Z

Thank you for clarifying the issue. I'm in the Phylogeny channel, and we are working on the sequence retrieval (~320 genomes deposited in NCBI) and subsequent alignment. We could use this MSA to extract positions containing mutations.
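Extracting the variable positions from an MSA could look roughly like this, assuming the aligned sequences are already loaded as equal-length strings. The function name and the choice to ignore gap and `X` characters are assumptions.

```python
def variable_positions(aligned, ignore="-X"):
    """Return {1-based column: sorted residues} for columns with >1 residue."""
    lengths = {len(seq) for seq in aligned}
    if len(lengths) > 1:
        raise ValueError("aligned sequences must all have the same length")
    result = {}
    # zip(*aligned) iterates over the alignment column by column.
    for index, column in enumerate(zip(*aligned), start=1):
        residues = set(column) - set(ignore)
        if len(residues) > 1:
            result[index] = sorted(residues)
    return result
```

The returned indices are alignment columns; mapping them back to reference positions would require skipping the gap columns of the reference sequence.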

gtauriello · 2020-04-09T09:53:39Z

Sorry if my reorganization of linking issues to PRs triggered too many emails...

@tomasMasson will the phylogeny people only look at those 320? I mean nextstrain has 10x more genomes. It would be a waste to ignore that data. I know that GISAID is not the nicest resource to work with, but maybe one can bypass it in some reasonable ways, at least to make analyses available. I mean nextstrain seems to be able to do just that. Did you guys look at the work done at the UCSC Genome Browser?

For the purpose of this issue/PR I would say that we do what we can with the nextstrain data and then move on to other data sources...

schdaude · 2020-04-15T08:58:21Z

I directly pushed some commits into the master branch from @tomasMasson that implement pretty much the two items described by @gtauriello above.

Initial script to retrieve data from Nextstrain

bcfc3ee

allcontributors bot mentioned this pull request Apr 7, 2020

docs: add tomasMasson as a contributor #26

Merged

allcontributors bot mentioned this pull request Apr 7, 2020

docs: add D-Barradas as a contributor #27

Merged

tomasMasson added 2 commits April 7, 2020 16:54

Fix json keys. Added output format.

2f7133a

Updated README.md

c3457c7

Corrected ORF1b offset (+4401) and added a column displaying the ORF name

c1ff7f4

gtauriello mentioned this pull request Apr 8, 2020

Some mutation data entries don't fit gene / protein sequences nextstrain/ncov#345

Closed

gtauriello changed the title ~~Initial script to retrieve data from Nextstrain~~ Initial script to retrieve data from Nextstrain (for #10) Apr 9, 2020

gtauriello changed the title ~~Initial script to retrieve data from Nextstrain (for #10)~~ Initial script to retrieve data from Nextstrain Apr 9, 2020

gtauriello linked an issue Apr 9, 2020 that may be closed by this pull request

Include variations from processed data in nextstrain #10

Open

gtauriello changed the title ~~Initial script to retrieve data from Nextstrain~~ Scripts to retrieve data from Nextstrain Apr 9, 2020

gtauriello changed the title ~~Scripts to retrieve data from Nextstrain~~ [WIP] Scripts to retrieve data from Nextstrain Apr 9, 2020

schdaude added 2 commits April 15, 2020 08:41

Annotation of variations extracted from Nextstrain data

988fab1

The script collects all amino acid one letter codes observed at a certain location and prepares them for upload to the SWISS-MODEL annotation system.

Enable color coding for number of distinct amino acids in Nextstrain data

f3e2c3e

Additionally allow to annotate mutation events themselves

d5569a8

The previous implementation only annotated the distinct amino acids observed at a certain location.

Refactoring of get_nextstrain_data.py

6893e5c
