Include variations from processed data in nextstrain #10

gtauriello · 2020-04-05T14:12:47Z

Goal is to have a structure-mapped version of the variations displayed in nextstrain.
We envision the following required steps:

- parse json data following their dev docs
- map variations onto UniProtKB ACs used in SWISS-MODEL (the work done at the UCSC Genome Browser could be helpful for this)
- define colors and annotation texts for variations
- test using SWISS-MODEL's annotation system
- properly acknowledge source of data (see also "Data" section in nextstrain's README)
- followups: add possibility to filter results (e.g. only from country X or above a certain confidence), process into entropies, ...
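
As a starting point for the parsing step, here is a minimal sketch that walks a nextstrain tree and collects the amino-acid mutations per branch. It assumes the Auspice v2 JSON layout, where each node carries `branch_attrs.mutations` (a mapping from gene name to a list of strings such as `"N3833K"`, with nucleotide changes under `"nuc"`) and an optional `children` list. The function name and the toy tree are made up for illustration and should be checked against the dev docs and the real download.

```python
import json


def collect_aa_mutations(tree):
    """Return (node name, gene, mutation) tuples for every branch in the tree.

    Assumes the Auspice v2 layout: node["branch_attrs"]["mutations"] maps a
    gene name to a list of changes, and "nuc" holds nucleotide changes.
    """
    found = []
    stack = [tree]  # explicit stack, so deep trees cannot hit the recursion limit
    while stack:
        node = stack.pop()
        mutations = node.get("branch_attrs", {}).get("mutations", {})
        for gene, changes in mutations.items():
            if gene == "nuc":  # skip nucleotide-level changes
                continue
            for change in changes:
                found.append((node.get("name"), gene, change))
        # reversed() keeps the left-to-right order of the children
        stack.extend(reversed(node.get("children", [])))
    return found


# Hand-made toy tree standing in for the real JSON (not real data).
toy_tree = json.loads("""
{
  "name": "NODE_0",
  "branch_attrs": {"mutations": {}},
  "children": [
    {"name": "strain_A",
     "branch_attrs": {"mutations": {"nuc": ["C100T"],
                                    "ORF1a": ["L2235I", "N3833K"]}}},
    {"name": "strain_B",
     "branch_attrs": {"mutations": {"ORF1b": ["K2160E"]}}}
  ]
}
""")

for row in collect_aa_mutations(toy_tree):
    print(row)
```

Note that a mutation is listed on the branch where it arises, so it applies to every descendant of that node. Assigning variations to individual strains would require carrying the accumulated mutations down the tree.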

gtauriello · 2020-04-05T14:46:02Z

Preliminary work by @jttkim (see here) could be a great starting point for such an effort.

tomasMasson · 2020-04-06T17:18:09Z

I'll start working on the scripts to fetch variation data from Nextstrain. If you want, @gtauriello, I can create a new branch so everyone can see/review the code.

gtauriello · 2020-04-06T17:46:44Z

That's great. Thank you. Yes, please do this in a new branch or start a pull request early so people can comment on your code.

D-Barradas · 2020-04-06T21:42:53Z

Hi @tomasMasson, I'm interested in the branch you will create, since I was also working on parsing the variations from nextstrain. I got a result, but my code is very basic and could be more pythonic, so I'm really interested in seeing your code. Also, some of the mutations I found look very strange to me, like N3833K (below). I retrieved about 50 like this, so I'm asking for a friend here: does anybody know what's with that large number?

| gene | GenBank accession | gisaid_epi_isl | mutations | author |
|---|---|---|---|---|
| ORF1a | LR757998 | EPI_ISL_406798 | **L2235I** | Chen et al |
| ORF1a | LR757998 | EPI_ISL_406798 | **N3833K** | Chen et al |

gtauriello · 2020-04-06T22:54:38Z

@D-Barradas not sure what you mean by strange mutations. Do you mean because 3833 is a large number? ORF1a (aka 'Replicase polyprotein 1a' or P0DTC1 or R1A_SARS2) is indeed a 4405 AA long polyprotein (which is cut into smaller pieces), so this is not too surprising.

Also please don't map mutations to ORF1a but to the longer ORF1ab (aka 'Replicase polyprotein 1ab' or P0DTD1 or R1AB_SARS2) as described in the README of this repo whenever possible. There is a small part (nsp11) at the end of ORF1a where this is ambiguous though due to a ribosomal frameshift (see here for details). There you can either map genome-level variations to both ORF1a and ORF1ab or just keep ignoring the ORF1a part since I am not aware of any relevant role of nsp11.

D-Barradas · 2020-04-07T05:21:35Z

@gtauriello thanks for answering my question. It was indeed about the number, since I was thinking in terms of smaller pieces (400 aa). Then another question: nextstrain reports ORF1a and ORF1b as separate entities. Should we also ignore ORF1b just to be safe?

| | ORF1a | ORF1b |
|---|---|---|
| end | 13468 | 21555 |
| seqid | config/reference.gb | config/reference.gb |
| start | 266 | 13468 |
| strand | + | + |
| type | CDS | CDS |

gtauriello · 2020-04-07T10:01:15Z

With ignoring I just meant the part in ORF1a which differs from ORF1ab. Just to be clear...

For the naming used here with ORF1a and ORF1b, we should keep all those variations and map them to ORF1ab (P0DTD1) for both. I suppose one needs to be careful with mutations at genome position 13468, though, as they can affect 2 amino acids. I have no idea how nextstrain handles that.
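
To make the point about genome position 13468 concrete, here is a small sketch that computes which codon a genome position falls into for a given CDS start. It uses only the start coordinates quoted above (266 for ORF1a, 13468 for ORF1b) and assumes they are 1-based and inclusive, as is usual for GenBank annotations. The helper name is made up for illustration.

```python
def codon_of(genome_pos, cds_start):
    """Return (1-based codon number, 1-based position within that codon)."""
    offset = genome_pos - cds_start
    if offset < 0:
        raise ValueError("position lies before the CDS start")
    return offset // 3 + 1, offset % 3 + 1


ORF1A_START = 266    # from the annotation table quoted above
ORF1B_START = 13468  # from the annotation table quoted above

print(codon_of(13468, ORF1A_START))  # (4401, 3)
print(codon_of(13468, ORF1B_START))  # (1, 1)
```

Under these assumptions, position 13468 is the third base of ORF1a codon 4401 and, at the same time, the first base of ORF1b codon 1, which is why a single nucleotide change there can show up as two amino-acid changes.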

It seems that nextstrain already maps the mutations into protein-sequence space and so with an appropriate offset you should be able to easily map ORF1b to ORF1ab. But please do add some sanity checks to make sure that the sequences match (i.e. if you map "K2160E" from ORF1b onto P0DTD1 we expect a 'K' at that position...).
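
A minimal sketch of the offset mapping with the sanity check could look like the following. The offset of 4401 is an assumption derived from the coordinates quoted above: ORF1a spans 266..13468, i.e. (13468 - 266 + 1) / 3 = 4401 codons, so ORF1b residue 1 would correspond to ORF1ab residue 4402. The function name, the regular expression, and the toy reference sequence are made up for illustration; the real check needs the P0DTD1 sequence from UniProtKB.

```python
import re

# One-letter wild type, 1-based position, one-letter variant ('*' = stop, '-' = deletion).
MUTATION_RE = re.compile(r"^([A-Z*])(\d+)([A-Z*-])$")

# Assumption: 4401 codons of ORF1a precede ORF1b within ORF1ab (see lead-in).
OFFSETS = {"ORF1a": 0, "ORF1b": 4401}


def map_to_orf1ab(gene, mutation, reference, offsets=OFFSETS):
    """Shift a mutation such as 'K2160E' into ORF1ab numbering.

    Raises ValueError if the string cannot be parsed, the shifted position is
    outside the reference, or the reference residue differs from the wild type.
    """
    match = MUTATION_RE.match(mutation)
    if match is None:
        raise ValueError(f"cannot parse mutation {mutation!r}")
    wild, pos, variant = match.group(1), int(match.group(2)), match.group(3)
    new_pos = pos + offsets[gene]
    if not 1 <= new_pos <= len(reference):
        raise ValueError(f"position {new_pos} is outside the reference")
    if reference[new_pos - 1] != wild:  # the sanity check asked for above
        raise ValueError(
            f"expected {wild!r} at {new_pos}, found {reference[new_pos - 1]!r}"
        )
    return f"{wild}{new_pos}{variant}"


# Toy example with a made-up 6-residue reference and a made-up offset of 3.
toy_reference = "MKTAYK"
toy_offsets = {"ORF1a": 0, "ORF1b": 3}
print(map_to_orf1ab("ORF1b", "A1G", toy_reference, toy_offsets))  # A4G
```

With the real sequence and the default offsets, "K2160E" from ORF1b would be reported as "K6561E" only if P0DTD1 indeed has a 'K' at position 6561; otherwise the function raises and the offset or the input needs a second look.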

gtauriello · 2020-04-25T13:24:37Z

A possible followup for this could use data from the China National Center for Bioinformation as done in this related resource from UC Riverside: https://coronavirus3d.org/index.html

gtauriello · 2020-04-25T19:10:30Z

Two more comments on the above:

- Unsure whether that source for mutations is illegally bypassing GISAID data sharing policies (based on discussions in the public_sequence_resource topic of the biohackathon). So we should probably use it with care. The main sources of data there are GISAID and GenBank.
- Nextstrain is subsampling their phylogenetic tree (see this discussion here). So we may need another approach to get the full set of variations.

tomasMasson · 2020-04-25T22:36:35Z

I'll take a look at both points.

tomasMasson · 2020-05-02T21:07:59Z

It looks like the Nextstrain guys are releasing the full dataset (12397 genomes) on their viz page nextstrain/ncov#364 (comment), with the raw data living at http://data.nextstrain.org/ncov_global.json. However, I could count only 3123 GISAID genomes (by passing the JSON data through a grep filter on the command line).
Additionally, at http://cov-glue.cvr.gla.ac.uk/#/home they released a table with amino acid replacements for the GISAID sequences. The problem with this site is the lack of a download button for the data (it is an alpha version, so maybe they are going to add it later).
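
As an alternative to grep, here is a rough sketch for counting tips with a GISAID accession directly from the parsed tree. It assumes the Auspice v2 layout, with the accession stored under `node_attrs.gisaid_epi_isl.value` on each tip and real accessions starting with `EPI_ISL_`. Both the attribute path and the function name are assumptions to verify against the actual file; the toy tree is not real data apart from the accession already quoted in this thread.

```python
def count_gisaid_tips(tree, key="gisaid_epi_isl"):
    """Return (number of tips, number of tips with an EPI_ISL_ accession)."""
    total = with_id = 0
    stack = [tree]
    while stack:
        node = stack.pop()
        children = node.get("children", [])
        if children:  # internal node: keep descending
            stack.extend(children)
            continue
        total += 1
        # Assumed location of the accession; missing values count as "no id".
        value = node.get("node_attrs", {}).get(key, {}).get("value", "")
        if str(value).startswith("EPI_ISL_"):
            with_id += 1
    return total, with_id


# Toy tree: one tip with an accession, one with a placeholder, one without the attribute.
toy_tree = {
    "name": "NODE_0",
    "children": [
        {"name": "tip_A",
         "node_attrs": {"gisaid_epi_isl": {"value": "EPI_ISL_406798"}}},
        {"name": "tip_B", "node_attrs": {"gisaid_epi_isl": {"value": "?"}}},
        {"name": "tip_C", "node_attrs": {}},
    ],
}

print(count_gisaid_tips(toy_tree))  # (3, 1)
```

Counting tips rather than matching text avoids double counting when the same accession string appears more than once in the file, which may or may not explain the gap between the two numbers above.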

gtauriello added the enhancement New feature or request label Apr 6, 2020

gtauriello added the in progress This is currently being worked on label Apr 6, 2020

gtauriello linked a pull request Apr 8, 2020 that will close this issue

[WIP] Scripts to retrieve data from Nextstrain #24

Open

gtauriello linked a pull request Apr 9, 2020 that will close this issue

[WIP] Scripts to retrieve data from Nextstrain #24

Open
