
gtauriello/covid-19-Annotations-on-Structures

 
 


Mapping sequence data onto structures


This repository collects contributions to the "Annotations on Structures" topic of the COVID-19 Biohackathon (April 5-11, 2020).

The context is SWISS-MODEL's involvement in an EU project to combat COVID-19. To accelerate our plan to map relevant annotations onto SARS-CoV-2 protein structures, we are collecting tools and platforms that can automatically generate such annotations from the latest data.

We mainly hope to receive two types of contributions:

  1. Find/generate relevant sequence data (see issues list for inspirational ideas) to be displayed on structures (see section on SWISS-MODEL's annotation system). This should be scripted to enable automated fetching of the latest data.
  2. Write reusable scripts to map the sequence data onto the frame of reference of proteins (this may require translating positions on the genome to positions on the SARS-CoV-2 proteins as listed here). These scripts are expected to be useful for the scripts in point 1.
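
The genome-to-protein translation mentioned in point 2 can be sketched as follows. This is a minimal illustration, not code from this repository: the function name `genome_to_protein_pos` is hypothetical, and it assumes a contiguous forward-strand coding sequence without frameshifts, so it would not work unchanged for the ORF1ab polyprotein. The spike CDS coordinates used in the example are an assumption and should be checked against the genome annotation you work with.

```python
def genome_to_protein_pos(genome_pos, cds_start, cds_end):
    """Map a 1-based genome position to (1-based residue number, 0-based codon offset).

    Assumes a contiguous forward-strand CDS without frameshifts or introns.
    Positions inside a trailing stop codon map to (protein length + 1).
    """
    if not cds_start <= genome_pos <= cds_end:
        raise ValueError("position lies outside the coding sequence")
    offset = genome_pos - cds_start
    return offset // 3 + 1, offset % 3


# Assumed coordinates of the spike CDS (stop codon included) on the reference
# genome; verify them against the annotation you are using.
SPIKE_CDS = (21563, 25384)

residue, codon_offset = genome_to_protein_pos(23403, *SPIKE_CDS)
print(residue, codon_offset)
```

Under these assumptions, genome position 23403 falls on the second base of the codon for residue 614 of the spike protein (P0DTC2). The resulting residue number can be used directly as a start or end position in the annotation format described below.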

Additional topics of interest:

  • For visualization experts: alternative ways to visualize the protein structures.
  • For RDF/JSON-LD experts: define an RDF ontology and map our JSON data (example) to RDF to be used in other knowledge graph efforts. PDBj has made some efforts to map structures to RDF, but these focus on experimental metadata, whereas we consider structural coverage of the proteins more relevant. SIFTS mappings are probably the better starting point here. With a minimal "@context" section referring to UniProt, we might also be able to turn our existing JSON into valid JSON-LD.
  • For protein modelling experts: custom modeling of proteins of interest (e.g. using careful expert-curated target-template alignments or combination of templates)

Preferred technologies

  • Programming languages used within SWISS-MODEL: Python (3.6), C++
  • Dealing with protein structure and sequence data: OpenStructure (example in wiki here)

Guidelines for contributions

Follow the biohackathon's code of conduct and this project's contribution guidelines.

SWISS-MODEL annotation system

NOTE: this is a work in progress and subject to change.

Users can upload annotations through the SWISS-MODEL beta server: https://beta.swissmodel.expasy.org/repository/covid_annotation_upload (a list of projects for registered users can be found here).

Both the user annotations and the display of the viral polyprotein (R1AB_SARS2) are still a work in progress and may have bugs. If you find problems with these prototype SWISS-MODEL features, please add issues to this GitHub project and we will try to address them as soon as possible.

The annotation format is a plain-text format:

  • One line per annotation

  • Each annotation will consist of 5 or 6 comma- or tab-separated values:

    1. ID (UniProtKB AC or MD5 checksum of the sequence)
    2. Start position (1-based)
    3. End position
    4. Color value
    5. Reference (optional)
    6. Annotation comment
  • Example:

    P0DTD1	3400	3450	#FF00FF	https://swissmodel.expasy.org/repository/	My Awesome Annotation
    P0DTC2	230	330	#FFA500	A text reference	One more!
    
  • The Annotation class available in utils facilitates creation of new annotations:

    from utils.sm_annotations import Annotation
    
    # generate example annotations
    annotation = Annotation()
    
    # Annotation of residue range with color red provided as RGB
    annotation.add("P0DTD1", (10, 20), (1.0, 0.0, 0.0), "red anno")
    
    # Annotate another range, this time adding a reference
    # and providing the color blue as hex
    annotation.add("P0DTD1", (21, 30), "#0000FF", "blue anno",
                   reference="https://swissmodel.expasy.org/")
    
    # Print plain text that is accepted by the COVID annotation upload
    print(annotation)
    
    # Or directly do a post request (defaults to SWISS-MODEL beta)
    print("Visit the following url to see awesome things:")
    print(annotation.post(title="awesome things"))

    The last line directly creates a new annotation project and prints its URL. An example can be viewed here.

  • UniProtKB ACs with links can be found in UniProtKB

    • Our SARS-CoV-2 page shows mapping to mature proteins and the correspondence to RefSeq and GenBank.
    • We also have a list of all SARS-CoV-2 proteins that shows an overview of the ACs and their structural coverage.
    • For cleaved proteins, use the parent protein. For instance, an annotation on nsp3 (Non-structural protein 3) must be reported on P0DTD1 (the "parent" protein) with an offset of 818 (as nsp3 starts at position 819 of P0DTD1).
    • ViralZone has a well-described overview of the proteome here.
    • We propose to ignore the shorter polyprotein (P0DTC1, R1A_SARS2) as it's cleaved into the same mature proteins as the longer one (P0DTD1, R1AB_SARS2) with the exception of a very short peptide (Non-structural protein 11 (nsp11), YP_009725312.1).
    • Two proteins of unknown function (P0DTD2 and P0DTD3) are missing from our SARS-CoV-2 page but can safely be used to map annotations and we will provide structures if possible.
    • In addition to the SARS-CoV-2 proteins, it also makes sense to map annotations for Q9BYF1 (ACE2_HUMAN). So far this is the only virus-host interaction for which we have structural information. More interactions have been proposed (e.g. here) but we don't have structures for them (yet).
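
The offset rule for cleaved proteins and the plain-text format described above can be combined in a small helper. The following is a minimal sketch, not part of the repository's `utils`: the function names `to_parent_range` and `format_annotation_line` are hypothetical, and it assumes tab-separated output in which the optional reference column is simply omitted when no reference is given.

```python
def to_parent_range(start, end, offset):
    """Shift a 1-based range on a mature protein onto its parent polyprotein."""
    if start < 1 or end < start:
        raise ValueError("expected 1-based positions with start <= end")
    return start + offset, end + offset


def format_annotation_line(ac, start, end, color, comment, reference=None):
    """Build one tab-separated annotation line (5 fields, or 6 with a reference)."""
    fields = [ac, str(start), str(end), color]
    if reference:
        fields.append(reference)
    fields.append(comment)
    return "\t".join(fields)


# nsp3 starts at position 819 of P0DTD1, so its offset is 818 (see the list above).
NSP3_OFFSET = 818

start, end = to_parent_range(1, 10, NSP3_OFFSET)
line = format_annotation_line("P0DTD1", start, end, "#FF0000",
                              "first 10 residues of nsp3")
print(line)
```

Here residues 1 to 10 of nsp3 are reported as residues 819 to 828 of P0DTD1. The `Annotation` class shown earlier remains the more convenient route when the repository's `utils` are available.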

We are also actively working on extending the structural coverage of the SARS-CoV-2 proteome by using protein structure predictions from colleagues participating in CASP.

Context

Protein structure predictions of SARS-CoV-2 have already proven useful to several research projects. To list a few examples which used our models:

Contributors ✨

Thanks go to these wonderful people (emoji key):


  • Gerardo Tauriello: 📆
  • Xavier Robin: 🔧 📖
  • bienchen: 🔧
  • Andrew W: 🔧 🎨
  • schdaude: 🔧 💻
  • BarbaraTerlouw: 🤔
  • Vasilis J Promponas: 🤔
  • Ben Busby: 🤔 🖋
  • Laura Blum: 🖋
  • tomasMasson: 🖋 💻
  • Didier Barradas Bautista: 🖋
  • Birgit Meldal: 🤔 🖋
  • Ninjani: 💻
  • mehmet: 🖋
  • Michelle Gill: 🖋

This project follows the all-contributors specification. Contributions of any kind welcome!
