ramanathanlab/protein-structure-template

protein-structure-template

Set of tools for writing templates for structured data extraction from protein structures.

Templating

Available Fields

| Field | Type | Description |
| --- | --- | --- |
| $name | str | Identifier for the protein |
| $residue_dist | dict[str, float] | Distribution of amino acid occurrences, as an ordered dict (highest to lowest occurrence) |
| $secondary_structure | dict[str, float] | Secondary structure distribution. Keys are helix, sheet, and coil; values are percentages of the overall structure, sorted so the most frequent key comes first |
| $isoelectric | float | Isoelectric point: the pH at which the protein carries no net charge |
| $sasa | float | Solvent accessible surface area |
| $TMAlign | | 🚨TODO🚨 |
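
The `$field` names above resemble the placeholders used by Python's `string.Template`, so a template could plausibly be filled as sketched below. This is a minimal illustration under that assumption, not the repository's actual implementation: the template text, the `residue_distribution` helper, and all values are hypothetical.

```python
from collections import Counter
from string import Template


def residue_distribution(sequence: str) -> dict[str, float]:
    """Fraction of each amino acid, ordered from most to least frequent."""
    counts = Counter(sequence)
    total = len(sequence)
    # most_common() sorts by descending count; dicts keep insertion order.
    return {aa: n / total for aa, n in counts.most_common()}


# Hypothetical template text; only the $field names come from the table above.
template = Template(
    "Protein $name has isoelectric point $isoelectric. "
    "Residue distribution: $residue_dist"
)

fields = {
    "name": "example_protein",  # made-up identifier
    "isoelectric": 6.5,         # made-up value
    "residue_dist": residue_distribution("AAGAGL"),  # toy sequence
}

rendered = template.substitute(fields)
print(rendered)
```

`Template.substitute` raises `KeyError` when a placeholder has no value, which helps catch fields that have not been computed yet; `safe_substitute` leaves unknown placeholders in place instead.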

Tools

DSSP

Installation

conda install -c salilab dssp

If you get errors about libboost_threads* you can install the package via conda with

conda install anaconda::libboost==1.73.0

Usage

mkdssp -i [pdb/cif] -o [output.dssp]

Output

Taken directly from the (now defunct) DSSP webpage:

HEADER    HYDROLASE   (SERINE PROTEINASE)         17-MAY-76   1EST
...
  240  1  4  4  0 TOTAL NUMBER OF RESIDUES, NUMBER OF CHAINS,
                  NUMBER OF SS-BRIDGES(TOTAL,INTRACHAIN,INTERCHAIN)                .
 10891.0   ACCESSIBLE SURFACE OF PROTEIN (ANGSTROM**2)
  162 67.5   TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(J)  ; PER 100 RESIDUES
    0  0.0   TOTAL NUMBER OF HYDROGEN BONDS IN     PARALLEL BRIDGES; PER 100 RESIDUES
   84 35.0   TOTAL NUMBER OF HYDROGEN BONDS IN ANTIPARALLEL BRIDGES; PER 100 RESIDUES
...
   26 10.8   TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+2)
   30 12.5   TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+3)
   10  4.2   TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+4)
...
  #  RESIDUE AA STRUCTURE BP1 BP2  ACC   N-H-->O  O-->H-N  N-H-->O  O-->H-N
    2   17   V  B 3   +A  182   0A   8  180,-2.5 180,-1.9   1,-0.2 134,-0.1
                                   ...Next two lines wrapped as a pair...
                                    TCO  KAPPA ALPHA  PHI   PSI    X-CA   Y-CA   Z-CA
                                  -0.776 360.0   8.1 -84.5 125.5  -14.7   34.4   34.8
                                   ...Next two lines wrapped as a pair...
                                               CHAIN AUTHCHAIN
                                                   A         A
....;....1....;....2....;....3....;....4....;....5....;....6....;....7..
    .-- sequential resnumber, including chain breaks as extra residues
    |    .-- original PDB resname, not nec. sequential, may contain letters
    |    | .-- one-letter chain ID, if any
    |    | | .-- amino acid sequence in one letter code
    |    | | |  .-- secondary structure summary based on columns 19-38
    |    | | |  | xxxxxxxxxxxxxxxxxxxx recommend columns for secstruc details
    |    | | |  | .-- 3-turns/helix
    |    | | |  | |.-- 4-turns/helix
    |    | | |  | ||.-- 5-turns/helix
    |    | | |  | |||.-- geometrical bend
    |    | | |  | ||||.-- chirality
    |    | | |  | |||||.-- beta bridge label
    |    | | |  | ||||||.-- beta bridge label
    |    | | |  | |||||||   .-- beta bridge partner resnum
    |    | | |  | |||||||   |   .-- beta bridge partner resnum
    |    | | |  | |||||||   |   |.-- beta sheet label
    |    | | |  | |||||||   |   ||   .-- solvent accessibility
    |    | | |  | |||||||   |   ||   |
  #  RESIDUE AA STRUCTURE BP1 BP2  ACC
    |    | | |  | |||||||   |   ||   |
   35   47 A I  E     +     0   0    2
   36   48 A R  E >  S- K   0  39C  97
   37   49 A Q  T 3  S+     0   0   86
   38   50 A N  T 3  S+     0   0   34
   39   51 A W  E <   -KL  36  98C   6
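
The fixed-column layout above can be read with plain string slicing. The sketch below is one possible way to derive a helix/sheet/coil distribution and a summed per-residue accessibility from classic DSSP output. It assumes the column positions shown in the annotated example (secondary structure at 0-based index 16, ACC at indices 34 to 37) and a common three-state reduction (H, G, I as helix; E, B as sheet; everything else as coil). The function name `parse_dssp` is hypothetical.

```python
from collections import Counter

# Assumed 3-state reduction of DSSP codes; other groupings are possible.
HELIX = {"H", "G", "I"}
SHEET = {"E", "B"}


def parse_dssp(text: str) -> tuple[dict[str, float], int]:
    """Return (secondary-structure percentages, summed per-residue ACC)."""
    counts = Counter({"helix": 0, "sheet": 0, "coil": 0})
    total_acc = 0
    in_residues = False
    for line in text.splitlines():
        if line.startswith("  #  RESIDUE"):
            in_residues = True  # per-residue records follow this header
            continue
        if not in_residues or len(line) < 38 or line[13] == "!":
            continue  # skip preamble, short lines, and chain-break markers
        code = line[16]  # secondary structure summary column
        if code in HELIX:
            counts["helix"] += 1
        elif code in SHEET:
            counts["sheet"] += 1
        else:
            counts["coil"] += 1
        total_acc += int(line[34:38])  # solvent accessibility (ACC) column
    n = sum(counts.values())
    # Percentages ordered from most to least frequent state.
    dist = {k: 100.0 * v / n for k, v in counts.most_common()} if n else {}
    return dist, total_acc
```

A typical call would read the file written by `mkdssp` and pass its contents to `parse_dssp`. Annotated listings like the one above (with the `|` guide lines) are documentation only and are not expected in real output.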

TMAlign

Installation

  1. Get TMalign.cpp from ZhangLab
  2. Compile with: g++ -static -O3 -ffast-math -lm -o TMalign TMalign.cpp. The resulting binary serves as an input to the tmalign.py file. Note: some machines may not support the -static flag; feel free to remove it.

Note: this is available on ash.cels.anl.gov at /home/khippe/github/tm-align/TMalign

Usage

./TMalign struct1.pdb struct2.pdb

Note: It appears to work on mmCIF files, but does not advertise support for them.

Output

Results are written to standard out by default. Look for the two lines of the form TM-score= [float]: the first is the alignment score normalized by the length of the first structure, and the second is the alignment score normalized by the length of the second structure.
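
Pulling the two scores out of standard out could look like the sketch below. It assumes the score lines begin with `TM-score=`, as in the TM-align builds I am aware of; adjust the pattern if your build prints something different. The function names `parse_tm_scores` and `run_tmalign` are hypothetical and are not taken from tmalign.py.

```python
import re
import subprocess

# Assumed format of the score lines: "TM-score= 0.12345 (if normalized by ...)".
TM_SCORE_RE = re.compile(r"TM-score=\s*([0-9]*\.?[0-9]+)")


def parse_tm_scores(stdout: str) -> tuple[float, float]:
    """Return (score normalized by structure 1, score normalized by structure 2)."""
    scores = [float(m) for m in TM_SCORE_RE.findall(stdout)]
    if len(scores) < 2:
        raise ValueError("expected two TM-score lines in TM-align output")
    return scores[0], scores[1]


def run_tmalign(binary: str, struct1: str, struct2: str) -> tuple[float, float]:
    """Run the compiled TMalign binary and parse its standard out."""
    result = subprocess.run(
        [binary, struct1, struct2],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return parse_tm_scores(result.stdout)
```

Keeping the parser separate from the subprocess call makes it possible to test the parsing logic on saved output without having the binary installed.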

Goals

  • Single protein/gene
    • DSSP for secondary structure
    • PDB header extraction: syntactically meaningful metadata extraction. Update: if we are only using computational predictions, the headers will not be meaningful.
    • point cloud stats (??, spread?)
    • structural motifs (TIM barrel, Greek key)
    • structure biochemical stats (aa-distribution, isoelectric points, solvent accessibility, etc.). Update: could do more stats, but the basic ones are implemented.
      • Radius of Gyration
      • number/type of inter-residue contacts
    • binding domain, if applicable (this might be hard)
    • (some tool, leaving in for edit) for identifying poorly folded regions
  • Two protein/genes (includes all above)
    • TM-align for pairwise structural similarity
  • Many protein/genes (includes all above)
    • Ideally this would be a structural MSA that shows the relationships among all of them, but I think this is a full-scale project.
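
As an illustration of one statistic from the single-protein goals, the radius of gyration can be computed directly from atomic coordinates. The sketch below is a minimal, unweighted version (every atom counted equally, for example only the C-alpha atoms); a mass-weighted variant would be an equally reasonable choice. The function name and input format are assumptions, not part of the repository.

```python
import math
from typing import Sequence


def radius_of_gyration(coords: Sequence[Sequence[float]]) -> float:
    """Unweighted radius of gyration of a set of (x, y, z) points."""
    n = len(coords)
    if n == 0:
        raise ValueError("need at least one coordinate")
    # Geometric center of the point cloud.
    centroid = [sum(c[i] for c in coords) / n for i in range(3)]
    # Mean squared distance of the points from the centroid.
    mean_sq = sum(
        sum((c[i] - centroid[i]) ** 2 for i in range(3)) for c in coords
    ) / n
    return math.sqrt(mean_sq)
```

The result is in the same units as the input coordinates, which are ångströms for PDB and mmCIF files.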

TODO

  • generalized argument parser
  • generalized output formatter
