protein-structure-template

Set of tools for writing templates for structured data extraction from protein structures.

Templating

Available Fields

Field	Type	Description
`$name`	`str`	identifier for the protein
`$residue_dist`	`dict[str, float]`	distribution of amino acid occurrences, as a dict ordered from highest to lowest occurrence
`$secondary_structure`	`dict[str, float]`	secondary structure distribution; keys are helix, sheet, and coil, values are percentages of the overall structure, sorted from highest to lowest (most frequent key first)
`$isoelectric`	`float`	isoelectric point: the pH at which the protein carries no net charge
`$sasa`	`float`	solvent accessible surface area

`$TMAlign`	🚨TODO🚨
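As a rough illustration of how these fields might be filled in, the sketch below uses Python's standard `string.Template`, whose `$name` placeholder syntax matches the field names in the table. This is an assumption about the mechanism, not a description of the package's own code: the template text, the `residue_distribution` helper, and all values are hypothetical, and the helper returns fractions where the project may use percentages.

```python
from collections import Counter
from string import Template


def residue_distribution(sequence: str) -> dict[str, float]:
    """Fraction of each amino acid, ordered from most to least frequent."""
    counts = Counter(sequence)
    total = sum(counts.values())
    # most_common() sorts by count (descending); dicts keep insertion order.
    return {aa: n / total for aa, n in counts.most_common()}


# Hypothetical template text; only the $field names come from the table.
TEMPLATE = Template(
    "Protein $name has isoelectric point $isoelectric and SASA $sasa. "
    "Residue distribution: $residue_dist. "
    "Secondary structure: $secondary_structure."
)

# Placeholder values for illustration, not real measurements.
fields = {
    "name": "example_protein",
    "residue_dist": residue_distribution("AAAGGV"),
    "secondary_structure": {"sheet": 60.0, "coil": 40.0, "helix": 0.0},
    "isoelectric": 6.5,
    "sasa": 1234.5,
}

text = TEMPLATE.substitute(fields)
print(text)
```

`Template.substitute` raises `KeyError` when a field is missing, which is a convenient way to catch templates that reference a field that has not been computed yet, such as `$TMAlign`.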

Tools

DSSP

Installation

conda install -c salilab dssp

If you get errors about libboost_threads*, install the library via conda with:

conda install anaconda::libboost==1.73.0

Usage

mkdssp -i [pdb/cif] -o [output.dssp]
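When DSSP is run over many structures it can be convenient to call it from Python. The following is a minimal sketch that only wraps the command line shown above; the function names are hypothetical, and the `-i`/`-o` flags are copied from the usage line rather than checked against any particular mkdssp version.

```python
import shutil
import subprocess
from pathlib import Path


def mkdssp_command(structure: Path, output: Path, binary: str = "mkdssp") -> list[str]:
    """Build the argument list for the mkdssp call shown in the usage line."""
    return [binary, "-i", str(structure), "-o", str(output)]


def run_mkdssp(structure: Path, output: Path, binary: str = "mkdssp") -> Path:
    """Run mkdssp on one PDB/mmCIF file and return the path of the DSSP output."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH; is DSSP installed?")
    # check=True raises CalledProcessError if mkdssp exits with an error.
    subprocess.run(mkdssp_command(structure, output, binary), check=True)
    return output
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.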

Output

Taken directly from the (now defunct) DSSP webpage:

HEADER    HYDROLASE   (SERINE PROTEINASE)         17-MAY-76   1EST
...
  240  1  4  4  0 TOTAL NUMBER OF RESIDUES, NUMBER OF CHAINS,
                  NUMBER OF SS-BRIDGES(TOTAL,INTRACHAIN,INTERCHAIN)                .
 10891.0   ACCESSIBLE SURFACE OF PROTEIN (ANGSTROM**2)
  162 67.5   TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(J)  ; PER 100 RESIDUES
    0  0.0   TOTAL NUMBER OF HYDROGEN BONDS IN     PARALLEL BRIDGES; PER 100 RESIDUES
   84 35.0   TOTAL NUMBER OF HYDROGEN BONDS IN ANTIPARALLEL BRIDGES; PER 100 RESIDUES
...
   26 10.8   TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+2)
   30 12.5   TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+3)
   10  4.2   TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+4)
...
  #  RESIDUE AA STRUCTURE BP1 BP2  ACC   N-H-->O  O-->H-N  N-H-->O  O-->H-N
    2   17   V  B 3   +A  182   0A   8  180,-2.5 180,-1.9   1,-0.2 134,-0.1
                                   ...Next two lines wrapped as a pair...
                                    TCO  KAPPA ALPHA  PHI   PSI    X-CA   Y-CA   Z-CA
                                  -0.776 360.0   8.1 -84.5 125.5  -14.7   34.4   34.8
                                   ...Next two lines wrapped as a pair...
                                               CHAIN AUTHCHAIN
                                                   A         A
....;....1....;....2....;....3....;....4....;....5....;....6....;....7..
    .-- sequential resnumber, including chain breaks as extra residues
    |    .-- original PDB resname, not nec. sequential, may contain letters
    |    | .-- one-letter chain ID, if any
    |    | | .-- amino acid sequence in one letter code
    |    | | |  .-- secondary structure summary based on columns 19-38
    |    | | |  | xxxxxxxxxxxxxxxxxxxx recommend columns for secstruc details
    |    | | |  | .-- 3-turns/helix
    |    | | |  | |.-- 4-turns/helix
    |    | | |  | ||.-- 5-turns/helix
    |    | | |  | |||.-- geometrical bend
    |    | | |  | ||||.-- chirality
    |    | | |  | |||||.-- beta bridge label
    |    | | |  | ||||||.-- beta bridge label
    |    | | |  | |||||||   .-- beta bridge partner resnum
    |    | | |  | |||||||   |   .-- beta bridge partner resnum
    |    | | |  | |||||||   |   |.-- beta sheet label
    |    | | |  | |||||||   |   ||   .-- solvent accessibility
    |    | | |  | |||||||   |   ||   |
  #  RESIDUE AA STRUCTURE BP1 BP2  ACC
    |    | | |  | |||||||   |   ||   |
   35   47 A I  E     +     0   0    2
   36   48 A R  E >  S- K   0  39C  97
   37   49 A Q  T 3  S+     0   0   86
   38   50 A N  T 3  S+     0   0   34
   39   51 A W  E <   -KL  36  98C   6
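The fixed-column layout above can be read with plain string slicing. The sketch below computes a helix/sheet/coil percentage distribution like the one described for `$secondary_structure`. It is an illustration under stated assumptions, not the package's parser: the function name is hypothetical, and the reduction of DSSP's states to three classes (H, G, I as helix; E, B as sheet; everything else as coil) is one common convention among several.

```python
from collections import Counter

# Assumed 8-state to 3-state reduction; other mappings are also in use.
THREE_STATE = {"H": "helix", "G": "helix", "I": "helix", "E": "sheet", "B": "sheet"}


def secondary_structure_distribution(dssp_text: str) -> dict[str, float]:
    """Percentage of residues per class, ordered from most to least frequent."""
    counts = Counter({"helix": 0, "sheet": 0, "coil": 0})
    in_records = False
    for line in dssp_text.splitlines():
        if line.startswith("  #  RESIDUE"):
            in_records = True  # per-residue records start after this header
            continue
        # Skip header lines, short lines, and chain-break records ('!').
        if not in_records or len(line) < 17 or line[13] == "!":
            continue
        # Column 17 (index 16) holds the secondary structure summary.
        counts[THREE_STATE.get(line[16], "coil")] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {k: 100.0 * n / total for k, n in counts.most_common()}


# The five residue records from the excerpt above.
SAMPLE = "\n".join([
    "  #  RESIDUE AA STRUCTURE BP1 BP2  ACC",
    "   35   47 A I  E     +     0   0    2",
    "   36   48 A R  E >  S- K   0  39C  97",
    "   37   49 A Q  T 3  S+     0   0   86",
    "   38   50 A N  T 3  S+     0   0   34",
    "   39   51 A W  E <   -KL  36  98C   6",
])

print(secondary_structure_distribution(SAMPLE))
# {'sheet': 60.0, 'coil': 40.0, 'helix': 0.0}
```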

TMAlign

Installation

Get TMalign.cpp from the Zhang Lab.
Compile with: g++ -static -O3 -ffast-math -lm -o TMalign TMalign.cpp. The resulting binary serves as an input to the tmalign.py file. Note: some machines may not support the -static flag; remove it if compilation fails.

Note: this is available on ash.cels.anl.gov at /home/khippe/github/tm-align/TMalign

Usage

./TMalign struct1.pdb struct2.pdb

Note: It appears to work on mmCIF files, but does not advertise support for them.

Output

Results are written to standard output. Look for the two lines beginning with TM-score=, each followed by a float: the first is the alignment score normalized by the length of the first structure, the second is the score normalized by the length of the second structure.
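A minimal sketch of collecting the two scores from Python follows. It assumes the score lines start with `TM-score=` followed by a float, which is how the TM-align builds we are aware of print them; adjust the pattern if your build differs. The function names and the sample text in the usage note are hypothetical, and this is not necessarily how tmalign.py does it.

```python
import re
import subprocess

# Matches lines such as "TM-score= 0.50000 (if normalized by length of Chain_1 ...)".
TM_SCORE_RE = re.compile(r"^TM-score=\s*([0-9.]+)", re.MULTILINE)


def parse_tm_scores(stdout: str) -> tuple[float, float]:
    """Return (score normalized by structure 1, score normalized by structure 2)."""
    scores = [float(s) for s in TM_SCORE_RE.findall(stdout)]
    if len(scores) < 2:
        raise ValueError("expected two TM-score lines in the TM-align output")
    return scores[0], scores[1]


def run_tmalign(struct1: str, struct2: str, binary: str = "./TMalign") -> tuple[float, float]:
    """Run the compiled TM-align binary on two structures and parse its stdout."""
    result = subprocess.run(
        [binary, struct1, struct2], capture_output=True, text=True, check=True
    )
    return parse_tm_scores(result.stdout)
```

For example, `parse_tm_scores("TM-score= 0.50000 (Chain_1)\nTM-score= 0.25000 (Chain_2)\n")` returns `(0.5, 0.25)`; the numbers here are made up for illustration.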

Goals

Single protein/gene
- DSSP for secondary structure
- PDB header extraction: syntactically meaningful metadata extraction. Update: if we are only using computational predictions, the headers will not be meaningful.
- point cloud stats(??, spread?)
- structural motifs (TIM barrel, Greek key)
- structure biochemical stats (amino acid distribution, isoelectric point, solvent accessibility, etc.). Update: more stats are possible, but the basic ones are implemented.
  - Radius of Gyration
  - number/type of inter-residue contacts
- binding domain, if applicable (this might be hard)
- (some tool, leaving in for edit) for identifying poorly folded regions
Two protein/genes (includes all above)
- TM-align for pairwise structural similarity
Many protein/genes (includes all above)
- Ideally this would be an MSA for structures that shows the relationships between them all, but I think this is a full-scale project
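Two of the single-protein stats listed above, radius of gyration and the number of inter-residue contacts, can be sketched with the standard library alone. This is a rough illustration under assumptions that are not from the project: coordinates are one point per residue (for example C-alpha atoms), the radius of gyration is unweighted by mass, and the 8 Å cutoff and minimum sequence separation are hypothetical defaults.

```python
import math
from itertools import combinations

Point = tuple[float, float, float]


def radius_of_gyration(coords: list[Point]) -> float:
    """Unweighted radius of gyration: RMS distance of points from their centroid."""
    n = len(coords)
    centroid = tuple(sum(p[i] for p in coords) / n for i in range(3))
    return math.sqrt(sum(math.dist(p, centroid) ** 2 for p in coords) / n)


def count_contacts(coords: list[Point], cutoff: float = 8.0, min_separation: int = 3) -> int:
    """Count residue pairs within `cutoff` that are at least `min_separation` apart in sequence."""
    return sum(
        1
        for (i, p), (j, q) in combinations(enumerate(coords), 2)
        if j - i >= min_separation and math.dist(p, q) <= cutoff
    )


# Toy coordinates along a line, for illustration only.
points = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0), (9.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
print(radius_of_gyration([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]))  # 1.0
print(count_contacts(points, cutoff=8.0, min_separation=2))    # 2
```

The pairwise loop is quadratic in the number of residues, which is acceptable for single proteins; a spatial index would be the usual replacement for very large structures.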

TODO

- generalized argument parser
- generalized output formatter

