Sequoia: Distance based secondary structure assignment with node classification

This repository contains the implementation of a method addressing the assignment of secondary structure in proteins using only inter-atomic distances (or a subset of these distances), without knowledge of the protein sequence. A protein is modelled as a graph of its atoms or of its residues, depending on the geometric scale considered. For graphs of residues, edge features are computed using a generalization of the standard dihedral angle to a set of consecutive atoms. This computation relies on a geometric relation that allows dihedral angles to be reconstructed (modulo their sign) from inter-atomic distances. We then formalize secondary structure assignment as a node classification problem and use a message passing neural network as one approximate solution. We also evaluate the impact of noise on the assignment of secondary structures, as well as the case where only distances between C-alpha atoms are known, which is a more realistic scenario for Nuclear Magnetic Resonance (NMR) measurements.
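As an illustration of the geometric relation mentioned above, the following is a minimal sketch (not the repository's implementation) of how the cosine of a standard four-atom dihedral angle can be recovered from the six pairwise distances alone. The function names are hypothetical. Because only the cosine is obtained, the sign of the angle remains undetermined, which matches the "modulo their sign" caveat.

```python
import math


def dihedral_cos_from_distances(d12, d23, d34, d13, d24, d14):
    """Cosine of the dihedral angle around the 2-3 axis from six pairwise distances."""
    # Dot products of the bond vectors b1 = p2 - p1, b2 = p3 - p2, b3 = p4 - p3,
    # recovered from squared distances (law of cosines).
    b1b2 = (d13**2 - d12**2 - d23**2) / 2.0
    b2b3 = (d24**2 - d23**2 - d34**2) / 2.0
    b1b3 = (d14**2 - d12**2 - d23**2 - d34**2) / 2.0 - b1b2 - b2b3
    # (b1 x b2) . (b2 x b3) expanded with the Binet-Cauchy identity.
    numerator = b1b2 * b2b3 - b1b3 * d23**2
    # Squared norms of the two plane normals b1 x b2 and b2 x b3.
    n1_sq = d12**2 * d23**2 - b1b2**2
    n2_sq = d23**2 * d34**2 - b2b3**2
    cos_phi = numerator / math.sqrt(n1_sq * n2_sq)
    # Clamp to [-1, 1] to absorb floating-point rounding.
    return max(-1.0, min(1.0, cos_phi))


def dihedral_cos_from_points(p1, p2, p3, p4):
    """Convenience wrapper: measure the six distances, then apply the relation."""
    d = math.dist
    return dihedral_cos_from_distances(
        d(p1, p2), d(p2, p3), d(p3, p4), d(p1, p3), d(p2, p4), d(p1, p4)
    )
```

For four points with a dihedral angle of +60 degrees and for their mirror image at -60 degrees, both calls return approximately 0.5, so `math.acos` yields the angle only up to its sign. The sketch assumes that no three consecutive atoms are collinear; otherwise the denominator vanishes.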

Dependencies

scipy, Biopython, pytorch, pytorch-geometric, scikit-learn

Tested and validated with Python 3.8 (CUDA 11.0) and the following versions:

pytorch: 1.11.0+cu102

pytorch_geometric: 2.04

Bio: 1.78

sklearn: 0.23.2

Scripts

0 -

In the following, the filedir, datasetA and datasetB directories each contain a subdirectory that holds the .cif files. Example:

filedir -> 00 -> 1ja2.cif

datasetA -> 00 -> 1na2.cif, 1ba3.cif, ...
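To check that a directory follows this layout before running the scripts, a small helper along the following lines can list the structure files found exactly one level below the top directory. This is a sketch using only the standard library; the function name `list_structure_files` is hypothetical and not part of the repository.

```python
from pathlib import Path


def list_structure_files(filedir, extensions=(".cif",)):
    """Return the structure files located in the direct subdirectories of filedir."""
    root = Path(filedir)
    return sorted(
        path
        for sub in root.iterdir() if sub.is_dir()      # e.g. filedir/00
        for path in sub.iterdir()                      # e.g. filedir/00/1ja2.cif
        if path.is_file() and path.suffix.lower() in extensions
    )


if __name__ == "__main__":
    import sys

    # Print one path per line for the directory given on the command line.
    for path in list_structure_files(sys.argv[1]):
        print(path)
```

Files placed directly in the top directory, or deeper than one subdirectory, are deliberately ignored so that the output reflects the layout described above.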

1 - simple_baseline_display.py

Parses .cif or .pdb data files and prints the results obtained with the First Order Statistics (FOS) method.

python simple_baseline_display.py filedir

2 - sequoia_dataload_multibio.py

Module used to construct distance-based features of the protein using .cif or .pdb files.

3 - sequoia_datadump_multibio.py

Writes the extracted features (using 2 nearest neighbors) to .pkl files for each protein in the subfolders of filedir (cf. 0. for the filedir format). The conformation file makes it possible to consider a single conformation per protein. Warning: if set to False, several conformations in the file may be used, and these may overlap in space.

python sequoia_datadump_multibio.py filedir output_filename nb_neighbors conformation calpha_mode dssp_mode conformation_file

Example:

python sequoia_datadump_multibio.py filedir_example/ test_output.pkl 2 xray False True cullpdb_dict.json
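To make the role of nb_neighbors more concrete, here is a minimal sketch of a k-nearest-neighbor graph built from 3D coordinates, with the inter-point distance attached to each edge. It only illustrates the general idea; the function name `knn_edges` is hypothetical, and the actual feature construction in sequoia_dataload_multibio.py may differ.

```python
import math


def knn_edges(coords, k=2):
    """Return directed edges (i, j, distance) from each point to its k nearest neighbours."""
    edges = []
    for i, p in enumerate(coords):
        # Distances from point i to every other point, paired with the point index.
        dists = [(math.dist(p, q), j) for j, q in enumerate(coords) if j != i]
        # Keep the k closest points (ties are broken by the smaller index).
        for dist, j in sorted(dists)[:k]:
            edges.append((i, j, dist))
    return edges


if __name__ == "__main__":
    # Four made-up points on a line, standing in for atom or residue positions.
    coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (7.0, 0.0, 0.0)]
    for edge in knn_edges(coords, k=2):
        print(edge)
```

The edges are directed: a point can be among the nearest neighbors of another without the converse being true, as happens for the last point in the example.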

4 - sequoia_train_model.py

Trains a GNN for secondary structure prediction using the distance-based features stored in the .pkl files (training set: 75%). Saves the model parameters in the model_path_output file.

python sequoia_train_model.py train_filename classification_type nb_neighbors model_path_output

Example:

python sequoia_train_model.py test_output.pkl helices 2 test_model_output.tch
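For readers unfamiliar with message passing, the sketch below shows a single round of neighborhood aggregation on a small graph in plain Python. It is only a conceptual illustration under simplifying assumptions (mean aggregation, no learned weights, no non-linearity) and is not the network defined in sequoia_network.py; the function name is hypothetical.

```python
def message_passing_round(features, edges):
    """One round of mean-aggregation message passing over directed edges (src, dst)."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in features]
    counts = [0] * len(features)
    for src, dst in edges:
        # Node dst receives the feature vector of its incoming neighbour src.
        for d in range(dim):
            sums[dst][d] += features[src][d]
        counts[dst] += 1
    updated = []
    for i, own in enumerate(features):
        if counts[i] == 0:
            # Isolated node: keep its own features unchanged.
            updated.append(list(own))
            continue
        mean = [s / counts[i] for s in sums[i]]
        # Combine self and neighbourhood information. A trained network would
        # apply learned weights and a non-linearity here instead of an average.
        updated.append([(a + m) / 2.0 for a, m in zip(own, mean)])
    return updated


if __name__ == "__main__":
    node_features = [[1.0], [3.0], [5.0]]
    directed_edges = [(0, 1), (2, 1), (1, 0)]
    print(message_passing_round(node_features, directed_edges))
```

Stacking several such rounds lets each node see progressively larger neighborhoods, after which a per-node classifier can assign a label such as helix or non-helix.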

5 - sequoia_infer_secondary_structures.py

Loads the model from a .tch file and infers secondary structures after parsing a .pdb or .cif file.

python sequoia_infer_secondary_structures.py input_filename classification_type model_filename calpha_mode dssp_mode output_filename (optional: conformation_table)

Example:

python sequoia_infer_secondary_structures.py filedir_example/00/2W3G.cif helices test_model_output.tch 0 1 sequoia_preds.txt cullpdb_dict.json

6 - create_pml_file.py

Reads the predictions file written by sequoia_infer_secondary_structures.py and constructs a .pml file for visualization with PyMOL. Uses zero_residues.py to renumber residues.

python create_pml_file.py predictions_filename input_filename output_directory

Example:

python create_pml_file.py sequoia_preds.txt 1M22.cif .
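As a rough idea of what such a .pml script can contain, the sketch below turns per-residue labels into PyMOL commands. It is not the logic of create_pml_file.py: the input format (a dictionary mapping residue numbers to 'H' for helix or 'L' for anything else) and both function names are assumptions made for illustration, since the layout of the predictions file is not described here.

```python
def helix_ranges(labels):
    """Group consecutive residue numbers labelled 'H' into inclusive (start, end) ranges."""
    ranges = []
    for resi in sorted(r for r, lab in labels.items() if lab == "H"):
        if ranges and resi == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], resi)   # extend the current range
        else:
            ranges.append((resi, resi))          # start a new range
    return ranges


def build_pml(structure_path, labels):
    """Return the text of a PyMOL script that displays the labelled helices."""
    lines = [
        f"load {structure_path}",
        "hide everything",
        "show cartoon",
        "alter all, ss='L'",                     # reset every residue to loop
    ]
    for start, end in helix_ranges(labels):
        selection = f"{start}-{end}" if end > start else str(start)
        lines.append(f"alter resi {selection}, ss='H'")
    lines.append("rebuild")                      # refresh the cartoon representation
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up labels for a short stretch of residues.
    example = {1: "L", 2: "H", 3: "H", 4: "H", 5: "L"}
    print(build_pml("1M22.cif", example))
```

The resulting text can be saved to a file with a .pml extension and opened in PyMOL.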

Datasets

The lists of PDB files for our Dataset A (X-ray crystallography) and Dataset B (NMR conformations) are in datasetA/list_proteins_datasetA.txt and datasetB/list_proteins_datasetB.txt, respectively.

The datasetA and datasetB directories also contain .tgz files with the annotations used for beta-sheet clustering.

Example for datasetA from a directory containing list_proteins_datasetA.txt:

mkdir datasetA && cd datasetA

mkdir 00 && cd 00

wget -i ../../list_proteins_datasetA.txt --no-check-certificate

Finally, some examples of trained models are given in the directory examples_models_data.
