update documentation #7

Merged
merged 2 commits into from
Mar 25, 2024
76 changes: 47 additions & 29 deletions README.md
@@ -1,7 +1,10 @@
[![build](https://github.com/haddocking/DeepRank-GNN-esm/actions/workflows/build.yml/badge.svg)](https://github.com/haddocking/DeepRank-GNN-esm/actions/workflows/build.yml)

# DeepRank-GNN-esm
Graph Network for protein-protein interface including language model features

Graph Network for protein-protein interface including language model features.

For details refer to our publication at https://academic.oup.com/bioinformaticsadvances/article/4/1/vbad191/7511844

## Installation

@@ -36,14 +39,16 @@ $ pytest tests/

### As a scoring function

We provide a command-line interface for DeepRank-GNN-ESM that can be used to score protein-protein complexes. The command-line interface can be used as follows:
We provide a command-line interface for DeepRank-GNN-esm that can easily be used to score protein-protein complexes. The command-line interface can be used as follows:

```bash
$ deeprank-gnn-esm-predict -h
usage: deeprank-gnn-esm-predict [-h] pdb_file
$ deeprank-gnn-esm-predict -h
usage: deeprank-gnn-esm-predict [-h] pdb_file chain_id_1 chain_id_2

positional arguments:
pdb_file Path to the PDB file.
chain_id_1 First chain ID.
chain_id_2 Second chain ID.

optional arguments:
-h, --help show this help message and exit
@@ -57,59 +62,70 @@ $ wget https://files.rcsb.org/view/1B6C.pdb -q

# make sure the environment is activated
$ conda activate deeprank-gnn-esm-gpu-env
(deeprank-gnn-esm-gpu-env) $ deeprank-gnn-esm-predict 1B6C.pdb
2023-06-28 06:08:21,889 predict:64 INFO - Setting up workspace - /home/DeepRank-GNN-esm/1B6C-gnn_esm_pred
(deeprank-gnn-esm-gpu-env) $ deeprank-gnn-esm-predict 1B6C.pdb A B
2023-06-28 06:08:21,889 predict:64 INFO - Setting up workspace - /home/DeepRank-GNN-esm/1B6C-gnn_esm_pred_A_B
2023-06-28 06:08:21,945 predict:72 INFO - Renumbering PDB file.
2023-06-28 06:08:22,294 predict:104 INFO - Reading sequence of PDB 1B6C.pdb
2023-06-28 06:08:22,423 predict:131 INFO - Generating embedding for protein sequence.
2023-06-28 06:08:22,423 predict:132 INFO - ################################################################################
2023-06-28 06:08:32,447 predict:138 INFO - Transferred model to GPU
2023-06-28 06:08:32,450 predict:147 INFO - Read /home/DeepRank-GNN-esm/1B6C-gnn_esm_pred/all.fasta with 8 sequences
2023-06-28 06:08:32,459 predict:157 INFO - Processing 1 of 2 batches (4 sequences)
2023-06-28 06:08:34,061 predict:157 INFO - Processing 2 of 2 batches (4 sequences)
2023-06-28 06:08:32,450 predict:147 INFO - Read /home/1B6C-gnn_esm_pred_A_B/all.fasta with 2 sequences
2023-06-28 06:08:32,459 predict:157 INFO - Processing 1 of 1 batches (2 sequences)
2023-06-28 06:08:36,462 predict:200 INFO - ################################################################################
2023-06-28 06:08:36,470 predict:205 INFO - Generating graph, using 79 processors
Graphs added to the HDF5 file
Embedding added to the /home/DeepRank-GNN-esm/1B6C-gnn_esm_pred/graph.hdf5 file
2023-06-28 06:09:03,345 predict:220 INFO - Graph file generated: /home/DeepRank-GNN-esm/1B6C-gnn_esm_pred/graph.hdf5
Embedding added to the /home/1B6C-gnn_esm_pred_A_B/graph.hdf5 file
2023-06-28 06:09:03,345 predict:220 INFO - Graph file generated: /home/DeepRank-GNN-esm/1B6C-gnn_esm_pred_A_B/graph.hdf5
2023-06-28 06:09:03,345 predict:226 INFO - Predicting fnat of protein complex.
2023-06-28 06:09:03,345 predict:234 INFO - Using device: cuda:0
# ...
2023-06-28 06:09:07,794 predict:280 INFO - Predicted fnat for 1B6C: 0.342
2023-06-28 06:09:07,803 predict:290 INFO - Output written to /home/DeepRank-GNN-esm/1B6C-gnn_esm_pred/output.csv
2023-06-28 06:09:07,794 predict:280 INFO - Predicted fnat for 1B6C between chainA and chainB: 0.359
2023-06-28 06:09:07,803 predict:290 INFO - Output written to /home/DeepRank-GNN-esm/1B6C-gnn_esm_pred/GNN_esm_prediction.csv
```

From the output above you can see that the predicted fnat for the 1B6C complex is **0.342**, this information is also written to the `output.csv` file.
From the output above you can see that the predicted fnat for the 1B6C complex is **0.359**; this value is also written to the `GNN_esm_prediction.csv` file.

The command above will generate a folder in the current working directory, containing the following:

```
1B6C-gnn_esm_pred
├── 1B6C.A.pt
├── 1B6C.B.pt
├── 1B6C.pdb
├── GNN_esm_prediction.csv
├── GNN_esm_prediction.hdf5
├── graph.hdf5
└── output.csv
1B6C-gnn_esm_pred_A_B
├── 1B6C.pdb #input pdb file
├── all.fasta #fasta sequence for the pdb input
├── 1B6C.A.pt #esm-2 embedding for chainA in protein 1B6C
├── 1B6C.B.pt #esm-2 embedding for chainB in protein 1B6C
├── graph.hdf5 #input protein graph in hdf5 format
├── GNN_esm_prediction.hdf5 #prediction output in hdf5 format
└── GNN_esm_prediction.csv #prediction output in csv format
```
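
The CSV output listed above can be inspected with a few lines of Python. This is a minimal sketch, not part of the package: `read_predictions` is a hypothetical helper, and since the column names of `GNN_esm_prediction.csv` are not documented here, the code makes no assumption about them and simply returns each row keyed by whatever header the file contains.

```python
import csv
from pathlib import Path


def read_predictions(csv_path):
    """Return the rows of a prediction CSV as a list of dicts keyed by the file's header."""
    with Path(csv_path).open(newline="") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    # Hypothetical location: adjust to the folder created by your own run.
    path = Path("1B6C-gnn_esm_pred_A_B") / "GNN_esm_prediction.csv"
    if path.exists():
        for row in read_predictions(path):
            print(row)
    else:
        print(f"{path} not found; run deeprank-gnn-esm-predict first")
```

All values are returned as strings, so convert the score column to `float` yourself once you have checked its name in your own output file.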

* * *
### As a framework

### Note about input pdb files

To ensure that interface residues are mapped to the correct esm-2 embeddings, make sure that residue numbering in the PDB file is continuous and starts at residue '1' for every chain.

We provide a script (scripts/pdb_renumber.py) to renumber PDB files accordingly.
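
Before renumbering, it can help to check whether a file already satisfies this requirement. The sketch below is an illustration only and is not the repository's `pdb_renumber.py`: the function names are hypothetical, it reads only `ATOM` records using the fixed PDB columns (chain ID in column 22, residue number in columns 23-26), and it ignores insertion codes.

```python
from pathlib import Path


def residue_numbers_by_chain(pdb_lines):
    """Collect residue numbers per chain, in file order, from ATOM records."""
    chains = {}
    for line in pdb_lines:
        if not line.startswith("ATOM") or len(line) < 26:
            continue
        chain_id = line[21]          # PDB column 22
        res_num = int(line[22:26])   # PDB columns 23-26 (insertion code ignored)
        numbers = chains.setdefault(chain_id, [])
        # Consecutive atoms of the same residue share a number; record it once.
        if not numbers or numbers[-1] != res_num:
            numbers.append(res_num)
    return chains


def is_continuous_from_one(numbers):
    """True if the residue numbers are exactly 1, 2, ..., n."""
    return numbers == list(range(1, len(numbers) + 1))


def check_pdb_numbering(pdb_lines):
    """Map each chain ID to True if its numbering is continuous and starts at 1."""
    return {
        chain: is_continuous_from_one(numbers)
        for chain, numbers in residue_numbers_by_chain(pdb_lines).items()
    }


if __name__ == "__main__":
    pdb_path = Path("1B6C.pdb")  # example file name; adjust to your input
    if pdb_path.exists():
        for chain, ok in check_pdb_numbering(pdb_path.read_text().splitlines()).items():
            print(f"chain {chain}: {'ok' if ok else 'needs renumbering'}")
```

Any chain reported as needing renumbering should be renumbered before embeddings are generated.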


#### Generate esm-2 embeddings for your protein

#### Generate ems-2 embeddings for your protein
1. Generate fasta sequence in bulk, use script 'get_fasta.py'
1. To generate fasta sequences from PDBs, use script 'get_fasta.py'
```bash
usage: get_fasta.py [-h] pdb_dir output_fasta_name
usage: get_fasta.py [-h] pdb_file_path chain_id1 chain_id2

positional arguments:
pdb_dir Path to the directory containing PDB files
output_fasta_name Name of the combined output FASTA file
pdb_file_path Path to the directory containing PDB files
chain_id1 Chain ID for the first sequence
chain_id2 Chain ID for the second sequence

options:
-h, --help show this help message and exit
```


python scripts/get_fasta.py tests/data/pdb/1ATN/ A B


2. To generate embeddings in bulk from combined fasta files, use the script provided inside the esm-2 package:

```bash
@@ -123,6 +139,7 @@ The command above will generate a folder in the current working directory, conta
Replace 'esm_2_installation_location' with your installation location, 'all.fasta' with the fasta file generated above, and 'tests/data/embedding/1ATN/' with the output folder name for the esm embeddings.

#### Generate graph

* Example code to generate residue graphs in hdf5 format:
```python
from deeprank_gnn.GraphGenMP import GraphHDF5
@@ -157,6 +174,7 @@ The command above will generate a folder in the current working directory, conta
```

#### Use pre-trained models to predict

* Example code to use pre-trained DeepRank-GNN-esm model
```python
from deeprank_gnn.ginet import GINet
@@ -167,7 +185,7 @@ The command above will generate a folder in the current working directory, conta
target = "fnat"
edge_attr = ["dist"]
threshold = 0.3
pretrained_model = deeprank-GNN-esm/paper_pretrained_models/scoring_of_docking_models/gnn_esm/treg_yfnat_b64_e20_lr0.001_foldall_esm.pth.tar
pretrained_model = 'deeprank-GNN-esm/paper_pretrained_models/scoring_of_docking_models/gnn_esm/treg_yfnat_b64_e20_lr0.001_foldall_esm.pth.tar'
node_feature = ["type", "polarity", "bsa", "charge", "embedding"]
device_name = "cuda:0"
num_workers = 10