Introduction

This repository contains the code for the winning project of the 2023 BioHack NYC hackathon. Our goal was to use graph neural networks to predict the identity of amino acids within ligand binding pockets.

Inspiration

One of the major goals in the field of protein design is creating a protein that can bind any arbitrary small molecule. We were particularly struck by the work of McCann et al., who designed a protein that selectively binds the VX nerve agent. The protein was designed using protEvolver, a physics-based genetic algorithm that optimizes protein-ligand interfaces. Methods like these are slow and computationally expensive. The success of proteinMPNN inspired us to try to create a similar machine learning model that could incorporate small-molecule information in the sequence prediction task. Such a model could greatly speed up the design process.

N.B. Since our work on this project, the developers of proteinMPNN have released LigandMPNN.

Data

To run this repository yourself, first download the PDBbind refined set and extract the download into this repository.
This dataset contains high-resolution protein-ligand complexes with well-characterized binding information. You can read more about the dataset on the PDBbind website.
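After extracting the download, it can help to confirm that the complexes are where the notebooks expect them. The sketch below is only an illustration: it assumes the refined set unpacks into one folder per PDB ID containing `<id>_pocket.pdb` and `<id>_ligand.mol2`, which is how the set is commonly distributed. The function name `list_complexes` is hypothetical and not part of this repository.

```python
from pathlib import Path


def list_complexes(root):
    """Return the PDB IDs under `root` that have both a pocket and a ligand file.

    Assumed layout (not confirmed by this repository):
        root/<pdb_id>/<pdb_id>_pocket.pdb
        root/<pdb_id>/<pdb_id>_ligand.mol2
    """
    root = Path(root)
    complete = []
    for entry in sorted(root.iterdir()):
        if not entry.is_dir():
            continue  # skip index files and other loose files
        pdb_id = entry.name
        pocket = entry / f"{pdb_id}_pocket.pdb"
        ligand = entry / f"{pdb_id}_ligand.mol2"
        if pocket.is_file() and ligand.is_file():
            complete.append(pdb_id)
    return complete
```

Calling `list_complexes("refined-set")` (the folder name is an assumption, so use whatever the archive actually extracts to) should return a few thousand IDs if the extraction worked.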

Environment

Use the environment.yml file to install all required dependencies:

conda env create -f environment.yml

Results

Our model achieved 78% sequence recovery across the entire validation set.
There appears to be a slight bias towards over-represented amino acids and away from under-represented amino acids.
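One way to look at this kind of bias is to compare how often each amino acid appears in the predictions against how often it appears in the native pockets. The snippet below is a minimal sketch of that comparison, not the analysis from the evaluation notebook. `composition_shift` is a hypothetical helper, and sequences are assumed to be strings of one-letter codes.

```python
from collections import Counter


def composition_shift(native, predicted):
    """Predicted frequency minus native frequency for each residue type.

    `native` and `predicted` are lists of one-letter sequences (an assumption).
    Positive values mean the residue is predicted more often than it occurs.
    """
    native_counts = Counter("".join(native))
    pred_counts = Counter("".join(predicted))
    n_native = sum(native_counts.values())
    n_pred = sum(pred_counts.values())
    if n_native == 0 or n_pred == 0:
        raise ValueError("both inputs must contain at least one residue")
    residues = sorted(set(native_counts) | set(pred_counts))
    return {
        r: pred_counts[r] / n_pred - native_counts[r] / n_native
        for r in residues
    }
```

A bias towards common residues would show up as positive shifts for residues that are already frequent in the native set and negative shifts for rare ones.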

As expected, the model's performance was better on proteins with a large number of homologs. Interestingly, the average sequence recovery on proteins with no homologs was 30%, with some individual proteins achieving 100% sequence recovery. This demonstrates the model's ability to generalize to novel protein-ligand complexes, but its performance is still too limited for it to be considered a valuable tool for design.
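For readers unfamiliar with the metric, sequence recovery is the fraction of positions where the predicted residue matches the native one. The following is a minimal sketch of that definition. The function names are hypothetical, and averaging per pocket (rather than pooling all positions) is an assumption about how the reported numbers were aggregated.

```python
def sequence_recovery(native, predicted):
    """Fraction of positions where the predicted residue equals the native one."""
    if len(native) != len(predicted):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must not be empty")
    matches = sum(n == p for n, p in zip(native, predicted))
    return matches / len(native)


def mean_recovery(pairs):
    """Average per-pocket recovery over (native, predicted) pairs.

    Per-pocket averaging is an assumption; pooling all positions is an
    equally plausible way to report a set-wide number.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(sequence_recovery(n, p) for n, p in pairs) / len(pairs)
```

For example, `sequence_recovery("ACDE", "ACDF")` matches three of four positions.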

Result Recreation Walkthrough

The notebooks need to be run in a specific order.

1. Clean Pockets and Examine Ligand Data
2. Generate Amino Acid Embeddings
3. Compile Graphs
4. Writing Fasta (optional)
5. Sequence Similarity (optional)
6. Model Training
7. Model Evaluation
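If you would rather run the notebooks from a script than open each one, the order above can be expressed as a list of `jupyter nbconvert` commands. This is a sketch under assumptions: the file names below are our best reading of the repository's file listing (including the spellings "Complile Graphs.ipynb" and "Sequence_similarity.ipynb"), so adjust them if they differ in your checkout. The helper names are hypothetical.

```python
import subprocess

# (file name, optional?) in the order given above. File names are assumptions
# taken from the repository listing; verify them before running.
NOTEBOOKS = [
    ("Clean Pockets and Examine Ligand Data.ipynb", False),
    ("Generate Amino Acid Embeddings.ipynb", False),
    ("Complile Graphs.ipynb", False),  # spelled this way in the listing
    ("Writing Fasta.ipynb", True),
    ("Sequence_similarity.ipynb", True),
    ("Model Training.ipynb", False),
    ("Model Evaluation.ipynb", False),
]


def build_commands(notebooks=NOTEBOOKS, skip_optional=False):
    """Return one nbconvert command per notebook, preserving the order."""
    commands = []
    for name, optional in notebooks:
        if optional and skip_optional:
            continue
        commands.append(
            ["jupyter", "nbconvert", "--to", "notebook",
             "--execute", "--inplace", name]
        )
    return commands


def run_all(skip_optional=False):
    """Execute the notebooks in order, stopping at the first failure."""
    for cmd in build_commands(skip_optional=skip_optional):
        subprocess.run(cmd, check=True)
```

Calling `run_all(skip_optional=True)` from the repository root, inside the conda environment, would run only the five required notebooks. Stopping at the first failure matters here because each notebook presumably depends on files written by the previous one.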

