diff --git a/caltech-protein-demo.ipynb b/caltech-protein-demo.ipynb new file mode 100644 index 0000000..9266736 --- /dev/null +++ b/caltech-protein-demo.ipynb @@ -0,0 +1,880 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "fccdc384-f963-4407-8d64-22d20aab1d56", + "metadata": { + "id": "XIyP_0r6zuVc" + }, + "source": [ + "\n", + "\n", + "\n", + "\n", + "
\n", + " Console •\n", + " Docs •\n", + " Templates •\n", + " Discord\n", + "
\n", + "\n", + "# Predicting Protein-Protein Interactions Using a Protein Language Model and Linear Sum Assignment\n", + "\n", + "Welcome! **This is the notebook version of [this post](https://huggingface.co/blog/AmelieSchreiber/protein-binding-partners-with-esm2) by [Amelie Schreiber](https://huggingface.co/blog/AmelieSchreiber/protein-binding-partners-with-esm2).**\n", + "\n", + "#### In this notebook and tutorial, we'll use ESM-2, a **protein language model**, to score pairs of proteins using **masked language modeling loss** in order to **predict pairs of proteins that have a high likelihood of binding to one another**.\n", + "\n", + "First, let's get some background on two major topics we'll cover in this notebook: protein language models, and masked language modeling loss.\n", + "\n", + "## A. Protein Language Models:\n", + "\n", + "Protein language models in biology are computational models that apply the principles of natural language processing (NLP) to the \"language\" of proteins, which is the sequence of amino acids that form a protein. These models treat the sequences of amino acids in proteins similarly to how conventional NLP models treat words in a sentence. The idea is to capture and predict the complex patterns of protein structure, function, and interactions based on their amino acid sequences.\n", + "\n", + "### How They Work:\n", + "1. **Sequence Representation**: Just as words are the basic units of language, amino acids are the basic units of proteins. Protein language models represent proteins as sequences of amino acids, using the one-letter codes (e.g., A for Alanine, R for Arginine) to represent each amino acid.\n", + "\n", + "2. **Learning Patterns**: These models are trained on large databases of known protein sequences, learning patterns and relationships between the amino acids in a sequence. They use algorithms similar to those used in NLP, such as transformers and recurrent neural networks, to capture the contextual relationships between amino acids in a sequence.\n", + "\n", + "3. **Embeddings**: Protein language models generate embeddings for amino acid sequences, which are high-dimensional vector representations that capture the contextual relationships and properties of the sequence. These embeddings can then be used to predict the structure, function, or interactions of the protein.\n", + "\n", + "### Uses:\n", + "- **Protein Structure Prediction**: One of the primary applications is predicting the three-dimensional structure of proteins based on their amino acid sequences. Understanding the structure of a protein is crucial for elucidating its function and for drug discovery efforts.\n", + "\n", + "- **Function Prediction**: These are models like AlphaFold, which can predict the function of proteins by learning from the vast amounts of annotated protein databases. This is particularly useful for newly discovered proteins whose functions are unknown. 
\n", + "\n", + "- **Protein Engineering**: By understanding how changes in amino acid sequences affect protein structure and function, these models can be used to design proteins with desired properties, such as increased stability or novel functionalities.\n", + "\n", + "- **Drug Discovery**: Protein language models can help in identifying potential drug targets and in designing molecules that interact with proteins in specific ways to modulate their function.\n", + "\n", + "- **Understanding Disease Mechanisms**: They can be used to study how genetic mutations affecting protein sequences lead to changes in protein function and contribute to diseases. This can help in identifying potential therapeutic targets.\n", + "\n", + "In summary, protein language models are a powerful tool in computational biology and bioinformatics, offering insights into protein structure, function, and interactions that are fundamental for biological research and pharmaceutical development.\n", + "\n", + "\n", + "## B. Masked Language Modeling Loss:\n", + "\n", + "Masked Language Modeling (MLM) loss is a concept derived from training techniques used in natural language processing (NLP) models, particularly in the context of transformer-based architectures like BERT (Bidirectional Encoder Representations from Transformers). Although originally developed for text data, the concept can also be applied to other sequences, such as proteins in computational biology, as mentioned earlier. Here, I'll explain the concept primarily from the NLP perspective, but the underlying principles are broadly applicable.\n", + "\n", + "\n", + "### What is Masked Language Modeling?\n", + "Masked Language Modeling is a training strategy where a certain percentage of the input tokens (e.g., words in a sentence) are randomly masked, or hidden from the model, during training. The model's task is to predict these masked tokens based only on the context provided by the unmasked tokens. This approach encourages the model to learn a deep, contextual understanding of the language.\n", + "\n", + "### How MLM Loss is Calculated:\n", + "1. **Token Masking**: In the input sequence, a set percentage of tokens are replaced with a special [MASK] token, although variations of this technique might leave the token unchanged or replace it with a random token a certain percentage of the time to improve robustness.\n", + "\n", + "2. **Model Prediction**: The model processes the altered sequence and tries to predict the original token at each masked position. It generates a probability distribution over the entire vocabulary for each masked token, indicating how likely each token is to be the correct replacement.\n", + "\n", + "3. **Loss Calculation**: The MLM loss for a given masked token is calculated by comparing the model's predicted probability distribution against the true token. This is typically done using a loss function suitable for classification tasks, such as Cross-Entropy Loss. The MLM loss for the entire sequence is the average loss across all masked tokens.\n", + "\n", + "4. **Optimization**: The model parameters are updated to minimize the MLM loss. Through this process, the model learns to use contextual information to predict the masked tokens accurately.\n", + "\n", + "### Purpose and Benefits:\n", + "- **Contextual Understanding**: MLM forces the model to learn context-dependent representations of tokens, as it must use the surrounding tokens to predict the masked ones. 
This leads to a rich understanding of language (or sequences in other domains).\n", + "\n", + "- **Bidirectional Context**: Unlike traditional language models that predict each token based on the preceding tokens (left-to-right or right-to-left), MLM allows the model to use both left and right context, resulting in more robust embeddings.\n", + "\n", + "- **Pretraining for Downstream Tasks**: MLM is often used for pretraining language models on large text corpora. The pretrained models can then be fine-tuned on smaller, task-specific datasets, significantly improving performance on a wide range of NLP tasks.\n", + "\n", + "In summary, Masked Language Modeling loss is a crucial component of training strategies that aim to develop models capable of understanding the intricate patterns and relationships within sequences, whether they be in natural language texts or biological sequences like proteins.\n", + "\n", + "\n", + "## C. Predicting Protein-Protein Interactions with MLM Loss + Protein Language Models\n", + "In this session, we use the protein language model ESM-2. \n", + "### 1. Understanding ESM-2 and MLM Loss:\n", + "- **ESM-2**: This model is designed to capture the complex patterns of amino acid sequences that define a protein's structure and function. It uses deep learning to understand the 'language' of proteins, learning from vast databases of known protein sequences.\n", + "- **MLM Loss**: In the context of protein modeling, MLM loss measures how well the model predicts the identity of masked (hidden) amino acids in a sequence based on the surrounding context. The loss is lower when the model's predictions are close to the actual sequences.\n", + "\n", + "### 2. Predicting Protein-Protein Interactions:\n", + "- **Sequence Pairing**: To predict PPIs, pairs of protein sequences are analyzed together. This can involve creating a concatenated sequence from two proteins.\n", + "\n", + "- **Masking Strategy**: Amino acids in one or both proteins might be masked, and the model predicts these masked residues based on the context provided by both sequences. This process evaluates how the presence of one protein sequence affects the prediction accuracy for the other, inferring interaction likelihood.\n", + "\n", + "- **MLM Loss Comparison**: By comparing the MLM loss of different protein pairings, the model can infer potential interactions. A lower MLM loss when two proteins are analyzed together versus separately suggests that the model finds a coherent or complementary context between them, indicating a potential interaction.\n", + "\n", + "### 3. Interpretation and Validation:\n", + "- **Interpretation**: A significant drop in MLM loss for a specific protein pair suggests that the model recognizes a meaningful relationship between their sequences, which could reflect a real biological interaction.\n", + "\n", + "- **Validation**: Predicted interactions can be validated through experimental techniques such as co-immunoprecipitation or yeast two-hybrid assays, providing empirical evidence for the model's predictions.\n", + "\n", + "\n", + "\n", + "## This Notebook\n", + "\n", + "In the paper [Pairing interacting protein sequences using masked language modeling](https://arxiv.org/abs/2308.07136), the authors propose a method that uses either of two protein language models, [MSA Transformer](https://huggingface.co/models?other=MSA) or [ESM-2](https://huggingface.co/docs/transformers/model_doc/esm), to predict how likely it is that a pair of proteins bind to one another. 
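\n",
+ "\n",
+ "To make the masked language modeling loss from section B concrete, here is a minimal sketch of how such a loss can be computed for a single protein sequence with ESM-2 through the Hugging Face `transformers` API. The checkpoint name, the toy sequence, and the 15% masking rate are illustrative choices rather than the exact settings used later in this notebook (and if you want to run it, do so after the installation cells below):\n",
+ "\n",
+ "```python\n",
+ "import torch\n",
+ "from transformers import AutoTokenizer, EsmForMaskedLM\n",
+ "\n",
+ "# Small ESM-2 checkpoint, chosen here only to keep the example light.\n",
+ "model_name = \"facebook/esm2_t6_8M_UR50D\"\n",
+ "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
+ "model = EsmForMaskedLM.from_pretrained(model_name).eval()\n",
+ "\n",
+ "sequence = \"MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ\"  # toy sequence, for illustration only\n",
+ "inputs = tokenizer(sequence, return_tensors=\"pt\")\n",
+ "labels = inputs[\"input_ids\"].clone()\n",
+ "\n",
+ "# Randomly mask ~15% of the residue positions (never the special tokens).\n",
+ "special = tokenizer.get_special_tokens_mask(labels[0].tolist(), already_has_special_tokens=True)\n",
+ "maskable = [i for i, s in enumerate(special) if s == 0]\n",
+ "num_to_mask = max(1, int(0.15 * len(maskable)))\n",
+ "picked = torch.tensor(maskable)[torch.randperm(len(maskable))[:num_to_mask]]\n",
+ "inputs[\"input_ids\"][0, picked] = tokenizer.mask_token_id\n",
+ "\n",
+ "# Cross-entropy is computed only where labels != -100, i.e. at the masked positions.\n",
+ "labels[inputs[\"input_ids\"] != tokenizer.mask_token_id] = -100\n",
+ "\n",
+ "with torch.no_grad():\n",
+ "    loss = model(**inputs, labels=labels).loss\n",
+ "print(\"MLM loss:\", loss.item())\n",
+ "```\n",
+ "\n",
+ "A lower loss means the model finds the masked residues easy to recover from their context; averaging this quantity over several random maskings of a *concatenated pair* of sequences is essentially the score used in the method described next.\n",
+ "\n",
+ "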
In this notebook, we will focus on ESM-2. The method is very simple:\n",
+ "1. We take a list of proteins we would like to test for interactions, then concatenate them in pairs.\n",
+ "2. We use the masked language modeling capabilities of ESM-2 and randomly mask residues, then compute the MLM loss.\n",
+ "3. We average over several iterations of this for each pair of proteins, obtaining a score that indicates how likely two proteins are to bind to one another.\n",
+ "4. We then compute a matrix of such scores.\n",
+ "5. Using this matrix, we are able to solve the associated linear assignment problem to compute optimal binding partners.\n",
+ "\n",
+ "(A condensed code sketch of these five steps appears in the Conclusion at the end of this notebook.)\n",
+ "\n",
+ "#### Predicting protein-protein interactions is a critical task in molecular biology. Here, we'll use the ESM-2 model from Meta AI to compute Masked Language Model (MLM) loss for protein pairs, aiming to find the pairs with the lowest loss. The rationale is that proteins that interact in reality will produce a lower MLM loss than those that don't.\n",
+ "\n",
+ "\n",
+ "##### Help us make this tutorial better! Please provide feedback on the [Discord channel](https://discord.gg/pnCpkwU3G5) or on [X](https://x.com/harperscarroll)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6478f6cf-cb8c-4a2b-bad4-ebb655e21173",
+ "metadata": {
+ "id": "hWI-uRLEyRgb"
+ },
+ "source": [
+ "## Let's begin!\n",
+ "\n",
+ "I used a GPU and dev environment from [brev.dev](https://brev.dev). Click the badge below to get your preconfigured instance:\n",
+ "\n",
+ "[![](https://uohmivykqgnnbiouffke.supabase.co/storage/v1/object/public/landingpage/brevdeploynavy.svg)](https://console.brev.dev/environment/new?instance=T4:g4dn.xlarge&diskStorage=120&name=protein-demo&python=3.10&cuda=12.1.1)\n",
+ "\n",
+ "Once you've checked out your machine and landed in your instance page, select the specs you'd like (I used **Python 3.10 and CUDA 12.1.1**; these should be preconfigured for you if you use the badge above) and click the \"Build\" button to build your verb container. Give this a few minutes.\n",
+ "\n",
+ "A few minutes after your machine has started Running, click the 'Notebook' button on the top right of your screen once it illuminates (you may need to refresh the screen). You will be taken to a Jupyter Lab environment, where you can upload this Notebook.\n",
+ "\n",
+ "Note: You can connect your cloud credits (AWS or GCP) by clicking \"Org: \" on the top right, and in the panel that slides over, click \"Connect AWS\" or \"Connect GCP\" under \"Connect your cloud\" and follow the instructions linked to attach your credentials.\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d9fdb383-710a-4e2f-a7e1-3aed8bd2c588",
+ "metadata": {},
+ "source": [
+ "## 1. Install Libraries"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c801cbcb-0e06-4985-b80f-8e76a25f66eb",
+ "metadata": {},
+ "source": [
+ "Let's install the libraries we'll be using."
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "56147c6c-1e4c-4a5e-8ca0-fc1be239cf14", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.0\u001b[0m\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", + "\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.0.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.0\u001b[0m\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n" + ] + } + ], + "source": [ + "!pip install --upgrade numpy scipy transformers plotly jupyter ipywidgets jupyterlab_widgets -q\n", + "!pip install torch torchvision torchaudio -q" + ] + }, + { + "cell_type": "markdown", + "id": "6534f931-efdd-47be-a48b-530070963f63", + "metadata": {}, + "source": [ + "Later, we'll be using jupyter widgets, so we need to make sure a recent nodejs is installed and jupyter widgets are enabled for Jupyter Lab." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "5b28ada0-c551-4828-9bb3-0be1bc918464", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[38;5;79m2024-02-08 03:05:24 - Installing pre-requisites\u001b[0m\n", + "Hit:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 InRelease\n", + "Hit:2 https://deb.nodesource.com/node_18.x nodistro InRelease \n", + "Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease \n", + "Hit:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease \n", + "Hit:5 http://archive.ubuntu.com/ubuntu jammy-backports InRelease\n", + "Hit:6 http://security.ubuntu.com/ubuntu jammy-security InRelease\n", + "Reading package lists... Done\n", + "Reading package lists... Done\n", + "Building dependency tree... Done\n", + "Reading state information... Done\n", + "ca-certificates is already the newest version (20230311ubuntu0.22.04.1).\n", + "curl is already the newest version (7.81.0-1ubuntu1.15).\n", + "gnupg is already the newest version (2.2.27-3ubuntu2.1).\n", + "apt-transport-https is already the newest version (2.4.11).\n", + "0 upgraded, 0 newly installed, 0 to remove and 42 not upgraded.\n", + "Hit:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 InRelease\n", + "Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease \n", + "Hit:3 https://deb.nodesource.com/node_18.x nodistro InRelease \n", + "Hit:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease \n", + "Hit:5 http://archive.ubuntu.com/ubuntu jammy-backports InRelease \n", + "Hit:6 http://security.ubuntu.com/ubuntu jammy-security InRelease\n", + "Reading package lists... Done\n", + "\u001b[1;32m2024-02-08 03:05:30 - Repository configured successfully. 
To install Node.js, run: apt-get install nodejs -y\u001b[0m\n", + "Hit:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64 InRelease\n", + "Hit:2 https://deb.nodesource.com/node_18.x nodistro InRelease \u001b[0m\n", + "Hit:3 http://security.ubuntu.com/ubuntu jammy-security InRelease \n", + "Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease\n", + "Hit:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease\n", + "Hit:6 http://archive.ubuntu.com/ubuntu jammy-backports InRelease\n", + "Reading package lists... Done\u001b[33m\u001b[33m\u001b[33m\n", + "Building dependency tree... Done\n", + "Reading state information... Done\n", + "42 packages can be upgraded. Run 'apt list --upgradable' to see them.\n", + "Reading package lists... Done\n", + "Building dependency tree... Done\n", + "Reading state information... Done\n", + "nodejs is already the newest version (18.19.0-1nodesource1).\n", + "0 upgraded, 0 newly installed, 0 to remove and 42 not upgraded.\n" + ] + } + ], + "source": [ + "!curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -\n", + "!sudo apt update && sudo apt install nodejs -y" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "89ddc744-926a-4a6f-ba5e-3c90915f88ad", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[LabCleanApp] Cleaning /home/ubuntu/.pyenv/versions/3.10.13/share/jupyter/lab...\n", + "[LabCleanApp] staging not present, skipping...\n", + "[LabCleanApp] Success!\n", + "\u001b[33m(Deprecated) Installing extensions with the jupyter labextension install command is now deprecated and will be removed in a future major version of JupyterLab.\n", + "\n", + "Users should manage prebuilt extensions with package managers like pip and conda, and extension authors are encouraged to distribute their extensions as prebuilt packages \u001b[0m\n", + "Building jupyterlab assets (production, minimized)\n" + ] + } + ], + "source": [ + "!jupyter lab clean\n", + "!jupyter labextension install @jupyter-widgets/jupyterlab-manager" + ] + }, + { + "cell_type": "markdown", + "id": "b1a9f54a-1507-4ab5-bc0e-4bea811475ea", + "metadata": {}, + "source": [ + "Make sure this outputs version >= 18.0.0." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "490fc1aa-6f85-4f88-a7c3-e233f30ac543", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "v18.19.0\n" + ] + } + ], + "source": [ + "!node -v" + ] + }, + { + "cell_type": "markdown", + "id": "a03d695d-41d9-4a47-8b1b-72caed509f08", + "metadata": {}, + "source": [ + "Now, restart Jupyter Lab. \n", + "Exit out of this window. In a Terminal on your laptop, type `brev notebook protein-demo` or `brev notebook ` if the GPU name you chose is different than `protein-demo`. \n", + "Then, click the link that appears in the shell (i.e. Terminal window) " + ] + }, + { + "cell_type": "markdown", + "id": "bc383958-ec2c-4a25-8ee6-a8be38ac84a1", + "metadata": {}, + "source": [ + "## 1.1 Import Libraries" + ] + }, + { + "cell_type": "markdown", + "id": "36566638-4edd-4f39-b81a-3fb47dd6d07c", + "metadata": {}, + "source": [ + "Now let's import the necessary libraries.\n", + "\n", + "- **numpy**: Used for numerical operations, especially for handling matrices.\n", + "- **linear_sum_assignment**: This is an optimization algorithm from the SciPy library that solves the linear sum assignment problem. 
We'll use this to find the optimal pairing of proteins based on the MLM loss.\n",
+ "- **transformers**: This is the Hugging Face library that provides pre-trained models for various NLP tasks. We're using it to load the ESM-2 model and its tokenizer.\n",
+ "- **torch**: The PyTorch library, on which the transformers library is built."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "154e6b22-8d4c-484d-95a8-b5c51f7fe959",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "from scipy.optimize import linear_sum_assignment\n",
+ "from transformers import AutoTokenizer, EsmForMaskedLM\n",
+ "import torch"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b56169c1-1a41-4aba-8320-d88c02fd5fdc",
+ "metadata": {},
+ "source": [
+ "## 2. Initialize the Model & Tokenizer\n",
+ "Here, we load the Meta (f.k.a. Facebook) ESM-2 model using the Hugging Face Transformers library.\n",
+ "- **tokenizer**: This is used to convert protein sequences into a format suitable for the model.\n",
+ "- **model**: This is the ESM-2 model, specifically built for protein sequences."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "b5729711-2089-47b8-9a1e-0213efb1f1b6",
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "application/vnd.jupyter.widget-view+json": {
+ "model_id": "a94148d380374fafb27526d08079b040",
+ "version_major": 2,
+ "version_minor": 0
+ },
+ "text/plain": [
+ "tokenizer_config.json: 0%| | 0.00/95.0 [00:00"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Create an interactive slider for the threshold value with a default of 8.25\n",
+ "interact(plot_graph, threshold=widgets.FloatSlider(min=0.0, max=max_threshold, step=0.05, value=8.25))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "96fd9ec5-b964-4aad-a464-497a4a24eafb",
+ "metadata": {},
+ "source": [
+ "So, for example, try setting the slider to 8.20 or 8.30 and see what kind of predicted interactome results from this choice of MLM loss threshold. You should also adjust the number of masked tokens to see how this affects the graph. In general, masking more residues will raise the threshold needed for connections to appear in the predicted PPI graph. Note that this code may take a few moments to run, since we are computing the loss for all pairs of proteins in our list, but in general the method is a very fast zero-shot method for predicting PPI networks. As a next step, you might also consider training the model to predict masked residues of known interacting pairs in order to finetune it to this task further. Another interesting and important question to answer is how the lengths of the proteins affect this computation. Is the method robust to large variations in lengths, or do the proteins need to be of similar lengths for the method to work?"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "470f673e-fbef-423e-91bb-6f64e1a593db",
+ "metadata": {},
+ "source": [
+ "# Conclusion\n",
+ "Now you should be able to use the ESM-2 model to predict potential protein-protein interactions by comparing the MLM loss of different protein pairings. This method provides a novel way of inferring interactions using deep learning techniques. Remember, this approach provides a heuristic and should be combined with experimental validation for conclusive results. 
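\n",
+ "\n",
+ "For reference, the scoring-plus-assignment pipeline can be condensed into a short sketch. This is a hedged recap rather than the exact code used above: the helper name `score_pair`, the toy sequences, the small checkpoint, and the masking fraction are all illustrative:\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "import torch\n",
+ "from scipy.optimize import linear_sum_assignment\n",
+ "from transformers import AutoTokenizer, EsmForMaskedLM\n",
+ "\n",
+ "model_name = \"facebook/esm2_t6_8M_UR50D\"  # small checkpoint, for illustration\n",
+ "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
+ "model = EsmForMaskedLM.from_pretrained(model_name).eval()\n",
+ "\n",
+ "def score_pair(seq_a, seq_b, n_iters=3, mask_frac=0.15):\n",
+ "    \"\"\"Average MLM loss of the concatenated pair over random maskings (lower = more compatible).\"\"\"\n",
+ "    inputs = tokenizer(seq_a + seq_b, return_tensors=\"pt\")\n",
+ "    special = torch.tensor(tokenizer.get_special_tokens_mask(\n",
+ "        inputs[\"input_ids\"][0].tolist(), already_has_special_tokens=True))\n",
+ "    maskable = torch.nonzero(special == 0).flatten()\n",
+ "    losses = []\n",
+ "    for _ in range(n_iters):\n",
+ "        ids = inputs[\"input_ids\"].clone()\n",
+ "        labels = inputs[\"input_ids\"].clone()\n",
+ "        picked = maskable[torch.randperm(len(maskable))[:max(1, int(mask_frac * len(maskable)))]]\n",
+ "        ids[0, picked] = tokenizer.mask_token_id\n",
+ "        labels[ids != tokenizer.mask_token_id] = -100  # score only the masked positions\n",
+ "        with torch.no_grad():\n",
+ "            out = model(input_ids=ids, attention_mask=inputs[\"attention_mask\"], labels=labels)\n",
+ "        losses.append(out.loss.item())\n",
+ "    return float(np.mean(losses))\n",
+ "\n",
+ "# Toy candidate lists; replace with your own sequences.\n",
+ "proteins_a = [\"MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ\", \"MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDK\"]\n",
+ "proteins_b = [\"MADEEKLPPGWEKRMSRSSGRVYYFNHITNASQ\", \"MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS\"]\n",
+ "\n",
+ "loss_matrix = np.array([[score_pair(a, b) for b in proteins_b] for a in proteins_a])\n",
+ "row_ind, col_ind = linear_sum_assignment(loss_matrix)  # minimizes the total MLM loss\n",
+ "for i, j in zip(row_ind, col_ind):\n",
+ "    print(f\"A{i} paired with B{j} (average MLM loss {loss_matrix[i, j]:.3f})\")\n",
+ "```\n",
+ "\n",
+ "Because `linear_sum_assignment` minimizes the total loss of the chosen pairing, the partners it returns are the mutually most compatible ones under this heuristic.\n",
+ "\n",
+ "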
As a next step, you might try implementing the ideas in [PepMLM: Target Sequence-Conditioned Generation of Peptide Binders via Masked Language Modeling](https://arxiv.org/abs/2310.03842), which finetunes ESM-2 for generating binding partners using masked language modeling, or you might try finetuning ESM-2 in a similar fashion on concatenated pairs of binding partners, with some percentage of the tokens in each binding partner masked, to see if performance is improved.\n", + "\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}