This repository contains research code and Jupyter notebooks for detecting remote homology in protein sequences using Protein Language Models (PLMs).
Remote homology detection is a challenging task that involves identifying pairs of proteins with similar structures but low sequence similarity. Specifically, remote homology at the superfamily level involves proteins in the same superfamily but different families. Superfamily membership indicates similar structural characteristics, while family membership indicates high sequence similarity.
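Concretely, SCOP classification codes have the form class.fold.superfamily.family (e.g. `a.1.1.2`), so a pair counts as remote homologs when the codes agree through the superfamily level but differ at the family level. A minimal sketch of this labeling rule (the helper name is ours, not from this repo):

```python
def is_remote_homolog(sccs_a: str, sccs_b: str) -> bool:
    """True if two SCOP codes (class.fold.superfamily.family) share a
    superfamily but belong to different families."""
    a, b = sccs_a.split("."), sccs_b.split(".")
    same_superfamily = a[:3] == b[:3]  # class, fold, and superfamily all match
    same_family = a[:4] == b[:4]       # ...and the family matches too
    return same_superfamily and not same_family

# Same superfamily (a.1.1), different families (a.1.1.1 vs. a.1.1.2)
assert is_remote_homolog("a.1.1.1", "a.1.1.2")
assert not is_remote_homolog("a.1.1.1", "a.1.1.1")  # same family: not "remote"
```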
The repository is organized as follows:
- `notebooks/`: Contains Jupyter notebooks used for data exploration, model training, and evaluation.
- `scripts/`: Contains Python scripts for preprocessing data, training models, and running experiments.
- `README.md`: Project overview and instructions.
To run the code in this repository, you need the following dependencies:
- Python 3.8+
- PyTorch 1.8+
- Transformers 4.5+
- NumPy
- Pandas
- Scikit-learn
- HuggingFace Datasets
- Jupyter Notebook
- WandB (optional, for experiment tracking)
First, clone the repository:

```bash
git clone https://github.com/enoreese/remote-homology-llm-lora.git
cd remote-homology-llm-lora
```

Then install the required packages:

```bash
pip install -r requirements.txt
```
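For reference, a `requirements.txt` along these lines would cover the dependencies listed above (the version pins are illustrative lower bounds taken from that list, not from the repo):

```
torch>=1.8
transformers>=4.5
numpy
pandas
scikit-learn
datasets
notebook
wandb  # optional, for experiment tracking
```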
Download and preprocess the dataset using the provided scripts. Ensure that the data is placed in the `data/` directory:
```bash
python scripts/SCOP_processing.py
```
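The exact records that `scripts/SCOP_processing.py` emits are defined by the script itself, but each training example pairs two sequences in the prompt layout shown in the usage example further below; a hypothetical helper (names ours) illustrating that layout:

```python
def make_pair_prompt(seq_i: str, seq_j: str) -> str:
    """Format a sequence pair in the [Determine Homology] prompt layout
    shown in the usage example below."""
    return (
        "[Determine Homology]\n"
        f"SeqPiFamily={seq_i}\n"
        f"SeqPjFamily={seq_j}\n"
    )
```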
Train the model using the provided training script. You can customize the training parameters in the configuration file located in the `config/` directory:
```bash
modal run scripts/finetune.py::finetune
```
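`scripts/finetune.py` is the source of truth for the training setup; as a rough orientation, attaching LoRA adapters for sequence classification with the HuggingFace PEFT library looks like the sketch below (the model path and hyperparameters are illustrative, not taken from the repo):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

# Binary head: remote homolog vs. not
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/base/plm", num_labels=2
)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence-classification task
    r=8,                         # adapter rank (illustrative)
    lora_alpha=16,
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```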
Evaluate the fine-tuned model on the validation dataset.
```bash
modal run scripts/evaluate.py::evaluate
```
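`scripts/evaluate.py` defines the actual evaluation; for quick standalone checks, the usual pair-classification metrics can be computed with scikit-learn (already listed among the dependencies). A sketch with placeholder labels and scores:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# y_true: 1 = remote homolog pair, 0 = non-homolog; y_score: model probability
y_true = [1, 0, 1, 0]
y_score = [0.91, 0.12, 0.55, 0.40]
y_pred = [int(s >= 0.5) for s in y_score]

print("accuracy:", accuracy_score(y_true, y_pred))
print("AUROC:   ", roc_auc_score(y_true, y_score))
```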
We provide fine-tuned models for remote homology detection that you can download and use in your research.
You can load these models using the following code:
```python
from transformers import TextClassificationPipeline, AutoTokenizer, AutoModelForSequenceClassification

# A sequence pair in the prompt format used for fine-tuning: the
# [Determine Homology] task tag followed by the two tagged sequences.
prompt = """
[Determine Homology]
SeqPiFamily=KADPCLTFNPDKCQLSFQPDGNRCAVLIKCGWECQSVAIQYKNKTRNNTLASTWQPGDPEWYTVSVPGADGFLRTVNNTFIFEHMCNTAMFMSRQYHMWPPRK
SeqPjFamily=QKLNLMQQTMSFLTHDLTQMMPRPVRGDQGQREPALLAGAGVLASESEGMRFVRGGVVNPLMRLPRSNLLTVGYRIHDGYLERLAWPLTDAAGSVKPTMQKLIPADSLRLQFYDGTRWQESWSSVQAIPVAVRMTLHSPQWGEIERIWLLRGPQ
"""

tokenizer = AutoTokenizer.from_pretrained('path/to/pretrained/model')
model = AutoModelForSequenceClassification.from_pretrained('path/to/pretrained/model')
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer)
prediction = pipe(prompt, return_all_scores=True)  # scores for every label
```
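With `return_all_scores=True`, the pipeline returns a score for every label rather than just the top one, so `prediction` holds a list of `{'label': ..., 'score': ...}` entries for the input; the highest-scoring label gives the homology call for the pair.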
We welcome contributions from the community. If you have suggestions or improvements, please open an issue or submit a pull request.
- Fork the repository.
- Create a new branch:
  ```bash
  git checkout -b feature/your-feature-name
  ```
- Make your changes and commit them:
  ```bash
  git commit -m 'Add some feature'
  ```
- Push to the branch:
  ```bash
  git push origin feature/your-feature-name
  ```
- Open a pull request.
This project is licensed under the MIT License. See the `LICENSE` file for more details.
We would like to thank the authors of PLMs and the HuggingFace Transformers library for their contributions to the open-source community. This research is built upon their work.
For any questions or issues, please contact [[email protected]].