This repository contains research code and Jupyter notebooks for detecting remote homology in protein sequences using Protein Language Models (PLMs).
Remote homology detection is a challenging task that involves identifying pairs of proteins with similar structures but low sequence similarity. Specifically, remote homology at the superfamily level involves proteins in the same superfamily but different families. Superfamily membership indicates similar structural characteristics, while family membership indicates high sequence similarity.
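Concretely, SCOP classification codes have the form class.fold.superfamily.family (e.g. `a.1.1.2`), so a pair counts as remote homologs when the codes agree through the superfamily level but differ at the family level. A minimal sketch of this labeling rule (the helper name is ours, not from this repo):

```python
def is_remote_homolog(sccs_a: str, sccs_b: str) -> bool:
    """True if two SCOP codes (class.fold.superfamily.family) share a
    superfamily but belong to different families."""
    a, b = sccs_a.split("."), sccs_b.split(".")
    same_superfamily = a[:3] == b[:3]  # class, fold, and superfamily all match
    same_family = a[:4] == b[:4]       # ...and the family matches too
    return same_superfamily and not same_family

# Same superfamily (a.1.1), different families (a.1.1.1 vs. a.1.1.2)
assert is_remote_homolog("a.1.1.1", "a.1.1.2")
assert not is_remote_homolog("a.1.1.1", "a.1.1.1")  # same family: not "remote"
```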
The repository is organized as follows:
- `notebooks/`: Contains Jupyter notebooks used for data exploration, model training, and evaluation.
- `scripts/`: Contains Python scripts for preprocessing data, training models, and running experiments.
- `README.md`: Project overview and instructions.
To run the code in this repository, you need the following dependencies:
- Python 3.8+
- PyTorch 1.8+
- Transformers 4.5+
- NumPy
- Pandas
- Scikit-learn
- HuggingFace Datasets
- Jupyter Notebook
- WandB (optional, for experiment tracking)
First, clone the repository:

```bash
git clone https://github.com/enoreese/remote-homology-llm-lora.git
cd remote-homology-llm-lora
```

Then install the required packages:

```bash
pip install -r requirements.txt
```
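For reference, a `requirements.txt` along these lines would cover the dependencies listed above (the version pins are illustrative lower bounds taken from that list, not from the repo):

```
torch>=1.8
transformers>=4.5
numpy
pandas
scikit-learn
datasets
notebook
wandb  # optional, for experiment tracking
```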
Download and preprocess the dataset using the provided scripts. Ensure that the data is placed in the `data/` directory:
```bash
python scripts/SCOP_processing.py
```
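The exact records that `scripts/SCOP_processing.py` emits are defined by the script itself, but each training example pairs two sequences in the prompt layout shown in the usage example further below; a hypothetical helper (names ours) illustrating that layout:

```python
def make_pair_prompt(seq_i: str, seq_j: str) -> str:
    """Format a sequence pair in the [Determine Homology] prompt layout
    shown in the usage example below."""
    return (
        "[Determine Homology]\n"
        f"SeqPiFamily={seq_i}\n"
        f"SeqPjFamily={seq_j}\n"
    )
```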
Train the model using the provided training script. You can customize the training parameters in the configuration file located in the `config/` directory:
```bash
modal run scripts/finetune.py::finetune
```
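`scripts/finetune.py` is the source of truth for the training setup; as a rough orientation, attaching LoRA adapters for sequence classification with the HuggingFace PEFT library looks like the sketch below (the model path and hyperparameters are illustrative, not taken from the repo):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

# Binary head: remote homolog vs. not
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/base/plm", num_labels=2
)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence-classification task
    r=8,                         # adapter rank (illustrative)
    lora_alpha=16,
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```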
Evaluate the fine-tuned model on the validation dataset.
```bash
modal run scripts/evaluate.py::evaluate
```
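`scripts/evaluate.py` defines the actual evaluation; for quick standalone checks, the usual pair-classification metrics can be computed with scikit-learn (already listed among the dependencies). A sketch with placeholder labels and scores:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# y_true: 1 = remote homolog pair, 0 = non-homolog; y_score: model probability
y_true = [1, 0, 1, 0]
y_score = [0.91, 0.12, 0.55, 0.40]
y_pred = [int(s >= 0.5) for s in y_score]

print("accuracy:", accuracy_score(y_true, y_pred))
print("AUROC:   ", roc_auc_score(y_true, y_score))
```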
We provide fine-tuned models for remote homology detection that you can download and use in your research.
You can load these models using the following code:
```python
from transformers import TextClassificationPipeline, AutoTokenizer, AutoModelForSequenceClassification

# A sequence pair in the prompt format used for fine-tuning: the
# [Determine Homology] task tag followed by the two tagged sequences.
prompt = """
[Determine Homology]
SeqPiFamily=KADPCLTFNPDKCQLSFQPDGNRCAVLIKCGWECQSVAIQYKNKTRNNTLASTWQPGDPEWYTVSVPGADGFLRTVNNTFIFEHMCNTAMFMSRQYHMWPPRK
SeqPjFamily=QKLNLMQQTMSFLTHDLTQMMPRPVRGDQGQREPALLAGAGVLASESEGMRFVRGGVVNPLMRLPRSNLLTVGYRIHDGYLERLAWPLTDAAGSVKPTMQKLIPADSLRLQFYDGTRWQESWSSVQAIPVAVRMTLHSPQWGEIERIWLLRGPQ
"""

tokenizer = AutoTokenizer.from_pretrained('path/to/pretrained/model')
model = AutoModelForSequenceClassification.from_pretrained('path/to/pretrained/model')
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer)
prediction = pipe(prompt, return_all_scores=True)  # scores for every label
```
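With `return_all_scores=True`, the pipeline returns a score for every label rather than just the top one, so `prediction` holds a list of `{'label': ..., 'score': ...}` entries for the input; the highest-scoring label gives the homology call for the pair.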
We welcome contributions from the community. If you have suggestions or improvements, please open an issue or submit a pull request.
- Fork the repository.
- Create a new branch:
  ```bash
  git checkout -b feature/your-feature-name
  ```
- Make your changes and commit them:
  ```bash
  git commit -m 'Add some feature'
  ```
- Push to the branch:
  ```bash
  git push origin feature/your-feature-name
  ```
- Open a pull request.
This project is licensed under the MIT License. See the `LICENSE` file for more details.
We would like to thank the authors of PLMs and the HuggingFace Transformers library for their contributions to the open-source community. This research is built upon their work.
For any questions or issues, please contact [[email protected]].