Vision
Open Protein Instructions (OPI) is the first part of the Open Biology Instructions (OBI) project, to be followed by Open Molecule Instructions (OMI), Open DNA Instructions (ODI), Open RNA Instructions (ORI), and Open Single-cell Instructions (OSCI). OBI aims to fully leverage the potential of Large Language Models (LLMs), especially scientific LLMs such as Galactica, to facilitate research in the AI for Life Science community. While OBI is still at an early stage, we hope to provide a starting point for the community to bridge LLMs and biological domain knowledge.
This work has been accepted at the NeurIPS 2024 Workshop: Foundation Models for Science: Progress, Opportunities, and Challenges.
OPI Dataset
OPI-Llama-3.1-8B-Instruct
OPI-Galactica-6.7B
- Project Overview
- OPI dataset construction pipeline
- OPI dataset overview
- OPEval: Nine evaluation tasks using the OPI dataset
- Instruction tuning with OPI training data
- Evaluating with OPI testing data
- Evaluation results
- Prediction comparison with SOTA models
- Demo
- Acknowledgement
- Contact Information
This repo is for the Open Protein Instructions (OPI) project, aiming to build and release a high-quality and comprehensive protein instruction dataset with which LLMs can be adapted to protein-related tasks via instruction tuning and evaluated on these tasks.
Usage and license notices: Galactica is intended and licensed for research use only. Llama-3 is licensed for researchers and commercial entities, upholding the principles of openness. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes. The weight diff for Stanford Alpaca is also CC BY NC 4.0 (allowing only non-commercial use).
We curated the OPI dataset ourselves by extracting key information from the Swiss-Prot database. The following figure shows the overall construction process of OPI.
- An example of OPI training data:
instruction:
What is the EC classification of the input protein sequence based on its biological function?
input:
MGLVSSKKPDKEKPIKEKDKGQWSPLKVSAQDKDAPPLPPLVVFNHLTPPPPDEHLDEDKHFVVALYDYTAMNDRDLQMLKGEKLQVLKGTGDWWLARS
LVTGREGYVPSNFVARVESLEMERWFFRSQGRKEAERQLLAPINKAGSFLIRESETNKGAFSLSVKDVTTQGELIKHYKIRCLDEGGYYISPRITFPSL
QALVQHYSKKGDGLCQRLTLPCVRPAPQNPWAQDEWEIPRQSLRLVRKLGSGQFGEVWMGYYKNNMKVAIKTLKEGTMSPEAFLGEANVMKALQHERLV
RLYAVVTKEPIYIVTEYMARGCLLDFLKTDEGSRLSLPRLIDMSAQIAEGMAYIERMNSIHRDLRAANILVSEALCCKIADFGLARIIDSEYTAQEGAK
FPIKWTAPEAIHFGVFTIKADVWSFGVLLMEVVTYGRVPYPGMSNPEVIRNLERGYRMPRPDTCPPELYRGVIAECWRSRPEERPTFEFLQSVLEDFYT
ATERQYELQP
output:
2.7.10.2
- An example of OPI testing data:
{"id": "seed_task_0", "name": "EC number of price dataset from CLEAN", "instruction":
"Return the EC number of the protein sequence.", "instances": [{"input":
"MAIPPYPDFRSAAFLRQHLRATMAFYDPVATDASGGQFHFFLDDGTVYNTHTRHLVSATRFVVTHAMLYRTTGEARYQVGMRHALEFLRTAFLDPATGGY
AWLIDWQDGRATVQDTTRHCYGMAFVMLAYARAYEAGVPEARVWLAEAFDTAEQHFWQPAAGLYADEASPDWQLTSYRGQNANMHACEAMISAFRATGERR
YIERAEQLAQGICQRQAALSDRTHAPAAEGWVWEHFHADWSVDWDYNRHDRSNIFRPWGYQVGHQTEWAKLLLQLDALLPADWHLPCAQRLFDTAVERGWD
AEHGGLYYGMAPDGSICDDGKYHWVQAESMAAAAVLAVRTGDARYWQWYDRIWAYCWAHFVDHEHGAWFRILHRDNRNTTREKSNAGKVDYHNMGACYDVL
LWALDAPGFSKESRSAALGRP", "output": "5.3.1.7"}], "is_classification": false}
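The testing format above nests each input/output pair under `instances`, whereas the training format is flat. A minimal sketch of reading one testing record into flat `(instruction, input, output)` triples follows; the helper name `flatten_test_record` is hypothetical and not part of the repository, and the sequence is truncated here purely for brevity.

```python
import json

# One line in the testing format shown above.
# Assumption: the protein sequence is truncated for illustration only.
line = (
    '{"id": "seed_task_0", "name": "EC number of price dataset from CLEAN", '
    '"instruction": "Return the EC number of the protein sequence.", '
    '"instances": [{"input": "MAIPPYPDFRSAAFLRQHLRATMAFYDP", "output": "5.3.1.7"}], '
    '"is_classification": false}'
)


def flatten_test_record(record):
    """Turn one testing record into a list of (instruction, input, output) triples."""
    return [
        (record["instruction"], instance["input"], instance["output"])
        for instance in record["instances"]
    ]


record = json.loads(line)
triples = flatten_test_record(record)
for instruction, sequence, answer in triples:
    print(instruction, "|", sequence[:10] + "...", "|", answer)
```

For a full `.jsonl` test file, the same helper can be applied to each non-empty line in turn.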
We are excited to announce the release of the OPI dataset, a curated collection of instructions covering 9 tasks for adapting LLMs to protein biology. The dataset is designed to advance LLM-driven research in the field of protein biology. We welcome contributions and enhancements to this dataset from the community. There are 1.64M samples in the OPI dataset, split into training (1,615,661) and testing (26,607) sets.
Accessing the OPI dataset: The complete OPI dataset can be accessed from Hugging Face. It is organized into three subfolders (AP, KM, and SU) in the OPI_DATA directory, along with the full training file OPI_full_1.61M_train.json. Once downloaded, place all the subfolders and data files in the OPI_DATA folder within the repository. To merge all or several of the per-task training files into a single training file, run:
cd OPI_DATA
python merge_task_train_data.py --output OPI_merged_train.json
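For readers who want to merge only a hand-picked subset of tasks, the idea can be sketched in a few lines of Python. This is not the repository's `merge_task_train_data.py`; it is an illustrative stand-in that assumes each training file is a JSON array of records, and the function name `merge_train_files` is hypothetical.

```python
import json


def merge_train_files(paths, output_path):
    """Concatenate several JSON-array training files into one file.

    Assumption: every input file holds a JSON array of instruction records.
    Returns the number of records written.
    """
    merged = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            records = json.load(f)
        if not isinstance(records, list):
            raise ValueError(f"{path} does not contain a JSON array")
        merged.extend(records)
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False)
    return len(merged)
```

A call such as `merge_train_files(["SU/EC_number/train/CLEAN_EC_number_train.json", "SU/Fold_type/train/fold_type_train.json"], "OPI_merged_train.json")` would then combine two of the task files listed in the folder structure.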
OPI Dataset folder structure:
./OPI_DATA/
├── SU
│   ├── EC_number
│   │   ├── test
│   │   │   ├── CLEAN_EC_number_new_test.jsonl
│   │   │   └── CLEAN_EC_number_price_test.jsonl
│   │   └── train
│   │       └── CLEAN_EC_number_train.json
│   ├── Fold_type
│   │   ├── test
│   │   │   └── fold_type_test.jsonl
│   │   └── train
│   │       └── fold_type_train.json
│   └── Subcellular_localization
│       ├── test
│       │   └── subcell_loc_test.jsonl
│       └── train
│           └── subcell_loc_train.json
├── AP
│   ├── Keywords
│   │   ├── test
│   │   │   ├── CASPSimilarSeq_keywords_test.jsonl
│   │   │   ├── IDFilterSeq_keywords_test.jsonl
│   │   │   └── UniProtSeq_keywords_test.jsonl
│   │   └── train
│   │       └── keywords_train.json
│   ├── GO
│   │   ├── test
│   │   │   ├── CASPSimilarSeq_go_terms_test.jsonl
│   │   │   ├── IDFilterSeq_go_terms_test.jsonl
│   │   │   └── UniProtSeq_go_terms_test.jsonl
│   │   └── train
│   │       └── go_terms_train.json
│   └── Function
│       ├── test
│       │   ├── CASPSimilarSeq_function_test.jsonl
│       │   ├── IDFilterSeq_function_test.jsonl
│       │   └── UniProtSeq_function_test.jsonl
│       └── train
│           └── function_train.json
└── KM
    ├── gSymbol2Tissue
    │   ├── test
    │   │   └── gene_symbol_to_tissue_test.jsonl
    │   └── train
    │       └── gene_symbol_to_tissue_train.json
    ├── gSymbol2Cancer
    │   ├── test
    │   │   └── gene_symbol_to_cancer_test.jsonl
    │   └── train
    │       └── gene_symbol_to_cancer_train.json
    └── gName2Cancer
        ├── test
        │   └── gene_name_to_cancer_test.jsonl
        └── train
            └── gene_name_to_cancer_train.json
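Because every task follows the same `TYPE/TASK/{train,test}` layout, the data files can be discovered programmatically rather than hard-coded. The sketch below assumes that layout holds for the downloaded copy; `index_opi_data` is a hypothetical helper, not a function provided by the repository.

```python
from pathlib import Path


def index_opi_data(root):
    """Map 'TYPE/TASK' to the train and test file names found under root.

    Assumption: files live at root/TYPE/TASK/{train,test}/*.json or *.jsonl.
    """
    index = {}
    for split_dir in Path(root).glob("*/*/*"):
        # Only the train/ and test/ directories are of interest.
        if not split_dir.is_dir() or split_dir.name not in ("train", "test"):
            continue
        task_key = f"{split_dir.parent.parent.name}/{split_dir.parent.name}"
        files = sorted(
            p.name for p in split_dir.iterdir() if p.suffix in (".json", ".jsonl")
        )
        index.setdefault(task_key, {"train": [], "test": []})[split_dir.name] = files
    return index
```

Calling `index_opi_data("./OPI_DATA")` would return one entry per task, for example a key `"SU/EC_number"` with its training and testing file names.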
To assess the effectiveness of instruction tuning with the OPI dataset, we developed OPEval, which comprises three categories of evaluation tasks. Each category includes three specific tasks. The table below outlines the task types, names, and the corresponding sizes of the training and testing sets.
Task Type | Type Abbr. | Task Name | Task Abbr. | Training set size | Testing set size |
---|---|---|---|---|---|
Sequence Understanding | SU | EC Number Prediction | EC_number | 74,487 | 392 (NEW-392), 149 (Price-149) |
Sequence Understanding | SU | Fold Type Prediction | Fold_type | 12,312 | 718 (Fold), 1,254 (Superfamily), 1,272 (Family) |
Sequence Understanding | SU | Subcellular Localization Prediction | Subcellular_localization | 11,230 | 2,772 |
Annotation Prediction | AP | Function Keywords Prediction | Keywords | 451,618 | 184 (CASPSimilarSeq), 1,112 (IDFilterSeq), 4,562 (UniProtSeq) |
Annotation Prediction | AP | Gene Ontology (GO) Terms Prediction | GO | 451,618 | 184 (CASPSimilarSeq), 1,112 (IDFilterSeq), 4,562 (UniProtSeq) |
Annotation Prediction | AP | Function Description Prediction | Function | 451,618 | 184 (CASPSimilarSeq), 1,112 (IDFilterSeq), 4,562 (UniProtSeq) |
Knowledge Mining | KM | Tissue Location Prediction from Gene Symbol | gSymbol2Tissue | 8,723 | 2,181 |
Knowledge Mining | KM | Cancer Prediction from Gene Symbol | gSymbol2Cancer | 590 | 148 |
Knowledge Mining | KM | Cancer Prediction from Gene Name | gName2Cancer | 590 | 148 |
Instruction tuning procedures are available in the instruction_tuning guide.
Accessing the OPI-Tuned Models: We have released the OPI-Llama-3.1-8B-Instruct and OPI-Galactica-6.7B models fine-tuned using OPI_full_1.61M_train.json, which can be accessed from Hugging Face.
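To illustrate how an instruction/input/output record might be turned into a single prompt for tuning or inference, here is a minimal sketch. It assumes the Stanford Alpaca prompt template for examples with an input field; whether the OPI training scripts use exactly this wording is an assumption, so consult the instruction_tuning guide for the authoritative format. `build_prompt` is a hypothetical helper.

```python
# Assumption: Stanford Alpaca's template for examples that include an input.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)


def build_prompt(example):
    """Format one OPI-style record as a prompt; the model completes the response."""
    return PROMPT_WITH_INPUT.format(
        instruction=example["instruction"], input=example["input"]
    )


# Record in the training format shown earlier (sequence truncated for brevity).
example = {
    "instruction": "What is the EC classification of the input protein "
    "sequence based on its biological function?",
    "input": "MGLVSSKKPDKEKPIKEKDKGQWSPLKVSAQDKDAPPLPP",
    "output": "2.7.10.2",
}
prompt = build_prompt(example)
print(prompt)
```

During tuning, the `output` field would be the target text appended after the `### Response:` marker.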
Evaluation procedures are outlined in the evaluation guide.
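As a rough illustration of scoring predictions against the `output` field of the testing data, the sketch below computes exact-match accuracy. This is a deliberately simple stand-in, not the project's metric: the metrics actually used per task are defined in the evaluation guide, and `exact_match_accuracy` is a hypothetical helper.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# Toy values for illustration only; these are not real model outputs.
score = exact_match_accuracy(["5.3.1.7", "2.7.10.1"], ["5.3.1.7", "2.7.10.2"])
print(score)
```

Exact match suits single-label outputs such as an EC number; tasks with multiple labels or free-text descriptions need other measures.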
Comprehensive evaluation results are detailed in the evaluation_results document.
Predictions from the OPI-tuned model, GPT-4o, Llama-3.1-8B-Instruct, and Claude 3.5 Sonnet are compared with the ground-truth answers in the model_compare document.
We use the FastChat platform to visually demonstrate the capabilities of the OPI-Galactica-6.7B model on various evaluation tasks.
The code is adapted from Stanford Alpaca.
Some code is adapted from Chinese-LLaMA-Alpaca.
Llama-3: Llama-3
Galactica: Galactica
For help or issues using this repo, please submit a GitHub issue.
For other communications, please contact Qiwei Ye ([email protected]).