This repo is for the Open Protein Instructions (OPI) project, aiming to build and release a high-quality and comprehensive protein instruction dataset with which LLMs can be adapted to protein-related tasks via instruction tuning and evaluated on these tasks.

baaihealth/opi

OPI: An Open Instruction Dataset for Adapting Large Language Models to Protein-Related Tasks

Vision

Open Protein Instructions (OPI) is the first part of the Open Biology Instructions (OBI) project, which will also include Open Molecule Instructions (OMI), Open DNA Instructions (ODI), Open RNA Instructions (ORI), and Open Single-cell Instructions (OSCI). OBI aims to fully leverage the potential of Large Language Models (LLMs), especially scientific LLMs such as Galactica, to facilitate research in the AI for Life Science community. While OBI is still at an early stage, we hope to provide a starting point for the community to bridge LLMs and biological domain knowledge.

Paper

This work has been accepted at the NeurIPS 2024 Workshop on Foundation Models for Science: Progress, Opportunities, and Challenges.

Hugging Face links to OPI dataset and OPI-tuned models

OPI Dataset
OPI-Llama-3.1-8B-Instruct
OPI-Galactica-6.7B

Project Overview

This repo is for the Open Protein Instructions (OPI) project, aiming to build and release a high-quality and comprehensive protein instruction dataset with which LLMs can be adapted to protein-related tasks via instruction tuning and evaluated on these tasks.

Usage and license notices: Galactica is intended and licensed for research use only. Llama-3 is licensed for researchers and commercial entities, upholding the principles of openness. The dataset is licensed under CC BY-NC 4.0 (non-commercial use only), and models trained on the dataset should not be used outside of research purposes. The weight diff for Stanford Alpaca is also licensed under CC BY-NC 4.0 (non-commercial use only).

OPI dataset construction pipeline

We curated the OPI dataset ourselves by extracting key information from the Swiss-Prot database. The following figure shows the overall construction process of OPI.

  • An example of OPI training data:
instruction: 
    What is the EC classification of the input protein sequence based on its biological function?
input:                         
    MGLVSSKKPDKEKPIKEKDKGQWSPLKVSAQDKDAPPLPPLVVFNHLTPPPPDEHLDEDKHFVVALYDYTAMNDRDLQMLKGEKLQVLKGTGDWWLARS
    LVTGREGYVPSNFVARVESLEMERWFFRSQGRKEAERQLLAPINKAGSFLIRESETNKGAFSLSVKDVTTQGELIKHYKIRCLDEGGYYISPRITFPSL
    QALVQHYSKKGDGLCQRLTLPCVRPAPQNPWAQDEWEIPRQSLRLVRKLGSGQFGEVWMGYYKNNMKVAIKTLKEGTMSPEAFLGEANVMKALQHERLV
    RLYAVVTKEPIYIVTEYMARGCLLDFLKTDEGSRLSLPRLIDMSAQIAEGMAYIERMNSIHRDLRAANILVSEALCCKIADFGLARIIDSEYTAQEGAK
    FPIKWTAPEAIHFGVFTIKADVWSFGVLLMEVVTYGRVPYPGMSNPEVIRNLERGYRMPRPDTCPPELYRGVIAECWRSRPEERPTFEFLQSVLEDFYT
    ATERQYELQP
output: 
    2.7.10.2
  • An example of OPI testing data:
{"id": "seed_task_0", "name": "EC number of price dataset from CLEAN", "instruction":
"Return the EC number of the protein sequence.", "instances": [{"input":
"MAIPPYPDFRSAAFLRQHLRATMAFYDPVATDASGGQFHFFLDDGTVYNTHTRHLVSATRFVVTHAMLYRTTGEARYQVGMRHALEFLRTAFLDPATGGY
AWLIDWQDGRATVQDTTRHCYGMAFVMLAYARAYEAGVPEARVWLAEAFDTAEQHFWQPAAGLYADEASPDWQLTSYRGQNANMHACEAMISAFRATGERR
YIERAEQLAQGICQRQAALSDRTHAPAAEGWVWEHFHADWSVDWDYNRHDRSNIFRPWGYQVGHQTEWAKLLLQLDALLPADWHLPCAQRLFDTAVERGWD
AEHGGLYYGMAPDGSICDDGKYHWVQAESMAAAAVLAVRTGDARYWQWYDRIWAYCWAHFVDHEHGAWFRILHRDNRNTTREKSNAGKVDYHNMGACYDVL
LWALDAPGFSKESRSAALGRP", "output": "5.3.1.7"}], "is_classification": false}
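
The testing record above can be read with a few lines of Python. This is a minimal sketch, assuming each line of a test `.jsonl` file is one JSON object with the fields shown in the example (`id`, `name`, `instruction`, `instances`, `is_classification`). The helper name `iter_test_examples` is hypothetical and not part of the repository, and the sequence is shortened here for readability.

```python
import json

# One line of a test .jsonl file, mirroring the example above.
# The input sequence is truncated to its first 32 residues for brevity.
sample_line = json.dumps({
    "id": "seed_task_0",
    "name": "EC number of price dataset from CLEAN",
    "instruction": "Return the EC number of the protein sequence.",
    "instances": [{"input": "MAIPPYPDFRSAAFLRQHLRATMAFYDPVATD", "output": "5.3.1.7"}],
    "is_classification": False,
})


def iter_test_examples(lines):
    """Yield (task_id, instruction, input, output) for every instance in JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for instance in record["instances"]:
            yield record["id"], record["instruction"], instance["input"], instance["output"]


examples = list(iter_test_examples([sample_line]))
task_id, instruction, sequence, label = examples[0]
print(instruction, "->", label)
```

To read a real file, the same function can be given an open file handle, since iterating over a text file yields its lines.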

OPI dataset overview

We are excited to announce the release of the OPI dataset, a curated collection of instructions covering 9 tasks for adapting LLMs to protein biology. The dataset is designed to advance LLM-driven research in protein biology, and we welcome contributions and enhancements from the community. The OPI dataset contains about 1.64M samples in total: 1,615,661 for training and 26,607 for testing.

Accessing the OPI dataset: The complete OPI dataset can be accessed from Hugging Face. It is organized into three subfolders (AP, KM, and SU) in the OPI_DATA directory, plus the full training file OPI_full_1.61M_train.json. Once downloaded, place all the subfolders and data files in the OPI_DATA folder within the repository. To merge all or several of the per-task training files into a single training file, run:

cd OPI_DATA
python merge_task_train_data.py --output OPI_merged_train.json
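
For readers who want to see what such a merge involves, here is a minimal sketch. It is not the repository's `merge_task_train_data.py`, whose options and behavior are not described here. It assumes that each per-task training file is a JSON array of records with `instruction`, `input`, and `output` keys, and that training files live in `train` subfolders. The function name `merge_train_files` and the toy records are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def merge_train_files(data_dir, output_path):
    """Concatenate every train/*.json list found under data_dir into one JSON file."""
    merged = []
    # rglob searches all nested folders; sorting keeps the merge order deterministic.
    for path in sorted(Path(data_dir).rglob("train/*.json")):
        with open(path, encoding="utf-8") as handle:
            merged.extend(json.load(handle))
    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(merged, handle, ensure_ascii=False, indent=2)
    return len(merged)


# Demonstration on a throwaway directory with toy records (not real OPI data).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for task in ("SU/EC_number", "SU/Fold_type"):
        train_dir = root / task / "train"
        train_dir.mkdir(parents=True)
        toy = [{"instruction": "toy instruction", "input": "MKV", "output": "toy"}]
        (train_dir / "toy_train.json").write_text(json.dumps(toy), encoding="utf-8")

    merged_count = merge_train_files(root, root / "OPI_merged_train.json")
    with open(root / "OPI_merged_train.json", encoding="utf-8") as handle:
        merged_records = json.load(handle)

print(merged_count)
```

Because the output file sits at the top of the data directory and not inside a `train` folder, running the sketch again does not fold the merged file back into itself.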

OPI Dataset folder structure:

./OPI_DATA/
├── SU
│   ├── EC_number
│   │   ├── test
│   │   │   ├── CLEAN_EC_number_new_test.jsonl
│   │   │   └── CLEAN_EC_number_price_test.jsonl
│   │   └── train
│   │       └── CLEAN_EC_number_train.json
│   ├── Fold_type
│   │   ├── test
│   │   │   └── fold_type_test.jsonl
│   │   └── train
│   │       └── fold_type_train.json
│   └── Subcellular_localization
│       ├── test
│       │   └── subcell_loc_test.jsonl
│       └── train
│           └── subcell_loc_train.json
├── AP
│   ├── Keywords
│   │   ├── test
│   │   │   ├── CASPSimilarSeq_keywords_test.jsonl
│   │   │   ├── IDFilterSeq_keywords_test.jsonl
│   │   │   └── UniProtSeq_keywords_test.jsonl
│   │   └── train
│   │       └── keywords_train.json
│   ├── GO
│   │   ├── test
│   │   │   ├── CASPSimilarSeq_go_terms_test.jsonl
│   │   │   ├── IDFilterSeq_go_terms_test.jsonl
│   │   │   └── UniProtSeq_go_terms_test.jsonl
│   │   └── train
│   │       └── go_terms_train.json
│   └── Function
│       ├── test
│       │   ├── CASPSimilarSeq_function_test.jsonl
│       │   ├── IDFilterSeq_function_test.jsonl
│       │   └── UniProtSeq_function_test.jsonl
│       └── train
│           └── function_train.json
└── KM
    ├── gSymbol2Tissue
    │   ├── test
    │   │   └── gene_symbol_to_tissue_test.jsonl
    │   └── train
    │       └── gene_symbol_to_tissue_train.json
    ├── gSymbol2Cancer
    │   ├── test
    │   │   └── gene_symbol_to_cancer_test.jsonl
    │   └── train
    │       └── gene_symbol_to_cancer_train.json
    └── gName2Cancer
        ├── test
        │   └── gene_name_to_cancer_test.jsonl
        └── train
            └── gene_name_to_cancer_train.json
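
The layout above follows a regular `<task type>/<task>/<split>/<file>` pattern, which makes it easy to index files by task. The sketch below illustrates this on a few paths copied from the tree. The helper `index_by_task` is hypothetical, and the code assumes every data file sits exactly four levels deep as shown.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# A few paths taken from the folder structure above.
paths = [
    "OPI_DATA/SU/EC_number/test/CLEAN_EC_number_new_test.jsonl",
    "OPI_DATA/SU/EC_number/test/CLEAN_EC_number_price_test.jsonl",
    "OPI_DATA/SU/EC_number/train/CLEAN_EC_number_train.json",
    "OPI_DATA/KM/gName2Cancer/test/gene_name_to_cancer_test.jsonl",
]


def index_by_task(paths):
    """Group file names by (task type, task) and by split (train or test)."""
    index = defaultdict(lambda: {"train": [], "test": []})
    for raw in paths:
        # The last four path components are: task type, task, split, file name.
        task_type, task, split, filename = PurePosixPath(raw).parts[-4:]
        index[(task_type, task)][split].append(filename)
    return index


index = index_by_task(paths)
for (task_type, task), splits in index.items():
    print(task_type, task, len(splits["train"]), "train file(s),", len(splits["test"]), "test file(s)")
```

On a real download, the list of paths could come from `Path("OPI_DATA").rglob("*.json*")` converted to POSIX strings, assuming the files were placed as described above.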

OPEval: Nine evaluation tasks using the OPI dataset

To assess the effectiveness of instruction tuning with the OPI dataset, we developed OPEval, which comprises three categories of evaluation tasks. Each category includes three specific tasks. The table below outlines the task types, names, and the corresponding sizes of the training and testing sets.

| Task Type | Type Abbr. | Task Name | Task Abbr. | Training set size | Testing set size |
|---|---|---|---|---|---|
| Sequence Understanding | SU | EC Number Prediction | EC_number | 74,487 | 392 (NEW-392), 149 (Price-149) |
| Sequence Understanding | SU | Fold Type Prediction | Fold_type | 12,312 | 718 (Fold), 1,254 (Superfamily), 1,272 (Family) |
| Sequence Understanding | SU | Subcellular Localization Prediction | Subcellular_localization | 11,230 | 2,772 |
| Annotation Prediction | AP | Function Keywords Prediction | Keywords | 451,618 | 184 (CASPSimilarSeq), 1,112 (IDFilterSeq), 4,562 (UniProtSeq) |
| Annotation Prediction | AP | Gene Ontology (GO) Terms Prediction | GO | 451,618 | 184 (CASPSimilarSeq), 1,112 (IDFilterSeq), 4,562 (UniProtSeq) |
| Annotation Prediction | AP | Function Description Prediction | Function | 451,618 | 184 (CASPSimilarSeq), 1,112 (IDFilterSeq), 4,562 (UniProtSeq) |
| Knowledge Mining | KM | Tissue Location Prediction from Gene Symbol | gSymbol2Tissue | 8,723 | 2,181 |
| Knowledge Mining | KM | Cancer Prediction from Gene Symbol | gSymbol2Cancer | 590 | 148 |
| Knowledge Mining | KM | Cancer Prediction from Gene Name | gName2Cancer | 590 | 148 |

Instruction tuning with OPI training data

Instruction tuning procedures are available in the instruction_tuning guide.

Accessing the OPI-Tuned Models: We have released the OPI-Llama-3.1-8B-Instruct and OPI-Galactica-6.7B models fine-tuned using OPI_full_1.61M_train.json, which can be accessed from Hugging Face.
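
Since the code is adapted from Stanford Alpaca, one plausible way to turn an OPI record into a model prompt is the Alpaca instruction-with-input template. The sketch below shows that formatting step only. Whether the OPI training scripts use exactly this wording is an assumption, so treat the instruction_tuning guide as the authoritative source. The names `ALPACA_TEMPLATE` and `build_prompt` are illustrative, and the sequence is shortened.

```python
# Alpaca-style prompt for records that carry both an instruction and an input.
# Assumption: OPI formats its records this way; check the instruction_tuning guide.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:"
)


def build_prompt(record):
    """Fill the template with the record's instruction and input sequence."""
    return ALPACA_TEMPLATE.format(
        instruction=record["instruction"],
        input=record["input"],
    )


# Record shaped like the OPI training example, with a truncated sequence.
record = {
    "instruction": "What is the EC classification of the input protein sequence "
                   "based on its biological function?",
    "input": "MGLVSSKKPDKEKPIKEKDKGQWSPLKVSAQDKD",
    "output": "2.7.10.2",
}

prompt = build_prompt(record)
print(prompt)
```

During instruction tuning the model is trained to continue such a prompt with the record's `output`; at inference time the text generated after `### Response:` is the prediction.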

Evaluating with OPI testing data

Evaluation procedures are outlined in the evaluation guide.

Evaluation results

Comprehensive evaluation results are detailed in the evaluation_results document.

Prediction comparison with SOTA models

Predictions by the OPI-tuned model, GPT-4o, Llama-3.1-8B-Instruct, and Claude 3.5 Sonnet, compared against the ground-truth answers, are shown in the model_compare document.

Demo

We use the FastChat platform to visually demonstrate the ability of OPI-Galactica-6.7B model on various evaluation tasks.

OPI Demo

Acknowledgement

The code is adapted from Stanford Alpaca.
Some code is adapted from Chinese-LLaMA-Alpaca.
Llama-3: Llama-3
Galactica: Galactica

Contact Information

For help or issues using this repo, please submit a GitHub issue.
For other communications, please contact Qiwei Ye ([email protected]).
