A collection of awesome bio-foundation models, including protein, RNA, DNA, gene, single-cell, and so on.

ekkkkki/Awesome-Bio-Foundation-Models

🧬📝 Awesome Bio-Foundation Models

This repository collects awesome bio-foundation modeling papers across several domains: DNA, RNA, gene, protein, single-cell, and multimodalities.

🌟 If you'd like to add a paper or resource, feel free to submit a pull request or open an issue.

Models

Each entry may be followed by up to three badges, in this order:

  • Paper: the paper publisher, with a link to the paper

  • Code: a link to the code

  • Model: a link to the model

Papers are listed in chronological order.

DNA & Gene

  • (Enformer) Effective gene expression prediction from sequence by integrating long-range interactions
    Paper · Code

  • MoDNA: motif-oriented pre-training for DNA language model
    Paper

  • Obtaining genetics insights from deep learning via explainable artificial intelligence
    Paper

  • Deciphering microbial gene function using natural language processing
    Paper · Code

  • MoDNA: Motif-Oriented Pre-training For DNA Language Model
    Paper

  • To Transformers and Beyond: Large Language Models for the Genome
    Paper

  • GeneCompass: Deciphering Universal Gene Regulatory Mechanisms with Knowledge-Informed Cross-Species Foundation Model
    Paper

  • HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution
    Paper · Code · Model

  • (GeneFormer) Transfer learning enables predictions in network biology
    Paper · Model

  • DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome
    Paper · Code · Model

  • Species-aware DNA language modeling
    Paper · Code

  • DNA language models are powerful predictors of genome-wide variant effects
    Paper · Code

  • GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction
    Paper · Code

  • GENA-LM: A Family of Open-Source Foundational DNA Language Models for Long Sequences
    Paper · Code · Model

  • EpiGePT: a Pretrained Transformer model for epigenomics
    Paper · Code · Model

  • DNAGPT: A Generalized Pre-trained Tool for Multiple DNA Sequence Analysis Tasks
    Paper · Code · Model

  • The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics
    Paper · Code · Model

  • DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome
    Paper · Code · Model

  • DNABERT-S: Learning Species-Aware DNA Embedding with Genome Foundation Models
    Paper · Code · Model

  • Single-cell gene expression prediction from DNA sequence at large contexts
    Paper

  • Genomic language model predicts protein co-regulation and function
    Paper · Code · Model

RNA

  • Clustering and classification methods for single-cell RNA-sequencing data
    Paper

  • EMDLP: Ensemble multiscale deep learning model for RNA methylation site prediction
    Paper · Code

  • scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data
    Paper · Code · Model

  • scPML: pathway-based multi-view learning for cell type annotation from single-cell RNA-seq data
    Paper · Code

  • (RNA-FM) Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions
    Paper · Code · Model

  • (RNABERT) Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning
    Paper · Code · Model

  • (MRM-BERT) Prediction of Multiple Types of RNA Modifications via Biological Language Model
    Paper

  • (SpliceBERT) Self-supervised learning on millions of pre-mRNA sequences improves sequence-based RNA splicing prediction
    Paper · Code · Model

  • UNI-RNA: universal pre-trained models revolutionize RNA research
    Paper

  • A Deep Dive into Single-Cell RNA Sequencing Foundation Models
    Paper · Code

  • xTrimoGene: An Efficient and Scalable Representation Learner for Single-Cell RNA-Seq Data
    Paper

  • trRosettaRNA: automated prediction of RNA 3D structure with transformer network
    Paper

  • (RfamGen) Deep generative design of RNA family sequences
    Paper · Code

  • (RNA-MFM) Multiple sequence alignment-based RNA language model and its application to structural inference
    Paper

  • RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks
    Paper · Code · Model

  • ATOM-1: A Foundation Model for RNA Structure and Function Built on Chemical Mapping Data
    Paper

  • RNAformer: A Simple Yet Effective Deep Learning Model for RNA Secondary Structure Prediction
    Paper · Code · Model

  • scDisInFact: disentangled learning for integration and prediction of multi-batch multi-condition single-cell RNA-sequencing data
    Paper · Code

  • CellPLM: Pre-training of Cell Language Model Beyond Single Cells
    Code

  • Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis
    Paper · Code

  • A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions
    Paper · Code

  • ERNIE-RNA: An RNA Language Model with Structure-enhanced Representations
    Paper · Code

Protein

  • Parapred: antibody paratope prediction using convolutional and recurrent neural networks
    Paper · Code

  • (UniRep) Unified rational protein engineering with sequence-based deep representation learning
    Paper · Code

  • (TAPE) Evaluating Protein Transfer Learning with TAPE
    Paper · Code

  • ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing
    Paper · Code · Model

  • (ESM) Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences
    Paper · Code · Model

  • (ESM-1v) Language models enable zero-shot prediction of the effects of mutations on protein function
    Paper · Code

  • (IgLM) Generative language modeling for antibody design
    Paper · Code

  • (ESM-2 & ESMFold) Language models of protein sequences at the scale of evolution enable accurate structure prediction
    Code · Model

  • ProtGPT2 is a deep unsupervised language model for protein design
    Paper

  • ProteinBERT: a universal deep-learning model of protein sequence and function
    Paper · Code · Model

  • OntoProtein: Protein Pretraining With Gene Ontology Embedding
    Paper · Code

  • (AntiBERTa) Deciphering the language of antibodies using self-supervised learning
    Paper · Code

  • AbLang: an antibody language model for completing antibody sequences
    Paper · Code

  • ProGen2: Exploring the boundaries of protein language models
    Paper · Code

  • SaProt: Protein Language Modeling with Structure-aware Vocabulary
    Paper · Code · Model

  • Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling
    Paper · Code · Model

  • (GearNet) Protein Representation Learning by Geometric Structure Pretraining
    Paper

  • ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts
    Paper · Code

  • Efficient evolution of human antibodies from general protein language models
    Paper · Code

  • (CARP) Convolutions are competitive with transformers for protein sequence pretraining
    Paper · Code

  • (HelixFold-Single) A method for multiple-sequence-alignment-free protein structure prediction using a protein language model
    Paper · Code

  • (ABGNN) Pre-training Antibody Language Models for Antigen-Specific Computational Antibody Design
    Paper · Code · Model

  • (ReprogBert) Reprogramming Pretrained Language Models for Antibody Sequence Infilling
    Paper · Code

  • ProteinFlow: a Python Library to Pre-Process Protein Structure Data for Deep Learning Applications
    Paper · Code

  • xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein
    Paper

  • ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing
    Paper · Code · Model

  • (ESM-GearNet) A Systematic Study of Joint Representation Learning on Protein Sequences and Structures
    Paper · Code

  • (ProteinINR) Pre-training Sequence, Structure, and Surface Features for Comprehensive Protein Representation Learning
    Code

  • (CaLM) Codon language embeddings provide strong signals for use in protein engineering
    Paper · Code

  • (DeepGo) Protein function prediction as approximate semantic entailment
    Paper · Code

  • PLMSearch: Protein language model powers accurate and fast sequence search for remote homology
    Paper · Code

  • Genomic language model predicts protein co-regulation and function
    Paper · Code · Model

  • Prot2Token: A multi-task framework for protein language processing using autoregressive language modeling
    Paper · Code

  • (ESM-3) Simulating 500 million years of evolution with a language model
    Paper · Code

  • Training Compute-Optimal Protein Language Models
    Paper

  • Fine-tuning protein language models boosts predictions across diverse tasks
    Paper · Code

  • Contextual AI models for single-cell protein biology
    Paper · Code

  • Sequence-to-sequence translation from mass spectra to peptides with a transformer model
    Paper · Code

  • OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization
    Paper · Code

  • ProteinInvBench: Benchmarking Protein Inverse Folding on Diverse Tasks, Models, and Metrics
    Paper · Code

  • ProteinShake: Building datasets and benchmarks for deep learning on protein structures
    Paper · Code

  • ProteinBench: A Holistic Evaluation of Protein Foundation Models
    Paper

  • An end-to-end framework for the prediction of protein structure and fitness from single sequence
    Paper · Code

Protein foundation models are a hot topic; more papers can be found in

Single-cell

  • (DCell) Using deep learning to model the hierarchical structure and function of a cell
    Paper · Code

  • scVAE: variational auto-encoders for single-cell gene expression data
    Paper · Code

  • A sandbox for prediction and integration of DNA, RNA, and proteins in single cells
    Paper

  • scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data
    Paper · Code · Model

  • scPML: pathway-based multi-view learning for cell type annotation from single-cell RNA-seq data
    Paper · Code

  • (scFoundation) Large Scale Foundation Model on Single-cell Transcriptomics
    Paper · Code · Model

  • (DPI) Modeling and analyzing single-cell multimodal data with deep parametric inference
    Paper · Code

  • (ScPROTEIN) A Versatile Deep Graph Contrastive Learning Framework for Single-cell Proteomics Embedding
    Paper · Code

  • scGPT: toward building a foundation model for single-cell multi-omics using generative AI
    Paper · Code · Model

  • scMulan: a multitask generative pre-trained language model for single-cell analysis
    Paper · Code · Model

  • scDiffusion: conditional generation of high-quality single-cell data using diffusion model
    Paper · Code · Model

  • Cell2Sentence: Teaching Large Language Models the Language of Biology
    Paper · Code

  • CellPLM: Pre-training of Cell Language Model Beyond Single Cells
    Paper · Code

  • Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis
    Paper · Code

Multimodalities

  • (DPI) Modeling and analyzing single-cell multimodal data with deep parametric inference
    Paper · Code

  • Pretraining model for biological sequence data
    Paper

  • BioSeq-BLM: a platform for analyzing DNA, RNA and protein sequences based on biological language models
    Paper

  • A sandbox for prediction and integration of DNA, RNA, and proteins in single cells
    Paper

  • Galactica: A Large Language Model for Science
    Paper · Code · Model

  • BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations
    Paper · Code · Model

  • DARWIN Series: Domain Specific Large Language Models for Natural Science
    Paper · Code · Model

  • (scMoFormer) Single-Cell Multimodal Prediction via Transformers
    Paper · Code

  • BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning
    Paper · Code · Model

  • ChatCell: Facilitating Single-Cell Analysis with Natural Language
    Paper · Code · Model

  • (Evo) Sequence modeling and design from molecular to genome scale with Evo
    Paper · Code · Model

Related Resources

Related Surveys

  • Learning the protein language: Evolution, structure, and function
    Paper

  • Protein Language Models and Structure Prediction: Connection and Progression
    Paper

  • Progress and Opportunities of Foundation Models in Bioinformatics
    Paper

  • Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models
    Paper

  • Applications of transformer-based language models in bioinformatics: a survey
    Paper

  • Best practices for single-cell analysis across modalities
    Paper

  • Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey
    Paper

  • Scientific Large Language Models: A Survey on Biological & Chemical Domains
    Paper

  • Large language models in bioinformatics: applications and perspectives
    Paper

  • Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey
    Paper

  • Machine learning for functional protein design
    Paper

Related Repositories
