GenePT is a single-cell foundation model that leverages ChatGPT embeddings to tackle gene-level and cell-level biology tasks. This project is motivated by the significant recent progress in using large-scale (e.g., tens of millions of cells) gene expression data to develop foundation models for single-cell biology. These models implicitly learn gene and cellular functions from the gene expression profiles, which requires extensive data curation and resource-intensive training. By contrast, GenePT offers a complementary approach by using NCBI text descriptions of individual genes with GPT-3.5 to generate gene embeddings. From there, GenePT generates single-cell embeddings in two ways: (i) by averaging the gene embeddings, weighted by each gene’s expression level; or (ii) by creating a sentence embedding for each cell, using gene names ordered by the expression level.
Because it requires neither dataset curation nor additional pretraining, GenePT is efficient and easy to use. On many downstream tasks used to evaluate recent single-cell foundation models (e.g., classifying gene properties and cell types), GenePT achieves performance comparable to, and often better than, that of existing single-cell foundation models. GenePT demonstrates that large language model embeddings of the literature are a simple and effective path to biological foundation models.
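The two cell-embedding strategies described above can be illustrated with a minimal sketch. It is not the code from the analysis notebooks: the function names, the toy genes and vectors, the unit-norm rescaling, and the choice to drop unexpressed genes from the cell sentence are all assumptions made for illustration.

```python
import numpy as np


def weighted_cell_embeddings(expression, gene_embeddings):
    """Strategy (i): expression-weighted average of gene embeddings, one row per cell.

    expression:      (n_cells, n_genes) non-negative expression matrix
    gene_embeddings: (n_genes, dim) matrix whose rows match the gene columns
    """
    expression = np.asarray(expression, dtype=float)
    gene_embeddings = np.asarray(gene_embeddings, dtype=float)
    # Convert each cell's expression values into weights that sum to 1
    # (cells with zero total expression keep all-zero weights).
    totals = expression.sum(axis=1, keepdims=True)
    weights = np.divide(expression, totals, out=np.zeros_like(expression), where=totals > 0)
    cell_emb = weights @ gene_embeddings
    # Assumption: rescale to unit L2 norm so cells can be compared by cosine similarity.
    norms = np.linalg.norm(cell_emb, axis=1, keepdims=True)
    return np.divide(cell_emb, norms, out=np.zeros_like(cell_emb), where=norms > 0)


def cell_sentence(expression_row, gene_names):
    """Strategy (ii), first half: gene names ordered from highest to lowest expression.

    Assumption: genes with zero expression are left out of the sentence.
    """
    row = np.asarray(expression_row, dtype=float)
    order = np.argsort(-row, kind="stable")
    return " ".join(gene_names[i] for i in order if row[i] > 0)


# Toy data: 2 cells x 3 genes, with made-up 2-dimensional "gene embeddings".
genes = ["GATA1", "CD3E", "INS"]
gene_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
expr = np.array([[0.0, 3.0, 1.0], [2.0, 0.0, 0.0]])

cells = weighted_cell_embeddings(expr, gene_emb)
sentences = [cell_sentence(row, genes) for row in expr]
print(cells.shape, sentences)
```

For strategy (ii), each resulting sentence would then be passed to the same text embedding model that is used for the gene descriptions.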
The analysis scripts used to generate GenePT data and to reproduce the analysis in the paper can be found in the repo (with details for each script in the Breakdown of analysis files section below).
We also provide the following readily available datasets, which might be useful for a broader range of applications:
- Extracted summary texts scraped from the NCBI page for each gene.
- Pre-computed GPT-3.5 embeddings (text-embedding-ada-002) for each gene. These data are deposited at .
We provide example notebooks to run the following analyses:
- Gene-level prediction tasks
- Gene-gene interaction analysis
- Cell-level biological data annotation
- Batch effect removal (Cardiomyocyte dataset; Aorta dataset)
Please file an issue if you have a request for a tutorial that is not currently included.
If you use GenePT for your analysis, please cite our manuscript:
Chen YT, Zou J. (2023+) GenePT: A Simple But Hard-to-Beat Foundation Model for Genes and Cells Built From ChatGPT. bioRxiv preprint: https://www.biorxiv.org/content/10.1101/2023.10.16.562533v1.
Breakdown of analysis files:
- request_ncbi_text_for_genes.ipynb provides example code to download the NCBI gene summary pages.
- gene_embeddings_examples.ipynb provides example code to embed the descriptions extracted by request_ncbi_text_for_genes.ipynb using GPT-3.5 embeddings. Note that this requires a registered OpenAI API account (see instructions and pricing details at https://openai.com/blog/openai-api).
- gene_level_task_table_1.ipynb reproduces the gene-level tasks for GenePT embeddings in Table 1 of the paper.
- gene_level_task_figure_2.ipynb provides the necessary data and output for the gene-level tasks described in Figure 2 of our paper.
- aorta_data_analysis.ipynb provides example code to create foundation-model cell embeddings in fewer than 20 lines of code and demonstrates batch effect removal with preservation of biological signal (Figure 4 of the paper).
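As a rough illustration of the embedding step covered by gene_embeddings_examples.ipynb, the sketch below sends gene descriptions to an embeddings endpoint in batches. The helper name `embed_descriptions`, the batch size, and the `{gene: description}` input format are assumptions for this example, not the notebook's actual code. The `client` argument is expected to behave like `openai.OpenAI()` from the `openai` Python package (version 1.0 or later).

```python
def embed_descriptions(client, descriptions, model="text-embedding-ada-002", batch_size=100):
    """Return {gene: embedding} for a {gene: description} mapping.

    `client` must expose `client.embeddings.create(model=..., input=[...])`,
    as the official `openai` client does.
    """
    genes = list(descriptions)
    embeddings = {}
    for start in range(0, len(genes), batch_size):
        batch = genes[start:start + batch_size]
        response = client.embeddings.create(
            model=model,
            input=[descriptions[gene] for gene in batch],
        )
        # Each returned item records the index of its input text,
        # so sort by that index before pairing items with gene names.
        ordered = sorted(response.data, key=lambda item: item.index)
        for gene, item in zip(batch, ordered):
            embeddings[gene] = item.embedding
    return embeddings


# Usage (needs the `openai` package and an OPENAI_API_KEY environment variable):
#   from openai import OpenAI
#   vectors = embed_descriptions(OpenAI(), {"TP53": "<NCBI summary text for TP53>"})
```

Passing the client in as an argument keeps the helper independent of how credentials are configured and makes it easy to test with a stand-in object.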
- For the gene-level tasks, we make use of the following datasets:
  - Geneformer provides the datasets for gene-level tasks reported in Table 1 (https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files/gene_classification)
  - Gene-gene interaction network datasets were collected as part of the Gene2vec paper, available at https://github.com/jingcheng-du/Gene2vec/tree/master/predictionData.
- For the cell-level tasks, we make use of the following datasets:
  - hPancreas, Myeloid, and Multiple Sclerosis (processed and distributed by Cui et al. (2023+))
  - Cardiomyocyte dataset:
    - Original data can be downloaded here.
    - We created a random 10% subset of the original dataset for our cell-level analysis, available at this google drive folder.
  - Aorta dataset:
- The initial list of genes was curated from vocab.json (provided by scGPT authors) and token_dictionary.pkl.
- Chaffin, M. et al. Single-nucleus profiling of human dilated and hypertrophic cardiomyopathy. Nature 608, 174–180 (2022).
- Chen YT, Zou J. (2023+) GenePT: A Simple But Hard-to-Beat Foundation Model for Genes and Cells Built From ChatGPT. bioRxiv preprint: https://www.biorxiv.org/content/10.1101/2023.10.16.562533v1.
- Cui, H., et al. (2023). scGPT: Towards building a foundation model for single-cell multi-omics using generative AI. bioRxiv, 2023-04.
- Li, Y. et al. Single-cell transcriptome analysis reveals dynamic cell populations and differential gene expression patterns in control and aneurysmal human aortic tissue. Circulation 142, 1374–1388 (2020).