Evaluation of large language models for discovery of gene set function

Description

Code associated with the paper "Evaluation of large language models for discovery of gene set function".

Dependencies

Set up an environment

conda create -n llm_eval python=3.11.5

Set up an environment variable to store the GPT-4 API key

conda activate llm_eval
conda env config vars set OPENAI_API_KEY="<your api key>" 
conda deactivate  # deactivate, then reactivate so the variable takes effect

conda activate llm_eval
echo $OPENAI_API_KEY  # confirm that the key is set

# in Python
import os
import openai
 
openai.api_key = os.environ["OPENAI_API_KEY"]

See the OpenAI website for best practices for API key safety.
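As a minimal sketch, the key lookup above can be wrapped so that a missing variable produces a readable error instead of a bare `KeyError`. The helper name `load_openai_key` is hypothetical and is not part of this repository.

```python
import os


def load_openai_key(env=None):
    """Return the OpenAI API key from the environment (hypothetical helper)."""
    # Allow a plain dict to be passed in, which makes the helper easy to test.
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; run `conda env config vars set` "
            "and reactivate the environment."
        )
    return key
```

The returned value can then be assigned to `openai.api_key` exactly as in the snippet above.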

Python requirements:

The code was developed using Python 3.11.5.

git clone git@github.com:idekerlab/llm_evaluation_for_gene_set_interpretation.git

cd llm_evaluation_for_gene_set_interpretation

pip install -r requirements.txt

UPDATE 12/17/2024: the openai package currently installs an httpx version that is incompatible with its own client code. Manually downgrade httpx to 0.27.2 until OpenAI fixes the bug.

pip uninstall httpx
pip install httpx==0.27.2
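To confirm that the downgrade took effect, the installed version can be compared against the pin. This is a small sketch using the standard library's `importlib.metadata`; the helper names are hypothetical.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn a version string such as '0.27.2' into the tuple (0, 27, 2)."""
    return tuple(int(n) for n in re.findall(r"\d+", text)[:3])


def httpx_matches_pin(installed, pinned="0.27.2"):
    """True when the installed httpx version equals the pinned one."""
    return parse_version(installed) == parse_version(pinned)


def installed_httpx_version():
    """Return the installed httpx version string, or None if it is absent."""
    try:
        return metadata.version("httpx")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_httpx_version()
    if version is None:
        print("httpx is not installed")
    else:
        print(f"httpx {version}; matches pin: {httpx_matches_pin(version)}")
```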

DDOT is required for downloading GO and can be installed in one of two ways:

To install DDOT by downloading the zip file of the source tree:

wget https://github.com/idekerlab/ddot/archive/refs/heads/python3.zip
unzip python3.zip
cd ddot-python3
python setup.py bdist_wheel
pip install dist/ddot*py3*whl

To install DDOT by cloning the repo:

git clone --branch python3 https://github.com/idekerlab/ddot.git
cd ddot
python setup.py bdist_wheel
pip install dist/ddot*py3*whl

Documentation

The notebooks are numbered according to the evaluation steps.

Data Preparation (this step can be omitted for testing purposes)

The data is already in the data directory (refer to the README in that directory for detailed information about the data).

If you need to download GO, run the code below:
```
## download and parse GO_BP terms (run in the shell; no spaces around "=")
outdir='data/GO_BP/'
namespace='biological_process'
python process_the_gene_ontology.py "$outdir" --namespace "$namespace"
```
Then run the notebook for parsing GO terms (0.[Prep GO]Download_and_parse_GO.ipynb).

Random contamination is added to the gene sets in the notebook 0. [GO set]add_random_contamination.ipynb.
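The idea behind the contamination step can be sketched as follows: some of the genes in a set are swapped for random genes drawn from a background list. This is an illustration only, not the notebook's code. The function name, the default `fraction=0.5`, and the fixed seed are assumptions made here.

```python
import random


def contaminate(genes, background, fraction=0.5, seed=0):
    """Replace a fraction of `genes` with random background genes.

    Hypothetical helper: `fraction` and `seed` are illustrative defaults.
    """
    rng = random.Random(seed)
    genes = list(genes)
    n_swap = int(len(genes) * fraction)  # number of genes to replace (floored)
    # Candidate replacements must not already belong to the gene set.
    pool = sorted(set(background) - set(genes))
    if n_swap > len(pool):
        raise ValueError("background is too small to draw replacements from")
    kept = rng.sample(genes, len(genes) - n_swap)
    added = rng.sample(pool, n_swap)
    return kept + added
```

With `fraction=1.0` the same helper yields a fully random set of the same size, which is useful as a negative control.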

If you need to download the omics data, run the notebook 0.[Omics_revamped]_ProcessOmicsData.ipynb. The notebook processes the omics data and saves them to a tab-delimited text file.

Query GPT-4 for names and supporting analysis and run functional enrichment

GO gene set GPT-4 analysis is in the notebook 1.[GO set]Run_LLM_analysis.ipynb

GO gene set analysis with different models

Batch run of 1000 GO terms as a Slurm job with a parameter file

Omics gene set GPT-4 analysis and omics gene set g:Profiler analysis

## Example: process the 1st through 5th terms in the table
# Run on the command line

input_file='data/GO_term_analysis/toy_example.csv'  # input table path
config='./jsonFiles/GOLLMrun_config.json'  # configuration file
set_index='GO'  # index column of the table
gene_column='Genes'  # name of the gene list column
start=0
end=5
out_file='data/GO_term_analysis/LLM_processed_toy_example_gpt_4'  # output path prefix

source activate llm_eval
# Run the Python script for the given range
python query_llm_for_analysis.py --config "$config" \
    --initialize \
    --input "$input_file" \
    --input_sep ',' \
    --set_index "$set_index" \
    --gene_column "$gene_column" \
    --gene_sep ' ' \
    --start "$start" \
    --end "$end" \
    --output_file "$out_file"
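To make the role of the range and column options concrete, here is a rough sketch of how such arguments could select gene sets from the input table. It is not the implementation in `query_llm_for_analysis.py`. It assumes that `start` is inclusive and `end` is exclusive, consistent with `start=0`, `end=5` covering the 1st through 5th terms, and the function name is hypothetical.

```python
import csv
import io


def select_gene_sets(csv_text, set_index="GO", gene_column="Genes",
                     gene_sep=" ", start=0, end=5):
    """Return {set id: list of genes} for rows start..end-1 of a CSV table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)[start:end]  # half-open range, like a Python slice
    return {row[set_index]: row[gene_column].split(gene_sep) for row in rows}


# Toy table with made-up identifiers, for illustration only.
toy = "GO,Genes\nGO:1,A B C\nGO:2,D E\nGO:3,F\n"
print(select_gene_sets(toy, start=0, end=2))
```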

Semantic Similarity evaluation of names

GO gene set analysis evaluation

# Get the ranking of similarities from the GO gene set analysis

python rank_GOterm_LLM_sim_rand.py \
    --input_file ./data/GO_term_analysis/LLM_processed_toy_example_w_contamination_gpt_4.tsv \
    --emb_file data/all_go_terms_embeddings_dict.pkl \
    --topn 3 \
    --output_file ./data/GO_term_analysis/simrank_LLM_processed_toy_example.tsv \
    --background_file data/GO_term_analysis/all_go_sim_scores_toy.txt
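The ranking step can be pictured as follows: the embedding of the LLM-proposed name is compared with the embedding of every GO term name, and the position of the queried term in the sorted list is reported together with the top matches. The sketch below assumes cosine similarity and uses toy two-dimensional vectors. The function names are hypothetical, and the script itself may differ in its details.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_of_true_term(name_vec, term_vecs, true_term, topn=3):
    """Return (1-based rank of true_term, the topn most similar terms)."""
    scores = {term: cosine(name_vec, vec) for term, vec in term_vecs.items()}
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(true_term) + 1, ordered[:topn]
```

A rank of 1 means the LLM name is closer to the queried GO term than to any other term in the background.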

Further evaluation of the performance: model comparison, gene set functional enrichment, and gene set similarity comparison

Evaluation Task 1 related

Model Comparison

Analysis related to Fig. 2A: compare the semantic similarities between models

Analysis related to Fig. 3: run GO gene set functional enrichment for control

Compare the confidence scores between real, contaminated, and random gene sets
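Functional enrichment of a gene set against a GO term is commonly scored with a one-sided hypergeometric test. The sketch below shows that calculation with the standard library only. It is a generic illustration with a hypothetical function name and is not necessarily how the repository computes enrichment.

```python
from math import comb


def hypergeom_pvalue(overlap, set_size, term_size, background_size):
    """P(X >= overlap) for a gene set drawn at random from the background.

    overlap: genes shared by the query set and the GO term
    set_size: number of genes in the query set
    term_size: number of background genes annotated to the GO term
    background_size: total number of background genes
    """
    upper = min(set_size, term_size)
    total = comb(background_size, set_size)
    tail = sum(
        comb(term_size, k) * comb(background_size - term_size, set_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total
```

When many terms are tested, the resulting p-values still need a multiple-testing correction.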

Check broader concepts of the LLM names

Analysis for Fig. 2d

Analysis of whether the best-matching GO term is a broader concept than the queried term

Evaluation Task 2 related

Count the genes supporting the LLM name, then calculate the LLM name Jaccard Index

Analysis related to Fig. 4
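For reference, the Jaccard Index of two gene sets is the size of their intersection divided by the size of their union. The snippet below is a minimal sketch of that formula. How the two sets are defined for an LLM name is determined in the notebooks, and the function name here is hypothetical.

```python
def jaccard_index(genes_a, genes_b):
    """Jaccard Index |A ∩ B| / |A ∪ B|, defined as 0.0 for two empty sets."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Two shared genes out of four distinct genes in total.
print(jaccard_index(["A", "B", "C"], ["B", "C", "D"]))  # 0.5
```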

Omics data naming evaluation

To evaluate whether the LLM name matches any significantly enriched GO term name, use the notebook 3B[Omics_revamped]Compare_enriched_GO_and_LLM_name_sim.ipynb

Development and assessment of the citation module

Quantification of the citation module: check the citation module

Visualization of results

Extended Data Fig. 1, Fig. 2, and Fig. 3

Extract sub-hierarchy (Fig. 2e)

Omics figures (Fig. 4, Extended Data Fig. 5)

License

MIT License

Citing

Hu, M., Alkhairy, S., Lee, I. et al. Evaluation of large language models for discovery of gene set function. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02525-x

