diff --git a/proteingym/baselines/vespag/README.md b/proteingym/baselines/vespag/README.md
index 7a5f7db..92ebb91 100644
--- a/proteingym/baselines/vespag/README.md
+++ b/proteingym/baselines/vespag/README.md
@@ -7,7 +7,7 @@
 
 To overcome the sparsity of experimental training data, authors created a dataset of 39 million single amino acid variants from a subset of the Human proteome, which was then annotated using predictions from the multiple sequence alignment-based effect predictor [GEMME](http://www.lcqb.upmc.fr/GEMME/Home.html) ([Laine et al. 2019](https://doi.org/10.1093/molbev/msz179)) as a proxy for experimental scores.
 
-More details on **VespaG** can be found in the corresponding [preprint](https://doi.org/10.1101/2024.04.24.590982).
+More details on **VespaG** can be found in the corresponding [repository](https://github.com/jschlensok/vespag) and [preprint](https://doi.org/10.1101/2024.04.24.590982).
 
 ### Installation
 1. `conda env create -n vespag python==3.10 poetry==1.8.3` (exchange `conda` for `mamba`, `miniconda` or `micromamba` as you like)
@@ -48,12 +48,15 @@ Using DVC is non-optional. There is a `dvc.yaml` file in place that contains sta
 You can reproduce model evaluation using the `eval` subcommand, which pre-processes data into a format usable by VespaG, runs `predict`, and computes performance metrics.
 
 #### ProteinGym
+Download the pre-computed ESM embeddings for the ProteinGym proteins [here](https://marks.hms.harvard.edu/proteingym/baseline_dependencies/VespaG/proteingym_esm2_embeddings.h5) and the VespaG model weights [here](https://marks.hms.harvard.edu/proteingym/baseline_dependencies/VespaG/state_dict_v2.pt).
+
 Run evaluation on the ProteinGym DMS substituions benchmark with `python -m vespag eval proteingym`, with the following options:
 **Optional:**
 - `--reference-file`: Path to ProteinGym reference file. Will download to `data/test/proteingym217/reference.csv` or `data/test/proteingym87/reference.csv` if not provided. 
 - `--dms-directory`: Path to directory containing per-DMS score files in CSV format. Will download to `data/test/proteingym217/raw_dms_files/` or `data/test/proteingym87/raw_dms_files/` if not provided.
 - `--output/-o`:Path for saving created CSV with scores for all assays and variants as well as a CSV with Spearman correlation coefficients for each DMS. Defaults to `./output/proteingym217` or `./output/proteingym87`.
 - `--embeddings/-e`, `--id-map`, `--normalize-scores`: identical to `predict`, used for the internal call to it.
+- `--checkpoint-file`: path to the VespaG model weights file.
 - `--v1` if you want to get a result for the first iteration of ProteinGym with 87 assays.
 
 ## Preprint Citation