Commit

Update VespaG readme
aaronkollasch committed Nov 1, 2024
1 parent 429460f commit e1d603d
Showing 1 changed file with 4 additions and 1 deletion.
5 changes: 4 additions & 1 deletion proteingym/baselines/vespag/README.md
@@ -7,7 +7,7 @@

To overcome the sparsity of experimental training data, the authors created a dataset of 39 million single amino acid variants from a subset of the human proteome, which was then annotated using predictions from the multiple sequence alignment-based effect predictor [GEMME](http://www.lcqb.upmc.fr/GEMME/Home.html) ([Laine et al. 2019](https://doi.org/10.1093/molbev/msz179)) as a proxy for experimental scores.

More details on **VespaG** can be found in the corresponding [preprint](https://doi.org/10.1101/2024.04.24.590982).
More details on **VespaG** can be found in the corresponding [repository](https://github.com/jschlensok/vespag) and [preprint](https://doi.org/10.1101/2024.04.24.590982).

### Installation
1. `conda env create -n vespag python==3.10 poetry==1.8.3` (exchange `conda` for `mamba`, `miniconda` or `micromamba` as you like)
@@ -48,12 +48,15 @@ Using DVC is non-optional. There is a `dvc.yaml` file in place that contains sta
You can reproduce model evaluation using the `eval` subcommand, which pre-processes data into a format usable by VespaG, runs `predict`, and computes performance metrics.
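
For the ProteinGym benchmark, the metric written out per DMS is the Spearman correlation coefficient (see the `--output` option). The following is a minimal, dependency-free sketch of that metric, not VespaG's own evaluation code; the function names `average_ranks`, `pearson` and `spearman` are illustrative assumptions. It ranks both score lists, giving tied values their mean rank, and then takes the Pearson correlation of the ranks.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(predicted, measured):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(predicted), average_ranks(measured))
```

In practice a library routine such as `scipy.stats.spearmanr` computes the same quantity; the hand-written version is only meant to make the ranking step explicit.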

#### ProteinGym
Download the pre-computed ESM embeddings for the ProteinGym proteins [here](https://marks.hms.harvard.edu/proteingym/baseline_dependencies/VespaG/proteingym_esm2_embeddings.h5) and the VespaG model weights [here](https://marks.hms.harvard.edu/proteingym/baseline_dependencies/VespaG/state_dict_v2.pt).
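
The two downloads can also be scripted. The sketch below uses only the Python standard library and the two URLs given above; the helper names `plan_downloads` and `download_all` and any target directory you pass in are assumptions for illustration, not part of VespaG.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

BASE_URL = "https://marks.hms.harvard.edu/proteingym/baseline_dependencies/VespaG"
URLS = [
    f"{BASE_URL}/proteingym_esm2_embeddings.h5",  # pre-computed ESM embeddings
    f"{BASE_URL}/state_dict_v2.pt",  # VespaG model weights
]


def plan_downloads(urls, target_dir):
    """Map each URL to a local path named after the last URL path component."""
    target = Path(target_dir)
    return {url: target / Path(urlparse(url).path).name for url in urls}


def download_all(urls, target_dir):
    """Download every file that is not already present; return the local paths."""
    plan = plan_downloads(urls, target_dir)
    for url, path in plan.items():
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():  # skip files fetched by an earlier run
            urlretrieve(url, path)
    return list(plan.values())
```

Calling `download_all(URLS, "some/local/dir")` fetches both files into a directory of your choice; the resulting paths can then be passed to `--embeddings` and `--checkpoint-file`.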

Run evaluation on the ProteinGym DMS substitutions benchmark with `python -m vespag eval proteingym`, with the following options:
**Optional:**
- `--reference-file`: Path to ProteinGym reference file. Will download to `data/test/proteingym217/reference.csv` or `data/test/proteingym87/reference.csv` if not provided.
- `--dms-directory`: Path to directory containing per-DMS score files in CSV format. Will download to `data/test/proteingym217/raw_dms_files/` or `data/test/proteingym87/raw_dms_files/` if not provided.
- `--output/-o`: Path for saving the generated CSV with scores for all assays and variants, as well as a CSV with Spearman correlation coefficients for each DMS. Defaults to `./output/proteingym217` or `./output/proteingym87`.
- `--embeddings/-e`, `--id-map`, `--normalize-scores`: identical to `predict`, used for the internal call to it.
- `--checkpoint-file`: path to the VespaG model weights file.
- `--v1`: Report results for the first iteration of ProteinGym with 87 assays.
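
As an illustration of how these options combine, here is a small Python sketch that assembles and runs the evaluation command. It uses only flags listed above; the helper names `build_eval_command` and `run_eval` are assumptions for illustration, and any file paths you pass are your own.

```python
import subprocess
import sys


def build_eval_command(embeddings=None, checkpoint_file=None, output=None, v1=False):
    """Assemble the argument list for `python -m vespag eval proteingym`."""
    cmd = [sys.executable, "-m", "vespag", "eval", "proteingym"]
    if embeddings is not None:
        cmd += ["--embeddings", str(embeddings)]
    if checkpoint_file is not None:
        cmd += ["--checkpoint-file", str(checkpoint_file)]
    if output is not None:
        cmd += ["--output", str(output)]
    if v1:
        cmd.append("--v1")  # evaluate on the first ProteinGym iteration (87 assays)
    return cmd


def run_eval(**options):
    """Run the evaluation; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_eval_command(**options), check=True)
```

For example, `run_eval(embeddings="proteingym_esm2_embeddings.h5", checkpoint_file="state_dict_v2.pt")` would evaluate with the downloaded files, assuming they sit in the working directory and VespaG is installed in the active environment.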

## Preprint Citation