Commit

Merge pull request #96 from ramanathanlab/develop
Tokenizer update + Batch inference QOL changes
braceal authored May 3, 2023
2 parents b2c3a7a + 8d2fd77 commit 71beb03
Showing 17 changed files with 443 additions and 51 deletions.
25 changes: 22 additions & 3 deletions README.md
@@ -22,6 +22,9 @@ pip install git+https://github.com/ramanathanlab/genslm
GenSLMs were trained on the [Polaris](https://www.alcf.anl.gov/polaris) and [Perlmutter](https://perlmutter.carrd.co/) supercomputers. For installation on these systems, please see [`INSTALL.md`](https://github.com/ramanathanlab/genslm/blob/main/docs/INSTALL.md).

## Usage
+> :warning: **Model weights will be unavailable from May 5, 2023 to May 12, 2023.**
+> :warning: **Model weights downloaded prior to May 3, 2023 contain a minor namespace issue. Please re-download the models to get the fix.**
Our pre-trained models and datasets can be downloaded from this [Globus Endpoint](https://app.globus.org/file-manager?origin_id=25918ad0-2a4e-4f37-bcfc-8183b19c3150&origin_path=%2F).

@@ -34,9 +37,14 @@ import numpy as np
from torch.utils.data import DataLoader
from genslm import GenSLM, SequenceDataset

+# Load model
model = GenSLM("genslm_25M_patric", model_cache_dir="/content/gdrive/MyDrive")
model.eval()

+# Select GPU device if it is available, else use CPU
+device = "cuda" if torch.cuda.is_available() else "cpu"
+model.to(device)

# Input data is a list of gene sequences
sequences = [
"ATGAAAGTAACCGTTGTTGGAGCAGGTGCAGTTGGTGCAAGTTGCGCAGAATATATTGCA",
@@ -50,9 +58,15 @@ dataloader = DataLoader(dataset)
embeddings = []
with torch.no_grad():
    for batch in dataloader:
-        outputs = model(batch["input_ids"], batch["attention_mask"], output_hidden_states=True)
+        outputs = model(
+            batch["input_ids"].to(device),
+            batch["attention_mask"].to(device),
+            output_hidden_states=True,
+        )
        # outputs.hidden_states shape: (layers, batch_size, sequence_length, hidden_size)
-        emb = outputs.hidden_states[0].detach().cpu().numpy()
+        # Use the embeddings of the last layer
+        emb = outputs.hidden_states[-1].detach().cpu().numpy()
        # Compute average over sequence length
        emb = np.mean(emb, axis=1)
        embeddings.append(emb)
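(Editor's aside, not part of the diff: each loop iteration appends an array of shape `(batch_size, hidden_size)`, so the collected list can be stacked into one matrix. A minimal sketch of that final step, assuming the `embeddings` list and the `numpy` import from the README example above:)

```python
# Editor's sketch, not part of the commit: combine per-batch mean embeddings
# into a single (num_sequences, hidden_size) array.
embeddings = np.concatenate(embeddings)  # stacks along the batch axis
print(embeddings.shape)  # (number of input sequences, hidden_size)
```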
@@ -67,11 +81,16 @@ embeddings.shape
```python
from genslm import GenSLM

+# Load model
model = GenSLM("genslm_25M_patric", model_cache_dir="/content/gdrive/MyDrive")
model.eval()

+# Select GPU device if it is available, else use CPU
+device = "cuda" if torch.cuda.is_available() else "cpu"
+model.to(device)

# Prompt the language model with a start codon
-prompt = model.tokenizer.encode("ATG", return_tensors="pt")
+prompt = model.tokenizer.encode("ATG", return_tensors="pt").to(device)

tokens = model.model.generate(
prompt,
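(Editor's aside, not part of the diff: the `generate` call above is truncated in this view. A hedged sketch of how it might be completed and decoded, assuming `model.model` and `model.tokenizer` wrap standard Hugging Face objects; the sampling parameters below are illustrative, not the repository's actual values:)

```python
# Editor's sketch, not part of the commit. Sampling parameters are illustrative.
tokens = model.model.generate(
    prompt,
    max_length=10,           # hypothetical maximum length, in codon tokens
    do_sample=True,          # sample rather than greedy-decode
    top_k=50,
    top_p=0.95,
    num_return_sequences=2,  # generate two candidate sequences
)

# Decode token IDs back to space-separated codons via the Hugging Face tokenizer API.
sequences = model.tokenizer.batch_decode(tokens, skip_special_tokens=True)
print(sequences)
```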
6 changes: 3 additions & 3 deletions docs/COMMANDS.md
@@ -29,7 +29,7 @@ python -m genslm.cmdline.remove_neox_attention_bias \
2. Set up a config file that looks like this:
```
load_pt_checkpoint: /home/hippekp/CVD-Mol-AI/hippekp/model_training/25m_genome_embeddings/model-epoch69-val_loss0.01.pt
-tokenizer_file: /home/hippekp/github/genslm/genslm/tokenizer_files/codon_wordlevel_100vocab.json
+tokenizer_file: /home/hippekp/github/genslm/genslm/tokenizer_files/codon_wordlevel_69vocab.json
data_file: $DATA.h5
embeddings_out_path: /home/hippekp/CVD-Mol-AI/hippekp/model_training/25m_genome_embeddings/train_embeddings/
model_config_json: /lus/eagle/projects/CVD-Mol-AI/hippekp/model_training/genome_finetuning_25m/config/neox_25,290,752.json
@@ -64,7 +64,7 @@ Converting a directory of fasta files into a directory of h5 files (Step one of
python -m genslm.cmdline.fasta_to_h5 \
--fasta $PATH_TO_FASTA_DIR \
--h5_dir $PATH_TO_OUTDIR \
---tokenizer_file ~/github/genslm/genslm/tokenizer_files/codon_wordlevel_100vocab.json
+--tokenizer_file ~/github/genslm/genslm/tokenizer_files/codon_wordlevel_69vocab.json
```
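(Editor's aside, not part of the diff: the `codon_wordlevel_69vocab.json` file that replaces `codon_wordlevel_100vocab.json` throughout this commit looks like a Hugging Face `tokenizers` word-level vocabulary. A hedged sketch of inspecting it, assuming it is a standard `tokenizers` JSON serialization and that codons are space-separated, neither of which this commit confirms:)

```python
# Editor's sketch, not part of the commit: inspect the codon tokenizer file.
# Assumes a standard Hugging Face `tokenizers` JSON (an assumption, not shown here).
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("genslm/tokenizer_files/codon_wordlevel_69vocab.json")
print(tokenizer.get_vocab_size())           # filename suggests roughly 69 entries
encoding = tokenizer.encode("ATG AAA GTA")  # codons as space-separated words (assumed)
print(encoding.tokens, encoding.ids)
```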

Converting a directory of h5 files into a single h5 file (Step two of data preprocessing for pretraining, output of this step is what we use for pretraining)
@@ -83,7 +83,7 @@ Converting individual fasta files into individual h5 files (Useful for getting e
python -m genslm.cmdline.single_fasta_to_h5 \
-f $PATH_TO_SINGLE_FASTA \
--h5 $PATH_TO_SINGLE_H5 \
--t ~/github/genslm/genslm/tokenizer_files/codon_wordlevel_100vocab.json \
+-t ~/github/genslm/genslm/tokenizer_files/codon_wordlevel_69vocab.json \
-b 10240 \
-n 16 \
--train_val_test_split
14 changes: 11 additions & 3 deletions examples/embedding.ipynb

Some generated files are not rendered by default.

8 changes: 7 additions & 1 deletion examples/generate.ipynb

Some generated files are not rendered by default.

@@ -16,7 +16,7 @@ limit_val_batches: 32
check_val_every_n_epoch: 1
checkpoint_every_n_train_steps: 500
checkpoint_every_n_epochs: null
-tokenizer_file: ../../genslm/tokenizer_files/codon_wordlevel_100vocab.json
+tokenizer_file: ../../genslm/tokenizer_files/codon_wordlevel_69vocab.json
train_file: /path/to/data/first_year/first_year_train.h5
val_file: /path/to/data/first_year/first_year_val.h5
test_file: /path/to/data/first_year/first_year_val.h5
@@ -16,7 +16,7 @@ limit_val_batches: 32
check_val_every_n_epoch: 1
checkpoint_every_n_train_steps: 500
checkpoint_every_n_epochs: null
-tokenizer_file: ../../genslm/tokenizer_files/codon_wordlevel_100vocab.json
+tokenizer_file: ../../genslm/tokenizer_files/codon_wordlevel_69vocab.json
train_file: /path/to/data/first_year/first_year_train.h5
val_file: /path/to/data/first_year/first_year_val.h5
test_file: /path/to/data/first_year/first_year_val.h5
2 changes: 1 addition & 1 deletion examples/training/foundation_models/250M_foundation.yaml
@@ -15,7 +15,7 @@ limit_val_batches: 32
check_val_every_n_epoch: 1
checkpoint_every_n_train_steps: 500
checkpoint_every_n_epochs: null
-tokenizer_file: ../../genslm/tokenizer_files/codon_wordlevel_100vocab.json
+tokenizer_file: ../../genslm/tokenizer_files/codon_wordlevel_69vocab.json
train_file: /path/to/data/patric_89M/pgfam_30k_h5_tts/combined_train.h5
val_file: /path/to/data/patric_89M/pgfam_30k_h5_tts/combined_val.h5
test_file: /path/to/data/patric_89M/pgfam_30k_h5_tts/combined_test.h5
2 changes: 1 addition & 1 deletion examples/training/foundation_models/25B_foundation.yaml
@@ -16,7 +16,7 @@ limit_val_batches: 32
check_val_every_n_epoch: 1
checkpoint_every_n_train_steps: 50
checkpoint_every_n_epochs: null
-tokenizer_file: ../../genslm/tokenizer_files/codon_wordlevel_100vocab.json
+tokenizer_file: ../../genslm/tokenizer_files/codon_wordlevel_69vocab.json
train_file: /path/to/data/patric_89M/pgfam_30k_h5_tts/combined_train.h5
val_file: /path/to/data/patric_89M/pgfam_30k_h5_tts/combined_val.h5
test_file: /path/to/data/patric_89M/pgfam_30k_h5_tts/combined_test.h5
2 changes: 1 addition & 1 deletion examples/training/foundation_models/25M_foundation.yaml
@@ -15,7 +15,7 @@ limit_val_batches: 32
check_val_every_n_epoch: 1
checkpoint_every_n_train_steps: 500
checkpoint_every_n_epochs: null
-tokenizer_file: ../../genslm/tokenizer_files/codon_wordlevel_100vocab.json
+tokenizer_file: ../../genslm/tokenizer_files/codon_wordlevel_69vocab.json
train_file: /path/to/data/patric_89M/pgfam_30k_h5_tts/combined_train.h5
val_file: /path/to/data/patric_89M/pgfam_30k_h5_tts/combined_val.h5
test_file: /path/to/data/patric_89M/pgfam_30k_h5_tts/combined_test.h5
2 changes: 1 addition & 1 deletion examples/training/foundation_models/2B_foundation.yaml
@@ -2,7 +2,7 @@ wandb_active: true
wandb_project_name: codon_transformer
wandb_entity_name: gene_mdh_gan
checkpoint_dir: patric_2.5B_pretraining/checkpoints_v2/
-tokenizer_file: ../../genslm/tokenizer_files/codon_wordlevel_100vocab.json
+tokenizer_file: ../../genslm/tokenizer_files/codon_wordlevel_69vocab.json
train_file: /path/to/data/patric_89M/pgfam_30k_h5_tts/combined_train.h5
val_file: /path/to/data/patric_89M/pgfam_30k_h5_tts/combined_val.h5
test_file: /path/to/data/patric_89M/pgfam_30k_h5_tts/combined_test.h5
2 changes: 1 addition & 1 deletion genslm/__init__.py
@@ -1,4 +1,4 @@
__version__ = "0.0.3a1"
__version__ = "0.0.4a1"

# Public imports
from genslm.dataset import SequenceDataset # noqa
2 changes: 1 addition & 1 deletion genslm/cmdline/process_single_family_file.py
@@ -31,7 +31,7 @@ def main(input_fasta: Path, output_h5: Path, tokenizer_path: Path, block_size: i
"--tokenizer_file",
help="Path to tokenizer file",
default=(
-fp.parent.parent / "genslm/tokenizer_files/codon_wordlevel_100vocab.json"
+fp.parent.parent / "genslm/tokenizer_files/codon_wordlevel_69vocab.json"
),
)
parser.add_argument(