
Thesis: Prediction of protein O-glycosylation using language models

This repository contains the source code for the updated version of NetOGlyc (NetOGlyc 5), which replaces the external tools used by previous versions with embeddings from the pretrained protein language model ESM-1b.
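For reference, the 1280-dimensional embeddings and the 1022-residue sequence limit that appear in the configs below come from ESM-1b. A minimal sketch of extracting per-residue embeddings with the fair-esm package (the exact embedding pipeline used by this repository may differ):

import torch
import esm

# Load the pretrained ESM-1b model and its alphabet/tokenizer
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("protein1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    results = model(tokens, repr_layers=[33])

# Final-layer representations: one 1280-dim vector per residue;
# slice off the BOS/EOS tokens added by the tokenizer
embeddings = results["representations"][33][0, 1:len(strs[0]) + 1]
print(embeddings.shape)  # torch.Size([33, 1280])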

NOG5Thesis/
│
├── netoglyc5/ - model training/prediction library based on NetSurfP-3.0
│    │
│    ├── nog5/nog5/
│    │    │
│    │    ├── logging.yml - logging configuration
│    │    ├── cli.py - command line interface
│    │    ├── main.py - main script to start train/test
│    │    │
│    │    ├── base/ - abstract base classes
│    │    │
│    │    ├── dataloaders/ - anything about data loading goes here
│    │    │
│    │    ├── datasets/ - anything about dataset formats goes here
│    │    │
│    │    ├── embeddings/ - folder containing the ESM1b model
│    │    │
│    │    ├── models/ - models
│    │    │
│    │    ├── output/
│    │    │    │
│    │    │    ├── loss/ - loss functions
│    │    │    │
│    │    │    ├── metrics/ - performance metric functions
│    │    │    │
│    │    │    ├── misc/ - masking etc.
│    │    │    │
│    │    │    └── saving/ - prediction output savers
│    │    │
│    │    ├── predict/ - predictors
│    │    │
│    │    ├── train/ - trainers
│    │    │
│    │    ├── eval/ - evaluators
│    │    │
│    │    └── utils/ - utilities for logging and tensorboard visualization
│    │
│    ├── data/ - directory for storing input data
│    │
│    ├── study/ - directory for storing optuna studies
│    │
│    ├── models/ - directory for storing pre-trained models
│    │
│    └── saved/ - directory for checkpoints, logs and evaluation results
│
├── notebooks/ - directory of notebooks used for data pipeline
│
└── glyc_processing/ - MS-peptide glycosylation data cleaning library

Start by creating a conda environment with the project requirements:

$ conda env create --file environment_loose.yml
$ conda activate nog5

Now you can either install the package and use it out of the box:

$ cd netoglyc5/nog5
$ pip install .

Or install it in editable mode to develop the project further. This creates a symbolic link to the package, so changes to the source code are applied automatically:

$ pip install -e .

Train a model from an experiment configuration (this also evaluates the best model at the end):

$ nog5 train -c experiments/config.yml

Predict using a trained model and its prediction configuration:

$ nog5 predict -c config.yml -d model.pth -i example_input.txt -o example_output.txt

Config files are in .yml format:

name: CNNTrans_NetOGlyc_NetSurfP
save_dir: saved/nog5
seed: 1234
target_devices: [0]

arch:
  type: CNNTrans_NetOGlyc_NetSurfP
  args:
    init_n_channels: 1280
    out_channels: 32
    cnn_layers: 2
    kernel_size: [129, 257]
    padding: [64, 128]
    n_head: 21
    dropout: 0.5
    encoder_layers: 2
    #embedding_pretrained: "models/esm1b_t33_650M_UR50S.pt"

dataloaders:
  training:
    paths: ["protein_embeddings_netsurfp_output_glyc_labels_max.h5"]
    type: BasicDataLoader
    args:
      batch_size: 16
      num_workers: 2
      shuffle: true
      validation_split: 0.05
      #training_indices: [0, 1, 2, 3]
      #validation_indices: [4, 44, 53, 71, 83]
    dataset:
      type: H5TrainEvalDataset
      args:
        truncate_seq_length: 1022
        embedding_features: 1280
        label_names: [ss8, dis, rsa, phi, psi, gly]
        label_sizes: [8, 1, 1, 1, 1, 1]
  testing:
    paths: ["protein_embeddings_netsurfp_output_glyc_labels_max.h5"]
    type: BasicDataLoader
    args:
      batch_size: 16
      num_workers: 2
      shuffle: false
      training_indices: [15, 25, 50, 66, 78, 87]
    dataset:
      type: H5TrainEvalDataset
      args:
        truncate_seq_length: 1022
        embedding_features: 1280
        label_names: [ss8, dis, rsa, phi, psi, gly]
        label_sizes: [8, 1, 1, 1, 1, 1]
  prediction:
    paths: ["protein_embeddings_netsurfp_output_glyc_labels_max.h5"]
    type: BasicDataLoader
    args:
      batch_size: 8
      num_workers: 2
      shuffle: false
      training_indices: [15, 25, 50, 66, 78, 87, 102]
    dataset:
      type: H5PredictionDataset
      args:
        embedding_features: 1280

prediction:
  allow_cuda: True
  log_step: 50
  #data_transform:
    #type: ESM1bTokenize
  saver:
    type: H5Saver
    args:
      #embedding_features: 1280
      label_names: [ss8, dis, rsa, phi, psi, gly]
      label_sizes: [8, 1, 1, 1, 1, 1]
      #target_is_output_labels: True
      #data_is_output_embeddings: True
    labels_transform: multi_task_save_output

training:
  early_stop: 3
  epochs: 100
  monitor: max val_gly_unambiguous_mcc
  save_period: 1
  log_step: 1
  tensorboard: true

optimizer:
  type: AdamW
  args:
    lr: 5.0e-05
    #weight_decay: 1.0e-3
    #momentum: 0.9

lr_scheduler:
  type: null

multitask_loss:
#  type: AutomaticWeightedLoss
  type: WeightedLoss
  args:
    loss_names: [ss8_bce, dis_mse, rsa_mse, phi_mse, psi_mse, gly_definite_mse]
    loss_weights: [1, 1, 1, 1, 1, 2000]
    loss_args: [{}, {}, {}, {}, {}, {positive_weight: 0.01}]

metrics: [ss8_pcc, ss3_pcc, dis_pcc, rsa_pcc, phi_mae, psi_mae, gly_pcc, gly_definite_mcc, gly_ambiguous_mcc, gly_fpr, gly_fnr]
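The multitask_loss section above combines the per-task losses under fixed weights. A hypothetical sketch of such a weighted sum (not the repository's actual WeightedLoss implementation):

import torch

def weighted_multitask_loss(losses, weights):
    """Sum per-task losses scaled by fixed weights, mirroring the
    loss_names/loss_weights pairs in the config."""
    return sum(weights[name] * loss for name, loss in losses.items())

# gly_definite_mse is up-weighted by 2000 relative to the other tasks:
losses = {"ss8_bce": torch.tensor(0.51), "gly_definite_mse": torch.tensor(0.0004)}
weights = {"ss8_bce": 1.0, "gly_definite_mse": 2000.0}
total = weighted_multitask_loss(losses, weights)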

Add additional configuration entries as needed; each component follows the same type/args convention.
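A minimal sketch of how such type/args entries are typically resolved in pytorch-template-style projects like this one (the helper name init_obj is illustrative, not the repository's actual API):

import importlib

def init_obj(cfg: dict, module_path: str, **extra):
    """Instantiate cfg['type'] from the given module with cfg['args']."""
    module = importlib.import_module(module_path)
    cls = getattr(module, cfg["type"])
    return cls(**{**cfg.get("args", {}), **extra})

# e.g. the optimizer section above would resolve roughly as:
# optimizer = init_obj(config["optimizer"], "torch.optim",
#                      params=model.parameters())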

Modify the configuration in a .yml config file, then run:

$ nog5 train -c experiments/<config>.yml

You can resume from a previously saved checkpoint with:

$ nog5 train -c experiments/<config>.yml -r path/to/checkpoint.pth

You can specify the name of the training session in config files:

"name": "CNNTrans_NetOGlyc_NetSurfP"

Checkpoints will be saved in save_dir/name/timestamp/checkpoints/, with the timestamp in YYYY-mmdd-HHMMSS format.

A copy of the config file is saved in the same folder.

Note: checkpoints contain:

checkpoint = {
  'arch': arch,
  'epoch': epoch,
  'state_dict': self.model.state_dict(),
  'optimizer': self.optimizer.state_dict(),
  'monitor_best': self.mnt_best,
  'config': self.config,
  'loss': self.loss.state_dict(), # Only if using AutomaticWeightedLoss
}
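A minimal sketch of restoring state from such a checkpoint with plain PyTorch (the repository's own resume logic lives behind nog5 train -r; the model and optimizer stand-ins below are hypothetical):

import torch
import torch.nn as nn

# Stand-ins for the objects that would be rebuilt from checkpoint['config']
model = nn.Linear(1280, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

checkpoint = torch.load("path/to/checkpoint.pth", map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])
optimizer.load_state_dict(checkpoint["optimizer"])
start_epoch = checkpoint["epoch"] + 1
best_score = checkpoint["monitor_best"]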

This template supports TensorBoard visualization (https://pytorch.org/docs/stable/tensorboard.html).

  1. Run training

    Set the tensorboard option in the config file to true.

  2. Open the TensorBoard server

    Run tensorboard --logdir saved/experiment_name_here/ at the project root; the server will open at http://localhost:6006.

By default, the values of the loss and the metrics specified in the config file are logged. If you need more visualizations, use add_scalar('tag', data), add_image('tag', image), etc. in the trainer._train_epoch method. The add_*() methods in this template are thin wrappers around those of the tensorboard.SummaryWriter module.

Note: you don't have to specify the current step, since the TensorboardWriter class defined in utils/visualization.py tracks it for you.
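For illustration, the equivalent call on the raw SummaryWriter that the template's writer wraps (here with an explicit step, which the wrapped writer supplies for you):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="saved/demo")
for step, loss in enumerate([0.9, 0.6, 0.4]):
    writer.add_scalar("train/loss", loss, global_step=step)
writer.close()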
