This repository has been archived by the owner on Jun 22, 2024. It is now read-only.

AsymmetriC AutoeNcodEr (ACANE → AkAne). This model is part of an MSc Electrochemistry and Battery Technologies project (2022 - 2023) at the University of Southampton.

Augus1999/AkAne


AkAne: a bidirectional model that predicts molecular properties and generates molecular structures


Proudly made at the University of Southampton in 2023.

Presented at the 20th Nano Bio Info Chemistry Symposium.

[Figure: model scheme]

Web APP

First download the compiled models (torchscript_model.7z) from the release and extract the folder torchscript_model into the same directory as app.py. Then run $ python app.py to launch the web app locally.

Trained models

We provide a pre-trained autoencoder; prediction models trained on the MoleculeNet benchmark (ESOL, FreeSolv, Lipo, BBBP, BACE, ClinTox, HIV), QM9, PhotoSwitch, AqSolDB, a CMC value dataset, and a range of deep eutectic solvent (DES) properties; and two generation models that generate protein ligands and DES pairs, respectively.

You can download trained models from the release.

Dataset format

The datasets we used and provide are stored in CSV files. The Python class CSVData in akane2/utils/dataset.py handles these files, which require a header with the following tags:

  • smiles (mandatory): entries under this tag should be molecular SMILES strings. This tag may appear more than once.
  • temperature (optional): the temperature in kelvin. Repeating this tag does not raise an error, but only the last occurrence is used.
  • ratio (optional): molar ratio of each compound in the format x1:x2:...:xn. Repeating this tag does not raise an error, but only the last occurrence is used.
  • value (optional): entries under this tag should be molecular properties. This tag may appear more than once, in which case you can tell CSVData which value(s) to load by specifying label_idx=[...]. If a property is not defined, leave the cell empty; the entry is then automatically masked to torch.inf, which tells the model that the property is unknown.
  • seq (optional): a FASTA-style protein sequence. Repeating this tag does not raise an error, but only the last occurrence is used. NOTE THAT WHEN THIS TAG IS USED, MOLECULAR PROPERTIES (IF PRESENT IN THE FILE) WILL NOT BE LOADED.

The tags may appear in any order, e.g.,

smiles,value,value,ratio,smiles

and

smiles,smiles,ratio,value,value

are both fine.
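As a rough illustration of the header rules above, the following standard-library sketch writes a small CSV with repeated smiles and value tags and reads it back. It does not use CSVData; the helpers parse_ratio and parse_value are hypothetical names that only mimic the described behaviour (an empty value cell becoming infinity, the last ratio column being used), and the numbers in the row are placeholders, not measurements.

```python
import csv
import io
import math

# Header with repeated "smiles" and "value" tags; the tag order is arbitrary.
header = ["smiles", "smiles", "ratio", "temperature", "value", "value"]
# Placeholder row (ethanol + water, 1:1); the numbers are NOT measurements.
rows = [["CCO", "O", "1:1", "298.15", "1.0", ""]]


def parse_ratio(text):
    """Hypothetical helper: split 'x1:x2:...:xn' into floats."""
    return [float(part) for part in text.split(":")]


def parse_value(text):
    """Hypothetical helper: an empty cell marks an unknown property."""
    return float(text) if text.strip() else math.inf


# Write the file into an in-memory buffer, then read it back.
buffer = io.StringIO()
csv.writer(buffer).writerows([header, *rows])
buffer.seek(0)

reader = csv.reader(buffer)
tags = next(reader)
value_columns = [i for i, tag in enumerate(tags) if tag == "value"]
# Only the last "ratio" column is used when the tag is repeated.
ratio_column = max(i for i, tag in enumerate(tags) if tag == "ratio")

parsed = [
    {
        "ratio": parse_ratio(row[ratio_column]),
        "values": [parse_value(row[i]) for i in value_columns],
    }
    for row in reader
]
print(parsed)
```

Because the header contains duplicate tags, csv.writer and csv.reader are used here rather than the dictionary-based readers, which would collapse repeated column names.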

Training thy own model

The following is a guide to training your own model.

1. Create your dataset following the dataset format

2. Split your dataset

from akane2.utils import split_dataset

split_ratio = 0.8  # you can use any training:testing ratio from 0 to 1
method = "random"  # another choice is "scaffold"
split_dataset("YOUR_DATASET.csv", split_ratio, method)

This will split your dataset into YOUR_DATASET_train.csv and YOUR_DATASET_test.csv.
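To make the meaning of split_ratio concrete, here is a minimal sketch of a random split by fraction. It is not the implementation of split_dataset; the function random_split and its seed argument are hypothetical, and it assumes that split_ratio is the fraction of rows assigned to the training set.

```python
import random


def random_split(rows, train_fraction, seed=0):
    """Hypothetical helper: shuffle rows and cut them at train_fraction."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = [f"molecule_{i}" for i in range(10)]
train_rows, test_rows = random_split(rows, 0.8)
print(len(train_rows), len(test_rows))
```

A scaffold split differs in that molecules sharing a core structure are kept on the same side of the cut, which this sketch does not attempt.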

3. Load your data

from akane2.utils import CSVData

limit = None  # you can specify how many data points you want to load, e.g., 1200
label_index = None  # see the above "Dataset format" section
train_set = CSVData("YOUR_DATASET_train.csv", limit, label_index)
test_set = CSVData("YOUR_DATASET_test.csv", limit, label_index)

4. Define your work space

from pathlib import Path

cwd = Path(__file__).parent
workdir = cwd / "YOUR_WORKDIR"  # the directory where checkpoints (if any) will be stored
logdir = cwd / "YOUR_LOG.log"  # where to print the log (you can set it to "None")
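Note that __file__ is only defined when the code runs as a script; in a notebook or an interactive session it raises NameError. A possible workaround, sketched below with a hypothetical helper script_dir, falls back to the current working directory in that case.

```python
from pathlib import Path


def script_dir() -> Path:
    """Hypothetical helper: directory of this script, or the CWD as a fallback."""
    try:
        return Path(__file__).resolve().parent
    except NameError:  # __file__ is undefined in notebooks and the REPL
        return Path.cwd()


cwd = script_dir()
workdir = cwd / "YOUR_WORKDIR"  # checkpoints (if any) go here
logdir = cwd / "YOUR_LOG.log"  # log file path
print(workdir, logdir)
```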

5. Define your model

We provide two types of model (hence the 2 in the package name): akane2.representation.AkAne (the whole AkAne model) and akane2.representation.Kamome (the independent encoder part, without latent space regularisation, connected directly to the readout block).

  • If you are only interested in property prediction or molecule classification, we recommend using only the encoder model:
from akane2.representation import Kamome

num_task = 1  # number of tasks in one output, i.e., if you want to predict [HOMO, LUMO, gap] together then set `num_task = 3`
model = Kamome(num_task=num_task)  #  DON'T FORGET TO SET OTHER IMPORTANT HYPERPARAMETERS
  • If you are going to train a generative or bidirectional model, please use the whole model:
from akane2.representation import AkAne

num_task = 2
label_mode = "class:2"  # see the comments in `akane2/representation.py` about how to set a proper value
model = AkAne(num_task=num_task, label_mode=label_mode)  #  DON'T FORGET TO SET OTHER IMPORTANT HYPERPARAMETERS

IMPORTANT: For the hyperparameters (e.g., num_task and label_mode) that DEFINE the functionality of the model, please refer to the comments under each model in representation.py.

6. Train your model

import os
from akane2.utils import train, find_recent_checkpoint

os.environ["NUM_WORKER"] = "4"  # set `num_workers` of torch.utils.data.DataLoader (the default value is min(4, num_cpu_cores) if you remove this line)
chkpt = find_recent_checkpoint(workdir)  # find latest checkpoint (if any)
mode = "predict"  # training mode based on thy desire. Other options are "autoencoder", "classify", and "diffusion"
n_epochs = 1000  # training epochs
batch_size = 5  # define batch-size. Choose thy own value that won't cause `CUDA out of memory` error
save_every = 100  # save a checkpoint every `save_every` epochs (you can set to "None")
train(model, train_set, mode, n_epochs, batch_size, chkpt, logdir, workdir, save_every)

You will find the weights of the trained model (trained.pt) and, if any, the checkpoint file(s) (state-xxxx.pth) under workdir. You can safely delete any checkpoint files you do not need. NOTE: To obtain a generative model, you must first train an autoencoder (or fine-tune a pre-trained one) and then train the diffusion model.

7. Test your model (ignore this step if you are training an autoencoder or generation model)

import os
from akane2.utils import test

os.environ["INFERENCE_BATCH_SIZE"] = "20"  # set the inference batch-size that won't cause `CUDA out of memory` error (the default value is 20 if you remove this line)
mode = "prediction"  # testing mode based on thy model. Another choice is "classification"
print(test(model, test_set, mode, workdir / "trained.pt", logdir))  # trained.pt is the weight file written by step 6

8. Visualise the training loss (optional)

import matplotlib.pyplot as plt
from akane2.utils import extract_log_info

info = extract_log_info(logdir)
plt.plot(info["epoch"], info["loss"])
plt.xlabel("epoch")
plt.ylabel("MSE loss")
plt.yscale("log")
plt.show()

Inference

Here are some examples:

import torch
from akane2.representation import AkAne, Kamome
from akane2.utils.graph import smiles2graph, gather
from akane2.utils.token import protein2vec

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

############## define the input to encoder ##############
smiles = "FC1=CC(C(OCC)=O)=CC(F)=C1/N=N/C2=C(F)C=C(C(OCC)=O)C=C2F"
mol = gather([smiles2graph(smiles)])  # get a molecular graph from SMILES
mol["node"] = mol["node"].to(device)
mol["edge"] = mol["edge"].to(device)

############## define the labels to diffusion model ##############
with open("5lqv.fasta", "r") as f:
    fasta = f.readlines()[1]
protein_label = torch.tensor([protein2vec(fasta)], device=device)  # get embedded vectors from FASTA
class_label = torch.tensor([[1]], dtype=torch.long, device=device)

############## load models and inference ##############
model = torch.jit.load("torchscript_model/moleculenet/freesolv.pt").to(device)  # load a compiled Kamome model
result = model(mol)
print(result)

model = torch.jit.load("torchscript_model/protein_ligand.pt").to(device)  # load a compiled generative AkAne model
result = model.generate(size=[1, 20, 1], label=protein_label)  # batch-size=1 mol-size=20 beam-size=1
print(result)

model = AkAne(num_task=2, label_mode="class:2").pretrained("model_akane/hiv_bidirectional.pt").to(device)  # load a bidirectional AkAne model from saved model weight
result = model.inference(mol)
print(result)
result = model.generate(size=[1, 17, 1], label=class_label)  # batch-size=1 mol-size=17 beam-size=1
print(result)
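The example above takes the second line of the FASTA file, which covers a single-line sequence only and keeps the trailing newline. For files that wrap the sequence over several lines, a helper along the following lines may be useful. This is a sketch: read_fasta_sequence is a hypothetical function, the demo sequence is a placeholder, and whether protein2vec accepts the joined string unchanged is an assumption to check against akane2/utils/token.py.

```python
import tempfile
from pathlib import Path


def read_fasta_sequence(path):
    """Hypothetical helper: return the first record's sequence as one string."""
    residues = []
    seen_header = False
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header:  # a second record starts here; stop reading
                break
            seen_header = True
            continue
        residues.append(line)
    return "".join(residues)


# Demo with a placeholder record written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.fasta"
    demo.write_text(">demo record\nMKT\nAYI\n>second record\nGGG\n")
    sequence = read_fasta_sequence(demo)
print(sequence)
```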

Known issues

  • You cannot compile two or more AkAne models (i.e., akane2.representation.AkAne) into TorchScript modules together in one file. We recommend saving the compiled models beforehand and loading them with torch.jit.load(...).
  • Directly loading a TorchScript model, or compiling a Python model to TorchScript via model = torch.jit.script(model), makes inference about 10 times slower. We recommend freezing the TorchScript model for evaluation by adding the line model = torch.jit.freeze(model.eval()) to eliminate the warm-up.
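The freezing step can be tried on any small module. The sketch below uses a stand-in network, TinyNet, which is hypothetical and unrelated to the AkAne architecture; it only shows that torch.jit.freeze expects a scripted module in eval mode and that the frozen module returns the same outputs.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Stand-in module for illustration; not part of akane2."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.linear(x))


scripted = torch.jit.script(TinyNet())
# freeze() requires eval mode; it inlines parameters and attributes as constants.
frozen = torch.jit.freeze(scripted.eval())

x = torch.randn(2, 4)
with torch.no_grad():
    out_scripted = scripted(x)
    out_frozen = frozen(x)
print(torch.allclose(out_scripted, out_frozen))
```

For a model loaded from disk, the same pattern would be model = torch.jit.freeze(torch.jit.load(path).eval()), where path points to one of the compiled files.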

Cite

@mastersthesis{AkAne2023,
title  = {On The Way of Accurate Prediction of Complex Chemical System via General Graph Neural Networks},
author = {Nianze Tao},
year   = {2023},
month  = {September},
school = {The University of Southampton},
type   = {Master's thesis},
note   = {MSc Electrochemistry and Battery Technologies 2022-23},
}
