Presented at The 20th Nano Bio Info Chemistry Symposium.
First download the compiled models (`torchscript_model.7z`) from the release and extract the folder `torchscript_model` to the same directory as `app.py`. Then you can run `$ python app.py` to launch the web app locally.
We provide a pre-trained autoencoder; prediction models trained on the MoleculeNet benchmark (including ESOL, FreeSolv, Lipo, BBBP, BACE, ClinTox, and HIV), QM9, PhotoSwitch, AqSolDB, a CMC value dataset, and a range of deep eutectic solvent (DES) properties; and 2 generation models that generate protein ligands and DES pairs, respectively.
You can download trained models from the release.
The datasets we used and provide are stored in CSV files. We provide a Python class `CSVData` in `akane2/utils/dataset.py` to handle these files, which require a header with the following tags:
- `smiles` (mandatory): the entries under this tag should be molecule SMILES strings. Multiple tags are acceptable.
- `temperature` (optional): the temperature in kelvin. Providing more than one of this tag won't cause any error, but only the last one will be accepted.
- `ratio` (optional): the molar ratio of each compound in the format `x1:x2:...:xn`. Providing more than one of this tag won't cause any error, but only the last one will be accepted.
- `value` (optional): entries under this tag should be molecular properties. Multiple tags are acceptable, and in this case you can tell `CSVData` which value(s) should be loaded by specifying `label_idx=[...]`. If a property is not defined, leave it empty and the entry will be automatically masked to `torch.inf`, telling the model that this property is unknown.
- `seq` (optional): a FASTA-style protein sequence. Providing more than one of this tag won't cause any error, but only the last one will be accepted. NOTE THAT WHEN THIS TAG IS USED, MOLECULAR PROPERTIES (IF PRESENT IN THE FILE) WILL NOT BE LOADED.
These tags do not need to be in any particular order, e.g., `smiles,value,value,ratio,smiles` and `smiles,smiles,ratio,value,value` are both okay.
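For illustration, a minimal hypothetical dataset file following these rules could look like this (the property values are made up; the empty second `value` cell in the first row will be masked to `torch.inf`):

```csv
smiles,temperature,value,value
CCO,298.15,-0.77,
c1ccccc1,298.15,-2.13,1.56
```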
The following is a guide on how to train your own model.
```python
from akane2.utils import split_dataset

split_ratio = 0.8  # you can use any training:testing ratio from 0 to 1
method = "random"  # another choice is "scaffold"
split_dataset("YOUR_DATASET.csv", split_ratio, method)
```
This will split your dataset into `YOUR_DATASET_train.csv` and `YOUR_DATASET_test.csv`.
```python
from akane2.utils import CSVData

limit = None  # you can specify how many data points you want to load, e.g., 1200
label_index = None  # see the above "Dataset format" section
train_set = CSVData("YOUR_DATASET_train.csv", limit, label_index)
test_set = CSVData("YOUR_DATASET_test.csv", limit, label_index)
```
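For example, if your file has several `value` columns, a sketch of loading at most 1200 rows and only the first property column (assuming the `label_idx` semantics described in the dataset-format section, with zero-based indexing) might look like:

```python
# a sketch: cap the number of rows and select only the first `value` column
train_set = CSVData("YOUR_DATASET_train.csv", 1200, [0])
```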
```python
from pathlib import Path

cwd = Path(__file__).parent
workdir = cwd / "YOUR_WORKDIR"  # the directory where checkpoints (if any) will be stored
logdir = cwd / "YOUR_LOG.log"  # where to write the log (you can set it to `None`)
```
We provide 2 types of models (that is where the 2 in the package name comes from): `akane2.representation.AkAne` (the whole AkAne model) and `akane2.representation.Kamome` (the independent encoder part, without latent-space regularisation, directly connected to the readout block).
- If you are only interested in property prediction or molecule classification, we recommend using only the encoder model:
```python
from akane2.representation import Kamome

num_task = 1  # number of tasks in one output, e.g., if you want to predict [HOMO, LUMO, gap] together then set `num_task = 3`
model = Kamome(num_task=num_task)  # DON'T FORGET TO SET OTHER IMPORTANT HYPERPARAMETERS
```
- If you are going to train a generative or bidirectional model, please use the whole model:
```python
from akane2.representation import AkAne

num_task = 2
label_mode = "class:2"  # see the comments in `akane2/representation.py` about how to set a proper value
model = AkAne(num_task=num_task, label_mode=label_mode)  # DON'T FORGET TO SET OTHER IMPORTANT HYPERPARAMETERS
```
IMPORTANT: regarding the hyperparameters (e.g., `num_task` and `label_mode`) that DEFINE the functionality of the model, please refer to the comments under each model in `representation.py`.
```python
import os

from akane2.utils import train, find_recent_checkpoint

os.environ["NUM_WORKER"] = "4"  # set `num_workers` of torch.utils.data.DataLoader (the default value is min(4, num_cpu_cores) if you remove this line)
chkpt = find_recent_checkpoint(workdir)  # find the latest checkpoint (if any)
mode = "predict"  # training mode; other options are "autoencoder", "classify", and "diffusion"
n_epochs = 1000  # number of training epochs
batch_size = 5  # choose a batch size that won't cause a `CUDA out of memory` error
save_every = 100  # save a checkpoint every `save_every` epochs (you can set it to `None`)
train(model, train_set, mode, n_epochs, batch_size, chkpt, logdir, workdir, save_every)
```
You will find the trained model weights `trained.pt` and (if any) checkpoint file(s) `state-xxxx.pth` under `workdir`. You can safely delete any checkpoint files you don't want. NOTE: in order to get a generative model, it is necessary to first train an autoencoder (or finetune a pre-trained autoencoder) and then train the diffusion model.
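As a minimal sketch, that two-stage recipe could reuse the `train` call from above with an AkAne model; the workdir names below are illustrative, and `pretrained` is the weight-loading method shown in the inference examples later:

```python
# a sketch of two-stage generative training, assuming `model` is an AkAne instance
ae_workdir = cwd / "AE_WORKDIR"            # illustrative name
diff_workdir = cwd / "DIFFUSION_WORKDIR"   # illustrative name
# stage 1: train (or finetune) the autoencoder
train(model, train_set, "autoencoder", n_epochs, batch_size, None, logdir, ae_workdir, save_every)
# stage 2: load the trained autoencoder weights, then train the diffusion model
model = model.pretrained(ae_workdir / "trained.pt")
train(model, train_set, "diffusion", n_epochs, batch_size, None, logdir, diff_workdir, save_every)
```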
```python
from akane2.utils import test

os.environ["INFERENCE_BATCH_SIZE"] = "20"  # set an inference batch size that won't cause a `CUDA out of memory` error (the default value is 20 if you remove this line)
mode = "prediction"  # testing mode; the other choice is "classification"
print(test(model, test_set, mode, workdir / "trained.pt", logdir))
```
```python
import matplotlib.pyplot as plt

from akane2.utils import extract_log_info

info = extract_log_info(logdir)
plt.plot(info["epoch"], info["loss"])
plt.xlabel("epoch")
plt.ylabel("MSE loss")
plt.yscale("log")
plt.show()
```
Here are some examples:
```python
import torch

from akane2.representation import AkAne, Kamome
from akane2.utils.graph import smiles2graph, gather
from akane2.utils.token import protein2vec

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

############## define the input to the encoder ##############
smiles = "FC1=CC(C(OCC)=O)=CC(F)=C1/N=N/C2=C(F)C=C(C(OCC)=O)C=C2F"
mol = gather([smiles2graph(smiles)])  # get a molecular graph from SMILES
mol["node"] = mol["node"].to(device)
mol["edge"] = mol["edge"].to(device)

############## define the labels for the diffusion model ##############
with open("5lqv.fasta", "r") as f:
    fasta = f.readlines()[1]
protein_label = torch.tensor([protein2vec(fasta)], device=device)  # get embedded vectors from FASTA
class_label = torch.tensor([[1]], dtype=torch.long, device=device)

############## load models and run inference ##############
model = torch.jit.load("torchscript_model/moleculenet/freesolv.pt").to(device)  # load a compiled Kamome model
result = model(mol)
print(result)

model = torch.jit.load("torchscript_model/protein_ligand.pt").to(device)  # load a compiled generative AkAne model
result = model.generate(size=[1, 20, 1], label=protein_label)  # batch-size=1 mol-size=20 beam-size=1
print(result)

model = AkAne(num_task=2, label_mode="class:2").pretrained("model_akane/hiv_bidirectional.pt").to(device)  # load a bidirectional AkAne model from saved weights
result = model.inference(mol)
print(result)
result = model.generate(size=[1, 17, 1], label=class_label)  # batch-size=1 mol-size=17 beam-size=1
print(result)
```
- You cannot compile 2 or more AkAne models (i.e., `akane2.representation.AkAne`) into TorchScript modules together in one file. We recommend saving the compiled models beforehand and loading them via `torch.jit.load(...)`.
- Directly loading a TorchScript model, or compiling a Python model to TorchScript via `model = torch.jit.script(model)`, slows down inference by roughly 10×. We recommend freezing the TorchScript model before evaluating, by adding an additional line `model = torch.jit.freeze(model.eval())`, to eliminate the warmup.
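For instance, combining the two tips above, evaluating a saved TorchScript model might look like the following sketch (using the FreeSolv model path and the molecular graph `mol` from the examples above):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.jit.load("torchscript_model/moleculenet/freesolv.pt").to(device)
model = torch.jit.freeze(model.eval())  # freeze in eval mode to eliminate the warmup
result = model(mol)  # `mol` is the molecular graph built in the examples above
print(result)
```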
```bibtex
@mastersthesis{AkAne2023,
  title  = {On The Way of Accurate Prediction of Complex Chemical System via General Graph Neural Networks},
  author = {Nianze Tao},
  year   = {2023},
  month  = {September},
  school = {The University of Southampton},
  type   = {Master's thesis},
  note   = {MSc Electrochemistry and Battery Technologies 2022-23},
}
```