Drug property prediction (DP) aims to predict molecular properties such as toxicity and side effects, and is a key task in drug discovery.
- Supported models: MolCLR, GraphMVP, MoMu, MolFM and DeepEIK. This is an ongoing effort and we are working on growing the list further.
- Supported datasets: 8 classification datasets from MoleculeNet, i.e. BBBP, Tox21, ToxCast, SIDER, ClinTox, MUV, HIV and BACE.
- Supported splits: random split, scaffold split and random-scaffold split.
- Supported evaluation: ROC-AUC.
Download the MoleculeNet datasets here, unzip the file, and put the dataset folder under datasets/dp/. You can use the following commands from within OpenBioMed/:
wget http://snap.stanford.edu/gnn-pretrain/data/chem_dataset.zip
unzip chem_dataset.zip
mkdir -p datasets/dp
mv dataset datasets/dp/moleculenet
rm chem_dataset.zip
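As a quick sanity check, you can list the moved folder (the sub-folder names in the comment below come from the downloaded archive and are given for illustration):

# The folder should contain one sub-directory per dataset,
# e.g. bace, bbbp, clintox, hiv, muv, sider, tox21 and toxcast.
ls datasets/dp/moleculenet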
After downloading and unzipping, you should remove all the processed/ directories of the 8 datasets under datasets/dp/moleculenet/. Otherwise you will get the following error:
RuntimeError: The 'data' object was created by an older version of PyG. If this error occurred while loading an already existing dataset, remove the 'processed/' directory in the dataset's root folder and try again.
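For example, assuming the datasets were moved to datasets/dp/moleculenet/ as above, the following command removes all processed/ directories in one go:

# Remove the pre-built PyG caches so they are regenerated with your PyG version
find datasets/dp/moleculenet -type d -name processed -prune -exec rm -rf {} +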
To reproduce DeepEIK, you should download PubMedBERT (uncased) from Hugging Face and put the checkpoint under ckpts/text_ckpts/.
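One possible way to fetch it is via git-lfs (the Hugging Face model id and the target folder name below are assumptions; adjust them to whatever your DeepEIK configuration file expects):

git lfs install
git clone https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext ckpts/text_ckpts/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext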
To reproduce or fine-tune MolCLR, MoMu and GraphMVP with their pretrained checkpoints, you can download the checkpoints from:
MolCLR: https://github.com/yuyangw/MolCLR
MoMu: https://github.com/ddz16/MoleculePrediction
GraphMVP: https://github.com/chao1224/GraphMVP
You will need to rename the checkpoints and place them at the following paths:
# MolCLR
ckpts/gnn_ckpts/molclr/model.pth
# MoMu
ckpts/fusion_ckpts/momu/MoMu-K.ckpt
# GraphMVP
ckpts/gnn_ckpts/graphmvp/pretraining_model.pth
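For example (the downloaded file names below are placeholders; substitute the actual files you downloaded):

mkdir -p ckpts/gnn_ckpts/molclr ckpts/fusion_ckpts/momu ckpts/gnn_ckpts/graphmvp
mv <molclr_checkpoint>.pth ckpts/gnn_ckpts/molclr/model.pth
mv <momu_checkpoint>.ckpt ckpts/fusion_ckpts/momu/MoMu-K.ckpt
mv <graphmvp_checkpoint>.pth ckpts/gnn_ckpts/graphmvp/pretraining_model.pth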
You can run the scripts under scripts/aidd/dp/ using bash:
scripts/aidd/dp
├── train_molclr.sh           # run MolCLR on the 8 MoleculeNet datasets
├── train_graphmvp.sh         # run GraphMVP on the 8 MoleculeNet datasets
├── train_momu.sh             # run MoMu on the 8 MoleculeNet datasets
├── train_molfm.sh            # run MolFM on the 8 MoleculeNet datasets
└── train_deepeik.sh          # run DeepEIK on the 8 MoleculeNet datasets
Example:
bash scripts/aidd/dp/train_molfm.sh cuda:0 # switch to your own CUDA device or cpu
You can also modify the scripts or directly use the following command:
python open_biomed/tasks/mol_task/dp.py \
[--device DEVICE] \ # gpu device id
[--mode MODE] \ # training mode, train: train-test
[--config_path CONFIG_PATH] \ # configuration file, see configs/dp/ for more details
[--dataset DATASET] \ # dataset collection, only MoleculeNet is supported now
[--dataset_path DATASET_PATH] \ # path to the datasets
[--dataset_name DATASET_NAME] \ # name of the specific dataset, e.g. BBBP
[--init_checkpoint INIT_CHECKPOINT] \ # checkpoint path used for efficient validation
[--param_key PARAM_KEY] \ # key of the checkpoint dict that contains model parameters
[--output_path OUTPUT_PATH] \ # path where checkpoints are saved during training
[--num_workers NUM_WORKERS] \ # number of workers when loading data
[--patience PATIENCE] \ # number of tolerant epochs for early-stopping
[--weight_decay WEIGHT_DECAY] \ # weight decay, default is 1e-5
[--lr LR] \ # learning rate, default is 1e-3
[--batch_size BATCH_SIZE] \ # batch size, default is 128
[--epochs EPOCHS] \ # number of training epochs
[--logging_steps LOGGING_STEPS] \ # steps for printing training information
[--seed SEED] \ # random seed
[--dropout DROPOUT] # dropout ratio of the DP model
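For example, a run of MolFM on BBBP might look like the following (the configuration file name and the dataset identifiers are illustrative; check configs/dp/ and the dataset loader for the exact values):

python open_biomed/tasks/mol_task/dp.py \
--device cuda:0 \
--mode train \
--config_path configs/dp/molfm.json \
--dataset MoleculeNet \
--dataset_path datasets/dp/moleculenet \
--dataset_name BBBP \
--output_path ckpts/finetune_ckpts/dp_molfm \
--num_workers 4 \
--patience 10 \
--lr 1e-3 \
--batch_size 128 \
--epochs 100 \
--seed 42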