Moloi

About

Moloi compares methods of molecular representation for classifying the activity of molecules (given in SMILES notation) in drug discovery experiments, and searches for the optimal combination of physical and structural descriptors.

Descriptors:

  • MACCS (maccs)
  • morgan/ECFP (morgan)
  • RDKit (rdkit)
  • mordred (mordred)
  • spectrophore (spectrophore)

Models:

  • KNN (knn)
  • Logistic regression (lr)
  • RandomForestClassifier (rf)
  • SVC (svc)
  • XGBClassifier (xgb)
  • Isolation Forest (if)
  • FCNN (fcnn)
  • MLP (mlp, mlp_sklearn)

Splits:

  • random
  • stratified
  • scaffold
  • cluster

Goals

  • workbench for tables of experiments with a large number of easily accessible parameters and hyperparameters
  • featurization, effective work with processed datasets, feature combinations
  • feature importance
  • feature selection
  • Seq2Seq for transfer learning

Table of Contents

  1. Results
  2. Install
  3. Usage
  4. Input
  5. Output
  6. Datasets
  7. Data config
  8. Model config
  9. Single experiment
  10. Experiments table
  11. Utilities
  12. Citation

Results

Install

with Conda

  • sh setup.sh

    or

  • Conda (https://www.anaconda.com/download/#linux)

  • conda install --file requirements

  • conda install -c conda-forge xgboost

  • conda install -c openbabel openbabel

  • conda install -c rdkit rdkit

  • conda install -c mordred-descriptor mordred

  • Python3: pip install configparser

  • Python2: pip install ConfigParser

  • pip install argparse

with Pip

  • pip install git+https://github.com/DentonJC/virtual_screening

    or

  • Packages from requirements

  • pip install xgboost

  • RDKit (https://github.com/rdkit/rdkit)

  • pip install mordred

  • Python3: pip install configparser

  • Python2: pip install ConfigParser

  • pip install argparse

Usage

usage: model data section [-h] [--select_model SELECT_MODEL]
                  [--data_config DATA_CONFIG] [--section SECTION]
                  [--load_model LOAD_MODEL]
                  [--descriptors DESCRIPTORS] [--output OUTPUT]
                  [--model_config MODEL_CONFIG] [--n_bits N_BITS]
                  [--n_cv N_CV] [--n_iter N_ITER] [--n_jobs N_JOBS]
                  [--patience PATIENCE] [--gridsearch]
                  [--metric {accuracy,roc_auc,f1,matthews}]
                  [--split_type {stratified,scaffold,random,cluster}]
                  [--split_size SPLIT_SIZE] [--targets TARGETS]
                  [--experiments_file EXPERIMENTS_FILE]

optional arguments:
-h, --help            show this help message and exit
--select_model SELECT_MODEL
                name of the model, select from list in README
--data_config DATA_CONFIG
                path to dataset config file
--section SECTION     name of section in model config file
--load_model LOAD_MODEL
                path to model .sav
--descriptors DESCRIPTORS
                descriptor of molecules
--output OUTPUT       path to output directory
--model_config MODEL_CONFIG
                path to config file
--n_bits N_BITS       number of bits in Morgan fingerprint
--n_cv N_CV           number of splits in RandomizedSearchCV
--n_iter N_ITER       number of iterations in RandomizedSearchCV
--n_jobs N_JOBS       number of jobs
--patience PATIENCE, -p PATIENCE
                patience of fit
--gridsearch, -g      use gridsearch
--metric {accuracy,roc_auc,f1,matthews}
                metric for RandomizedSearchCV
--split_type {stratified,scaffold,random,cluster}
                type of train-test split
--split_size SPLIT_SIZE     size of test and valid splits
--targets TARGETS, -t TARGETS
                set number of target column
--experiments_file EXPERIMENTS_FILE, -e EXPERIMENTS_FILE
                where to write results of experiments
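The options above can be mirrored with a small `argparse` parser. This is a minimal sketch, not the project's actual parser: the flag names, short options and choices are copied from the help text, while every type and default below is an assumption made for illustration.

```python
import argparse


def build_parser():
    """Build a parser for the options listed in the help text.

    Flag names and choices follow the help output; all types and
    defaults are assumptions, not values taken from the project.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--select_model", type=str, help="name of the model")
    parser.add_argument("--data_config", type=str, help="path to dataset config file")
    parser.add_argument("--section", type=str, help="name of section in model config file")
    parser.add_argument("--load_model", type=str, help="path to model .sav")
    parser.add_argument("--descriptors", type=str, help="descriptor of molecules")
    parser.add_argument("--output", type=str, help="path to output directory")
    parser.add_argument("--model_config", type=str, help="path to config file")
    parser.add_argument("--n_bits", type=int, default=2048)    # assumed default
    parser.add_argument("--n_cv", type=int, default=5)         # assumed default
    parser.add_argument("--n_iter", type=int, default=10)      # assumed default
    parser.add_argument("--n_jobs", type=int, default=1)       # assumed default
    parser.add_argument("--patience", "-p", type=int, default=100)  # assumed default
    parser.add_argument("--gridsearch", "-g", action="store_true")
    parser.add_argument("--metric", default="accuracy",        # assumed default
                        choices=["accuracy", "roc_auc", "f1", "matthews"])
    parser.add_argument("--split_type", default="stratified",  # assumed default
                        choices=["stratified", "scaffold", "random", "cluster"])
    parser.add_argument("--split_size", type=float, default=0.1)  # assumed default
    parser.add_argument("--targets", "-t", type=int, default=0)   # assumed type and default
    parser.add_argument("--experiments_file", "-e", type=str)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--select_model", "rf", "-g", "--metric", "roc_auc", "--split_type", "scaffold", "-t", "0"]
)
print(args.select_model, args.gridsearch, args.metric, args.split_type, args.targets)
```

Using `choices` for `--metric` and `--split_type` makes the parser reject values outside the sets shown in the help text before any data is loaded.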

Single experiment

  1. Create or use a script from /moloi/bin/
  2. Run script.py with Python

Processing the experiment table

Attention! Nested parallelization! Use n_jobs = -1 in only one of the two places:

  • Default set:
      • run.py: n_jobs = 1
      • experiments_table.csv: n_jobs = -1
  • Only for evaluation:
      • run.py: n_jobs = -1
      • experiments_table.csv: n_jobs = 1

  1. RDKit and Mordred descriptors cannot be computed for some molecules. If you want to use them in later experiments, run the first experiment with RDKit and Mordred descriptors so that the lost molecules are excluded from the dataset and from the other descriptors.

  2. Fill in the table with the parameters of the experiments (examples in /etc; an empty cell means False) and save it as UTF-8.

  3. Run run.py with Python.

  4. Experiments are performed line by line, taking parameters from the filled columns and writing output to the result columns.
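The line-by-line processing of the table could look roughly like the following. This is a sketch using only the standard library, not the code in run.py: the column names (`select_model`, `descriptors`, `gridsearch`, `status`) and the `run_experiment` placeholder are hypothetical.

```python
import csv
import io

PARAM_COLUMNS = ["select_model", "descriptors", "gridsearch"]  # hypothetical names
RESULT_COLUMNS = ["status"]                                    # hypothetical name


def row_to_params(row):
    """Collect the parameter cells of one row; an empty cell is read as False."""
    return {name: (row[name] if row[name] != "" else False)
            for name in PARAM_COLUMNS}


def run_experiment(params):
    """Placeholder for the real training run; it only reports completion."""
    return {"status": "done"}


def process_table(text):
    """Run the experiments row by row and write results into the result columns."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        result = run_experiment(row_to_params(row))
        for name in RESULT_COLUMNS:
            row[name] = result[name]
    return rows


# Two experiments; the empty gridsearch cell in the first row means False.
table = (
    "select_model,descriptors,gridsearch,status\n"
    "rf,maccs,,\n"
    "knn,morgan,True,\n"
)
for row in process_table(table):
    print(row)
```

Keeping parameters and results in separate column groups lets the same file serve as both the experiment plan and the record of what has been run.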

Example input

python moloi/moloi.py --model_config '/data/model_configs/configs.ini' --descriptors "['rdkit', 'morgan', 'mordred', 'maccs']" --n_bits 2048 --n_cv 5 -p 100 -g --n_iter 300 --metric 'roc_auc' --split_type 'scaffold' --split_size 0.1 --select_model 'rf' --data_config '/data/data_configs/bace.ini' --section 'RF' -e 'etc/experiments_bace.csv' -t 0
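When the `--descriptors` value is passed as one quoted string, it has to be turned back into a list of names. One way to do this, assuming the value is written as a Python list literal, is `ast.literal_eval`. Whether the project parses the value this way is not stated here, and `parse_descriptors` is a hypothetical helper.

```python
import ast

# Descriptor names from the Descriptors list at the top of this README.
KNOWN_DESCRIPTORS = {"maccs", "morgan", "rdkit", "mordred", "spectrophore"}


def parse_descriptors(value):
    """Parse a string such as "['rdkit', 'morgan']" into a list of names."""
    parsed = ast.literal_eval(value)   # evaluates literals only, never arbitrary code
    if isinstance(parsed, str):        # allow a single quoted name as well
        parsed = [parsed]
    unknown = [name for name in parsed if name not in KNOWN_DESCRIPTORS]
    if unknown:
        raise ValueError("unknown descriptors: %s" % unknown)
    return list(parsed)


print(parse_descriptors("['rdkit', 'morgan', 'mordred', 'maccs']"))
```

Without the surrounding double quotes, a shell would split the bracketed list at the spaces and strip the inner quotes, so the script would not receive it as a single argument.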

Example output

Script adderss: run.py
Descriptors: ['rdkit', 'morgan', 'mordred', 'maccs']
n_bits: 2048
Config file: /data/model_configs/configs.ini
Section: RF
Grid search
Load train data
Load test data
Load val data
Data loaded
x_train shape: (1207, 4239)
x_test shape: (152, 4239)
x_val shape: (154, 4239)
y_train shape: (1207, 1)
y_test shape: (152, 1)
y_val shape: (154, 1)
GRID SEARCH
GRIDSEARCH FIT
MODEL FIT
EVALUATE
Accuracy test: 70.39%
0:07:37.644208
Creating report
Report complete, you can see it in the results folder
Results path: /tmp/2018-05-27_15:45:04_RF_['rdkit','morgan','mordred','maccs']70.395/
Done

Report

After the first experiment is run, the /tmp folder is created with one subfolder per experiment. Each experiment folder contains:

  1. models/: copy of the folder with models
  2. run.py: copy of the experiment script
  3. results/: folder with model checkpoints (if Keras model)
  4. log: log of the experiment
  5. model.sav: the model
  6. addresses: text file with the address of the model; its content makes it possible to load the model in the experiment table
  7. n_cv: text file with cross-validation indices
  8. gridsearch.csv: history of gridsearch (if gridsearch)
  9. y_pred_test.csv, y_pred_val.csv: predicted test and validation values
  10. img/: ROC AUC plot (if possible)
  11. img/: feature importance plot
  12. img/: gridsearch plots (one picture for each hyperparameter)
  13. img/: result plots; three t-SNE plots showing correctly and incorrectly classified points (for all classes, for the positive class and for the negative class)
  14. report 70.39.pdf (accuracy in name): report with information about the experiment
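A quick way to see which of these entries a given experiment folder actually contains is sketched below. It is not part of the project: `missing_artifacts` is a hypothetical helper, and the names are copied from the list above. The report PDF is left out because its file name includes the accuracy and so differs between runs.

```python
import os
import tempfile

# Entry names taken from the list above (the report PDF is omitted on purpose).
EXPECTED = ["models", "run.py", "results", "log", "model.sav", "addresses",
            "n_cv", "gridsearch.csv", "y_pred_test.csv", "y_pred_val.csv", "img"]


def missing_artifacts(experiment_dir):
    """Return the expected entries that are absent from experiment_dir."""
    present = set(os.listdir(experiment_dir))
    return [name for name in EXPECTED if name not in present]


# Demonstration on a throwaway folder that holds only two of the entries.
with tempfile.TemporaryDirectory() as demo:
    open(os.path.join(demo, "log"), "w").close()
    os.mkdir(os.path.join(demo, "img"))
    print(missing_artifacts(demo))
```

Some entries are conditional (results/ for Keras models, gridsearch.csv when gridsearch is used), so a missing name is not necessarily an error.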

Citation

  • Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
  • Stéfan van der Walt, S. Chris Colbert and Gaël Varoquaux. The NumPy Array: A Structure for Efficient Numerical Computation, Computing in Science & Engineering, 13, 22-30 (2011), DOI:10.1109/MCSE.2011.37
  • Travis E. Oliphant. Python for Scientific Computing, Computing in Science & Engineering, 9, 10-20 (2007), DOI:10.1109/MCSE.2007.58
  • K. Jarrod Millman and Michael Aivazis. Python for Scientists and Engineers, Computing in Science & Engineering, 13, 9-12 (2011), DOI:10.1109/MCSE.2011.36
  • Fernando Pérez and Brian E. Granger. IPython: A System for Interactive Scientific Computing, Computing in Science & Engineering, 9, 21-29 (2007), DOI:10.1109/MCSE.2007.53
  • John D. Hunter. Matplotlib: A 2D Graphics Environment, Computing in Science & Engineering, 9, 90-95 (2007), DOI:10.1109/MCSE.2007.55
  • Wes McKinney. Data Structures for Statistical Computing in Python, Proceedings of the 9th Python in Science Conference, 51-56 (2010)
  • O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.
  • https://github.com/gmum/ananas/blob/master/fingerprints/_desc_rdkit.py
  • RDKit: Open-source cheminformatics; http://www.rdkit.org
  • Keras (2015), Chollet et al., https://github.com/fchollet/keras