SMPrecursorPredictor

An ML pipeline for predicting the starting substances (precursors) of specialised metabolites.

Installation

Manually

Clone the repository and move into the directory:

git clone
cd SMPrecursorPredictor

Create a conda environment and activate it:

conda create -n sm_precursor_predictor python=3.10
conda activate sm_precursor_predictor

Install the dependencies:

pip install -r requirements.txt

Install the package:

pip install .

PyPI

Create a conda environment, activate it, and install the package from PyPI:

conda create -n sm_precursor_predictor python=3.10
conda activate sm_precursor_predictor
pip install SMPrecursorPrediction

Making predictions

Models available:

Layered FP + Low Variance FS + Ridge Classifier
Morgan FP + Ridge Classifier

from sm_precursor_predictor import predict_precursors

# Each list entry is one SMILES string; the first is split across two adjacent string literals.
precursors = predict_precursors(
    [
        "[H][C@]89CN(CCc1c([nH]c2ccccc12)[C@@](C(=O)OC)(c3cc4c(cc3OC)N(C)[C@@]5([H])[C@@]"
        "(O)(C(=O)OC)[C@H](OC(C)=O)[C@]7(CC)C=CCN6CC[C@]45[C@@]67[H])C8)C[C@](O)(CC)C9",
        "COC1=C(C=CC(=C1)C2=C(C(=O)C3=C(C=C(C=C3O2)O)O)O[C@H]4[C@@H]([C@H]([C@H]([C@H](O4)CO)O)O)O)O",
    ],
    model="Layered FP + Low Variance FS + Ridge Classifier",
)
print(precursors)

Alternatively, read a CSV file with a column of SMILES and a column of IDs, and save the predictions to a CSV file:

from sm_precursor_predictor import predict_from_csv
predictions = predict_from_csv("path_to_csv", 
                               smiles_field="SMILES", 
                               ids_field="ID",
                               model="Layered FP + Low Variance FS + Ridge Classifier")
predictions.to_csv("path_to_save_predictions.csv")
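The input file for the call above needs one column of SMILES and one column of IDs. A minimal sketch of writing such a file with the standard library follows. The column names match the `smiles_field` and `ids_field` values used above; the file name `compounds.csv` and the IDs are hypothetical placeholders.

```python
import csv

# Hypothetical IDs; the first SMILES is the linalool example used later in this README.
rows = [
    {"ID": "compound_1", "SMILES": "CC(=CCCC(C)(C=C)O)C"},
    {"ID": "compound_2", "SMILES": "CCO"},
]


def write_input_csv(path, rows):
    """Write rows to a CSV file with an ID column and a SMILES column."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["ID", "SMILES"])
        writer.writeheader()
        writer.writerows(rows)


# "compounds.csv" is a placeholder path; pass the same path to predict_from_csv.
write_input_csv("compounds.csv", rows)
```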

Making and explaining predictions

This is only possible with one model: Morgan FP + Ridge Classifier.

Example with linalool:

from sm_precursor_predictor import get_prediction_and_explanation

prediction, images, plots = get_prediction_and_explanation(smiles="CC(=CCCC(C)(C=C)O)C", threshold=0.20)

prediction

['Geranyl diphosphate']

images[0]

Methods

Data

The final dataset can be found in final_dataset.csv. Predictions for the LotusDB compounds can be found in predictions.

The exploration of the dataset can be found in dataset_analysis.ipynb.

AutoML

AutoML was run using Docker. To reproduce the run with Docker, see the following files:

Dockerfile
run.sh

Alternatively, to run AutoML with a Python script, see the following file:

train_models.py

Analysis of the results

For the analysis of the results refer to the following files:

Main results
Alkaloids dataset - Eguchi et al. 2019
Challenging datasets
For checking the model interpretability - Monoterpenoid indole alkaloids and others.

The results for the MGCNN can be found at this link.

Metrics

The formula for mF1 is defined as:

$$ \text{mF1} = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \cdot \text{Precision}_i \cdot \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i} $$

The formula for mRecall is defined as:

$$ \text{mRecall} = \frac{1}{N} \sum_{i=1}^{N} \frac{\text{True Positives}_i}{\text{True Positives}_i + \text{False Negatives}_i} $$

The formula for mPrecision is defined as:

$$ \text{mPrecision} = \frac{1}{N} \sum_{i=1}^{N} \frac{\text{True Positives}_i}{\text{True Positives}_i + \text{False Positives}_i} $$

where $N$ denotes the total number of classes, and $\text{Precision}_i$ and $\text{Recall}_i$ are the precision and recall for class $i$. $\text{True Positives}_i$ are the correct positive predictions for class $i$, $\text{False Negatives}_i$ are the samples of class $i$ that the model missed, and $\text{False Positives}_i$ are the samples wrongly predicted as class $i$.
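As an illustration of the three formulas, here is a minimal pure-Python sketch. It assumes labels are given as 0/1 indicator rows (one row per compound, one column per class), and it assumes that a class with an empty denominator scores 0; neither choice is stated in this README. The function name `macro_metrics` is hypothetical.

```python
def macro_metrics(y_true, y_pred):
    """Compute mPrecision, mRecall and mF1 by averaging per-class scores.

    y_true and y_pred are lists of equal-length 0/1 lists:
    one row per sample, one column per class.
    """
    n_classes = len(y_true[0])
    precisions, recalls, f1s = [], [], []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 0 and p[c] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[c] == 1 and p[c] == 0)
        # Assumption: an empty denominator yields a score of 0 for that class.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return {
        "mPrecision": sum(precisions) / n_classes,
        "mRecall": sum(recalls) / n_classes,
        "mF1": sum(f1s) / n_classes,
    }


# Toy example with two classes and four samples.
y_true = [[1, 0], [1, 1], [0, 1], [0, 0]]
y_pred = [[1, 0], [0, 1], [1, 1], [0, 0]]
print(macro_metrics(y_true, y_pred))
```

In the toy example, class 0 has one true positive, one false positive and one false negative, while class 1 is predicted perfectly, so each averaged metric is (0.5 + 1.0) / 2.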

Similarity matrix and t-SNE generation

A similarity matrix between all the Morgan fingerprints of the compounds in the whole dataset was generated to assess their similarity. The similarity function was the Tanimoto similarity index. A t-distributed Stochastic Neighbor Embedding (t-SNE) was created from this matrix to reduce dimensionality and for visualization.
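The Tanimoto step can be sketched in a few lines. This is an illustration only: it represents each fingerprint as a set of on-bit indices (in practice these would come from a cheminformatics toolkit), the toy fingerprints are made up, and the function names are hypothetical. The t-SNE step is not shown.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    # Assumption: two empty fingerprints are treated as identical.
    return len(a & b) / union if union else 1.0


def similarity_matrix(fingerprints):
    """Build the symmetric pairwise Tanimoto similarity matrix."""
    n = len(fingerprints)
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            score = tanimoto(fingerprints[i], fingerprints[j])
            matrix[i][j] = matrix[j][i] = score
    return matrix


# Made-up fingerprints: the first two share two of four distinct bits.
toy_fingerprints = [{1, 2, 3}, {2, 3, 4}, {7}]
for row in similarity_matrix(toy_fingerprints):
    print(row)
```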

TPE algorithm

The TPE algorithm optimizes hyperparameter selection by modelling how likely a hyperparameter configuration is to perform well, and by prioritizing the regions that look promising according to an objective function $f(x)$, where $x$ represents the hyperparameters. Here the objective is maximized. The algorithm splits the evaluated configurations into two groups using a threshold $\gamma$ and models each group with a distribution: $l(x)$ for configurations that led to higher (better) objective values and $g(x)$ for those that led to lower (worse) values. It then preferentially samples new hyperparameters from $l(x)$, the distribution associated with better performance.
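The idea can be illustrated with a simplified one-dimensional sketch. It is not the implementation used in this project: it assumes $\gamma$ is the fraction of trials counted as "good", models $l(x)$ and $g(x)$ with fixed-bandwidth Gaussian kernel density estimates, and picks the candidate with the highest ratio $l(x)/g(x)$. All function names and default values are hypothetical.

```python
import math
import random


def split_trials(trials, gamma=0.25):
    """Split (x, score) trials into good and bad x values (higher score is better)."""
    if len(trials) < 2:
        raise ValueError("need at least two trials to form both groups")
    ranked = sorted(trials, key=lambda t: t[1], reverse=True)
    # Keep at least one trial in each group.
    n_good = min(max(1, math.ceil(gamma * len(ranked))), len(ranked) - 1)
    good = [x for x, _ in ranked[:n_good]]
    bad = [x for x, _ in ranked[n_good:]]
    return good, bad


def gaussian_kde(points, bandwidth):
    """Return a density function built from Gaussian kernels centred on points."""
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

    return density


def suggest(trials, gamma=0.25, n_candidates=50, bandwidth=0.1, rng=None):
    """Sample candidates from l(x) and return the one maximizing l(x) / g(x)."""
    rng = rng or random.Random(0)
    good, bad = split_trials(trials, gamma)
    l, g = gaussian_kde(good, bandwidth), gaussian_kde(bad, bandwidth)
    # Sampling from the KDE l(x): pick a good point, then add Gaussian noise.
    candidates = [rng.gauss(rng.choice(good), bandwidth) for _ in range(n_candidates)]
    return max(candidates, key=lambda x: l(x) / (g(x) + 1e-12))


# Toy objective with its maximum at x = 0.7, evaluated on a coarse grid.
trials = [(i / 10, -((i / 10 - 0.7) ** 2)) for i in range(11)]
print(suggest(trials))
```

On the toy objective the suggested value lands near 0.7, because the best trials cluster there while the worse ones lie on either side.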

Statistical methods

Models are compared with the Wilcoxon Signed-Rank test. Given metric values for two models across $n$ tasks, $m_{1i}$ and $m_{2i}$, calculate the difference $d_i = m_{1i} - m_{2i}$ for each task $i$. Discard the differences with $d_i = 0$, rank the remaining absolute differences $|d_i|$ from smallest to largest, and assign ranks $R_i$. Then compute $W^+ = \sum_{d_i > 0} R_i$ and $W^- = \sum_{d_i < 0} R_i$; the test statistic is $W = \min(W^+, W^-)$. The p-value is the probability, under the reference distribution of the statistic given the null hypothesis, of observing a value $W_{\text{ref}}$ as extreme as or more extreme than the observed $W$. The null hypothesis is that there is no difference between the metric values of the two models. A p-value lower than 0.05 is considered sufficient to reject the null hypothesis.
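The procedure above can be sketched in pure Python. This is an illustration, not the code used for the reported results: it assumes tied absolute differences receive their average rank, and it computes an exact two-sided p-value by enumerating every sign assignment, which is only practical for a small number of non-zero differences. The function names and the toy scores are hypothetical.

```python
from itertools import product


def average_ranks(values):
    """Rank values in ascending order (1-based); ties receive the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank(m1, m2):
    """Return (W, exact two-sided p-value) for paired metric values."""
    diffs = [a - b for a, b in zip(m1, m2) if a != b]  # zero differences are ignored
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    w = min(w_plus, w_minus)
    # Under the null hypothesis every sign pattern is equally likely, so the
    # p-value is the share of patterns whose statistic is at most the observed W.
    total = sum(ranks)
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        plus = sum(r for s, r in zip(signs, ranks) if s)
        if min(plus, total - plus) <= w:
            extreme += 1
    return w, extreme / 2 ** len(ranks)


# Made-up scores (in percent) for two models on six tasks.
model_1 = [85, 90, 78, 92, 88, 76]
model_2 = [80, 91, 75, 85, 82, 76]
print(wilcoxon_signed_rank(model_1, model_2))
```

For real-valued metrics, note that exact equality checks on floating-point differences can miss ties; rounding the metric values first is one way to avoid that.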

In the context of cross-validation, given two models evaluated across $n$ tasks and $r$ folds, resulting in performance metrics $m_{Aij}$ and $m_{Bij}$ for models $A$ and $B$ respectively, for each task $i$ and fold $j$, perform the following steps: calculate the differences $d_{ij} = m_{Aij} - m_{Bij}$, rank the absolute differences $|d_{ij}|$, and apply the Wilcoxon Signed-Rank test as explained above.

Results

AutoML results

The figures below show the automatic machine learning model results. The first figure shows the features used during the optimization and the mF1 score on the validation set for each trial. Morgan and layered fingerprints (FP) stood out as the best features.

The figure below shows the models trained and the mF1 scores obtained by each model on the validation set. The ridge classifiers stood out unequivocally.

The figure below shows the F1 score of each model for each precursor.
