An ML pipeline for predicting the starting substances (precursors) of specialised metabolites.
- Clone the repository and move into the directory:
git clone
cd SMPrecursorPredictor
- Create a conda environment and activate it:
conda create -n sm_precursor_predictor python=3.10
conda activate sm_precursor_predictor
- Install the dependencies:
pip install -r requirements.txt
- Install the package:
pip install .
- Alternatively, to install the package with pip instead of from source, create a conda environment and activate it:
conda create -n sm_precursor_predictor python=3.10
conda activate sm_precursor_predictor
pip install SMPrecursorPrediction
Models available:
- Layered FP + Low Variance FS + Ridge Classifier
- Morgan FP + Ridge Classifier
from sm_precursor_predictor import predict_precursors

# Predict the precursors of two compounds given as SMILES strings.
precursors = predict_precursors(
    [
        "[H][C@]89CN(CCc1c([nH]c2ccccc12)[C@@](C(=O)OC)(c3cc4c(cc3OC)N(C)[C@@]5([H])[C@@]"
        "(O)(C(=O)OC)[C@H](OC(C)=O)[C@]7(CC)C=CCN6CC[C@]45[C@@]67[H])C8)C[C@](O)(CC)C9",
        "COC1=C(C=CC(=C1)C2=C(C(=O)C3=C(C=C(C=C3O2)O)O)O[C@H]4[C@@H]([C@H]([C@H]([C@H](O4)CO)O)O)O)O",
    ],
    model="Layered FP + Low Variance FS + Ridge Classifier",
)
print(precursors)
Alternatively, read a CSV file with a column of SMILES and a column of IDs, and save the predictions to a CSV file:
from sm_precursor_predictor import predict_from_csv

predictions = predict_from_csv(
    "path_to_csv",
    smiles_field="SMILES",
    ids_field="ID",
    model="Layered FP + Low Variance FS + Ridge Classifier",
)
predictions.to_csv("path_to_save_predictions.csv")
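The input file can be prepared with any tool. As a minimal sketch using only the Python standard library, the snippet below writes and re-reads a two-column CSV whose headers match the `smiles_field` and `ids_field` values used above. The file name, the helper functions, and the IDs are hypothetical; the SMILES strings are taken from the examples in this README.

```python
import csv
import os
import tempfile

# Hypothetical IDs; the SMILES strings come from the examples in this README.
rows = [
    {"ID": "compound_1", "SMILES": "CC(=CCCC(C)(C=C)O)C"},
    {"ID": "compound_2",
     "SMILES": "COC1=C(C=CC(=C1)C2=C(C(=O)C3=C(C=C(C=C3O2)O)O)O[C@H]4[C@@H]([C@H]([C@H]([C@H](O4)CO)O)O)O)O"},
]


def write_input_csv(path, rows):
    """Write the rows to a CSV file with an ID column and a SMILES column."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["ID", "SMILES"])
        writer.writeheader()
        writer.writerows(rows)


def read_input_csv(path):
    """Read the CSV back as a list of dictionaries, one per compound."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Round trip in a temporary directory to check the file layout.
path = os.path.join(tempfile.mkdtemp(), "compounds.csv")
write_input_csv(path, rows)
loaded = read_input_csv(path)
print(loaded[0]["ID"], loaded[0]["SMILES"])
```

A file written this way could then be passed as the first argument of `predict_from_csv` with `smiles_field="SMILES"` and `ids_field="ID"`.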
Explanations of predictions are only available with one model: Morgan FP + Ridge Classifier.
Example with linalool:
from sm_precursor_predictor import get_prediction_and_explanation
prediction, images, plots = get_prediction_and_explanation(smiles="CC(=CCCC(C)(C=C)O)C", threshold=0.20)
prediction
['Geranyl diphosphate']
images[0]
The final dataset can be found in final_dataset.csv. The predictions for the LotusDB compounds can be found at predictions.
The exploration of the dataset can be found at dataset_analysis.ipynb.
The AutoML was run using Docker. To run it with Docker, consider the following files:
Alternatively, if you would rather run the AutoML with a Python script, consider the following:
For the analysis of the results refer to the following files:
- Main results
- Alkaloids dataset - Eguchi et al. 2019
- Challenging datasets
- For checking the model interpretability - Monoterpenoid indole alkaloids and others.
The results for the MGCNN can be found at this link.
The formula for mF1 is defined as:
The formula for mRecall is defined as:
The formula for mPrecision is defined as:
where
To assess how similar the compounds are to one another, a pairwise similarity matrix was computed over the Morgan fingerprints of all compounds in the dataset, using the Tanimoto similarity index. A t-distributed Stochastic Neighbor Embedding (t-SNE) was then computed from this matrix to reduce dimensionality for visualization.
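The pairwise Tanimoto computation can be sketched in plain Python as follows. This is an illustration only, not the code used for the analysis: fingerprints are represented as sets of on-bit indices, the toy fingerprints are hypothetical rather than real Morgan fingerprints, and treating two empty fingerprints as identical is an assumption. The t-SNE step is omitted.

```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 1.0  # assumption: two empty fingerprints count as identical
    return len(bits_a & bits_b) / union


def similarity_matrix(fingerprints):
    """Symmetric matrix of pairwise Tanimoto similarities (diagonal is 1.0)."""
    n = len(fingerprints)
    matrix = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        sim = tanimoto(fingerprints[i], fingerprints[j])
        matrix[i][j] = matrix[j][i] = sim
    return matrix


# Toy fingerprints: hypothetical on-bit indices, not real Morgan fingerprints.
toy_fps = [
    frozenset({1, 5, 9, 12}),
    frozenset({1, 5, 9, 40}),
    frozenset({2, 7}),
]
matrix = similarity_matrix(toy_fps)
for row in matrix:
    print(row)
```

The first two toy fingerprints share three of five distinct bits, so their similarity is 3/5; the third shares none with either, so its off-diagonal entries are 0.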
The Tree-structured Parzen Estimator (TPE) algorithm optimizes hyperparameter selection by modelling the probability that a hyperparameter configuration is effective, prioritizing the regions of the search space that show promise according to an objective function.
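To make the idea concrete, here is a minimal one-dimensional sketch of a TPE-style loop in plain Python. It is a simplified illustration, not the implementation used by the AutoML: the function names, the fixed kernel bandwidth, the clipping of candidates to the bounds, the default settings, and the toy objective are all assumptions. Past trials are split into a "good" fraction `gamma` and the rest, a density is fitted to each group, candidates are drawn around the good trials, and the candidate with the highest ratio of good density to bad density is evaluated next.

```python
import math
import random


def kde(x, points, bandwidth):
    """Gaussian kernel density estimate at x; 0.0 when there are no points."""
    if not points:
        return 0.0
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm


def tpe_minimize(objective, low, high, n_trials=60, n_startup=10,
                 gamma=0.25, n_candidates=24, seed=0):
    """Minimise a 1-D objective on [low, high] with a simplified TPE loop."""
    rng = random.Random(seed)
    bandwidth = (high - low) / 10.0  # fixed bandwidth: a simplification
    trials = []  # (x, loss) pairs observed so far
    for step in range(n_trials):
        if step < max(1, n_startup):
            x = rng.uniform(low, high)  # random warm-up trials
        else:
            ranked = sorted(trials, key=lambda pair: pair[1])
            n_good = max(1, int(gamma * len(ranked)))
            good = [px for px, _ in ranked[:n_good]]  # best gamma fraction
            bad = [px for px, _ in ranked[n_good:]]   # all remaining trials
            # Draw candidates from l(x): a Gaussian around a random good
            # point, clipped to the search bounds.
            candidates = [
                min(high, max(low, rng.gauss(rng.choice(good), bandwidth)))
                for _ in range(n_candidates)
            ]
            # Keep the candidate with the largest density ratio l(x) / g(x).
            x = max(
                candidates,
                key=lambda c: kde(c, good, bandwidth)
                / (kde(c, bad, bandwidth) + 1e-12),
            )
        trials.append((x, objective(x)))
    best = min(trials, key=lambda pair: pair[1])
    return best, trials


# Hypothetical objective with its minimum at x = 2.
best, trials = tpe_minimize(lambda x: (x - 2.0) ** 2, low=-5.0, high=5.0)
print(best)
```

Library implementations handle many hyperparameters at once, including categorical ones, and fit the densities more carefully; the sketch keeps only the split, model, sample, and rank steps.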
Given metric values for two models across
In the context of cross-validation, given two models evaluated across
The figures below show the results of the automated machine learning (AutoML) runs. The first figure shows the features used during the optimization and the mF1 score on the validation set for each trial. Morgan and layered fingerprints (FP) stood out as the best features.
The figure below shows the models trained and the mF1 scores obtained by each model on the validation set. The ridge classifiers stood out unequivocally.
The figure below shows the F1 scores for each precursor and model.
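For readers who want to reproduce a per-precursor breakdown on their own predictions, the sketch below computes an F1 score per label for multi-label data in plain Python, using F1 = 2TP / (2TP + FP + FN). The function name and the toy data are hypothetical ("Precursor B" is a placeholder label), and returning 0.0 for a label with no true or predicted occurrences is an assumption. The unweighted mean over labels is shown only as one common summary; it is not taken from this README and may differ from the mF1 definition used here.

```python
def per_label_f1(y_true, y_pred, labels):
    """Return {label: F1} for multi-label data given as lists of label sets."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        denom = 2 * tp + fp + fn
        # Assumption: a label that never occurs and is never predicted scores 0.0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy data: four compounds, two labels ("Precursor B" is a placeholder).
labels = ["Geranyl diphosphate", "Precursor B"]
y_true = [
    {"Geranyl diphosphate"},
    {"Geranyl diphosphate", "Precursor B"},
    {"Precursor B"},
    set(),
]
y_pred = [
    {"Geranyl diphosphate"},
    {"Geranyl diphosphate"},
    {"Geranyl diphosphate", "Precursor B"},
    set(),
]

scores = per_label_f1(y_true, y_pred, labels)
macro_f1 = sum(scores.values()) / len(scores)  # unweighted mean over labels
print(scores, macro_f1)
```

In the toy data, "Geranyl diphosphate" has two true positives and one false positive (F1 = 4/5), while "Precursor B" has one true positive and one false negative (F1 = 2/3).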