This document outlines the process to train SynNet from scratch, step by step.
You can use any set of reaction templates and building blocks, but we will illustrate the process with the Hartenfeller-Button reaction templates and Enamine building blocks.
Note: This project depends on a lot of exact filenames. For example, one script will save to a file, and the next will read that file for further processing. It is not a perfect approach; we are open to feedback.
Let's start.
- Prepare reaction templates and building blocks.
Parse and canonicalize SMILES from the .sdf file from enamine.net.

    python scripts/00-extract-smiles-from-sdf.py \
        --input-file="data/assets/building-blocks/enamine-us.sdf" \
        --output-file="data/assets/building-blocks/enamine-us.csv.gz"

💡 If you have your own data, save the molecules as SMILES in a dataframe column "SMILES".
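If your molecules do not come from an .sdf file, a file with the expected column can be written with the standard library alone. This is a minimal sketch, assuming the downstream scripts read a gzip-compressed CSV whose header contains a SMILES column; the file name my-building-blocks.csv.gz, the helper names, and the three example molecules are illustrative and not part of the project.

```python
import csv
import gzip
import os
import tempfile

# Illustrative molecules; replace with your own SMILES strings.
smiles = ["CCO", "c1ccccc1", "CC(=O)O"]

def write_smiles_csv(path, smiles_list):
    """Write one SMILES per row under a single "SMILES" header."""
    with gzip.open(path, "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["SMILES"])
        writer.writerows([s] for s in smiles_list)

def read_smiles_csv(path):
    """Read the "SMILES" column back as a list of strings."""
    with gzip.open(path, "rt", newline="") as f:
        return [row["SMILES"] for row in csv.DictReader(f)]

# A temporary directory keeps the example from touching your data folder.
out_path = os.path.join(tempfile.mkdtemp(), "my-building-blocks.csv.gz")
write_smiles_csv(out_path, smiles)
```

Reading the file back with read_smiles_csv(out_path) is a quick way to confirm the column name before passing the file to the filtering script.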
- Filter building blocks + match to reaction templates
We filter the building blocks based on two criteria:
- Building block matches at least one reaction template
- Building block passes some heuristics.
🔈 For the enamine-us & Hartenfeller-Button combination, 138 (0.07%) molecules do not pass the heuristics, and 17358 (8.89%) do not match any reaction template.
In a first step, we filter all building blocks. In a second step, we save
- all matched building blocks
- and a collection of Reactions with their available building blocks as class attributes.

    python scripts/01-filter-building-blocks.py \
        --building-blocks-file "data/assets/building-blocks/enamine-us.csv.gz" \
        --rxn-templates-file "data/assets/reaction-templates/hb.txt" \
        --output-bblock-file "data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
        --output-rxns-collection-file "data/pre-process/building-blocks-rxns/rxns-hb-enamine-us.json.gz" \
        --verbose

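To make the two outputs concrete, the sketch below matches building blocks to templates and stores each reaction together with its available building blocks in a compressed JSON file. It is not the project's implementation: the Reaction dataclass shown here, its field names, and the toy matches() predicate (a plain substring test standing in for real substructure matching) are all assumptions made for illustration.

```python
import gzip
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field

@dataclass
class Reaction:
    """Hypothetical container: a template plus its matched building blocks."""
    template: str
    available_reactants: list = field(default_factory=list)

def matches(building_block: str, template: str) -> bool:
    # Toy stand-in for substructure matching: a substring test only.
    return template in building_block

def build_collection(building_blocks, templates):
    """Return (all matched building blocks, one Reaction per template)."""
    rxns = [
        Reaction(t, [b for b in building_blocks if matches(b, t)])
        for t in templates
    ]
    matched = sorted({b for r in rxns for b in r.available_reactants})
    return matched, rxns

bblocks = ["CCO", "CC(=O)O", "c1ccccc1"]
matched, rxns = build_collection(bblocks, ["O", "N"])

# Round-trip the collection through a gzip-compressed JSON file.
path = os.path.join(tempfile.mkdtemp(), "rxns.json.gz")
with gzip.open(path, "wt") as f:
    json.dump([asdict(r) for r in rxns], f)
with gzip.open(path, "rt") as f:
    restored = [Reaction(**d) for d in json.load(f)]
```

Building blocks that match no template (here the benzene ring) drop out of the matched list, which mirrors the first filter criterion above.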
💡 All following steps use this matched building blocks <-> reaction template data. You have to specify the correct files for every script so that it can load the right data. It can save some time to store these as environment variables.
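One way to act on this tip from Python is to resolve the paths from the environment once and build each command from them. This is a sketch under stated assumptions: the variable names BUILDING_BLOCKS_FILE and RXN_TEMPLATE_FILE are hypothetical (the project does not prescribe them), while the script path and flags are the ones used for syntree generation in this guide.

```python
import os
import shlex

# Variable names are hypothetical; defaults mirror the paths used in this guide.
DEFAULTS = {
    "BUILDING_BLOCKS_FILE": "data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz",
    "RXN_TEMPLATE_FILE": "data/assets/reaction-templates/hb.txt",
}

def resolve(name, env=os.environ):
    """Return the environment value if set, else the default path."""
    return env.get(name, DEFAULTS[name])

def syntree_command(env=os.environ):
    """Build the argument list for the syntree generation script."""
    return [
        "python", "scripts/03-generate-syntrees.py",
        "--building-blocks-file", resolve("BUILDING_BLOCKS_FILE", env),
        "--rxn-templates-file", resolve("RXN_TEMPLATE_FILE", env),
        "--output-file", "data/pre-process/syntrees/synthetic-trees.json.gz",
        "--number-syntrees", "600000",
    ]

# Print the command in a copy-pasteable, shell-quoted form.
print(shlex.join(syntree_command()))
```

To execute the command instead of printing it, the list can be passed to subprocess.run(cmd, check=True).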
- Generate synthetic trees
Herein we generate the data used for training the networks. The data is generated by randomly selecting building blocks, reaction templates and directives to grow a synthetic tree.

    # Generate synthetic trees
    python scripts/03-generate-syntrees.py \
        --building-blocks-file "data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
        --rxn-templates-file "data/assets/reaction-templates/hb.txt" \
        --output-file "data/pre-process/syntrees/synthetic-trees.json.gz" \
        --number-syntrees "600000"

In a second step, we filter out some synthetic trees to make the data pharmaceutically more interesting. In the original submission, the filter applies to syntrees whose root-node molecule has a QED < 0.5: such a tree is removed at random, with probability 1 - QED/0.5.

    # Filter
    python scripts/04-filter-syntrees.py \
        --input-file "data/pre-process/syntrees/synthetic-trees.json.gz" \
        --output-file "data/pre-process/syntrees/synthetic-trees-filtered.json.gz" \
        --filter "qed + random" \
        --verbose

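One possible reading of the "qed + random" rule can be sketched with the standard library only: a tree whose root molecule has a QED of at least 0.5 is always kept, and a tree below that threshold is dropped with probability 1 - QED/0.5. This reading is an assumption, as is the function name keep_syntree; QED values are passed in as plain floats rather than computed from molecules, so check scripts/04-filter-syntrees.py for the actual behavior.

```python
import random

def keep_syntree(qed, rng, threshold=0.5):
    """Return True if a tree with root-molecule QED `qed` survives the filter.

    Assumed rule: always keep at or above the threshold; below it, drop
    with probability 1 - qed / threshold.
    """
    if qed >= threshold:
        return True
    drop_probability = 1.0 - qed / threshold
    # rng.random() is uniform on [0, 1), so this keeps with
    # probability qed / threshold.
    return rng.random() >= drop_probability

rng = random.Random(0)  # seeded for reproducibility
qeds = [0.9, 0.5, 0.25, 0.0]
kept = [q for q in qeds if keep_syntree(q, rng)]
```

Under this rule a tree with QED 0.25 survives about half of the time, while a tree with QED 0 never does.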
Each synthetic tree is serializable, so we save all trees in a compressed .json file.
- Train the networks
We can construct our {train,val,test}-Dataset from the filtered syntree file. The default split ratio is 6:2:2 (see synnet/config.py).
Each Dataset takes one or more featurizers in its constructor. Change them or their parameters if you want to featurize the data for a network differently. The defaults correspond to the original submission.
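For intuition, a 6:2:2 split can be sketched as a seeded shuffle followed by slicing. The function split_dataset and its parameters are hypothetical and only illustrate the ratio; the project's own split logic and default seed may differ.

```python
import random

def split_dataset(items, ratios=(6, 2, 2), seed=42):
    """Shuffle `items` reproducibly and split them into train/val/test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    # The test set takes the remainder, so no item is lost to rounding.
    test = items[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(10))
```

With ten items this yields six training, two validation, and two test items, and every item lands in exactly one subset.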
Finally, we can train each of the four networks in src/synnet/models/ separately. For example:

    python src/synnet/models/act.py \
        --data "data/pre-process/syntrees/synthetic-trees-filtered.json.gz"

After training a new model, you can use it to make predictions and construct synthetic trees for a given list of molecules.
You can also perform molecular optimization using a genetic algorithm.
Please refer to the README.md for inference instructions.
To visualize trees, there is a hacky script that represents Synthetic Trees as Graphviz diagrams and exports them to *.png.
To demo it:

    python src/synnet/visualize/visualizer.py

Still to be implemented: i) target molecule, ii) reaction nodes.
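To give a feel for what a Graphviz representation of a tree looks like, here is a small standalone sketch that emits DOT text for a nested dictionary. It is not the project's visualizer: the tree_to_dot function and the {"smiles": ..., "children": [...]} format are assumptions chosen for illustration, and, like the script above, it draws molecule nodes only.

```python
def tree_to_dot(tree):
    """Render a nested {"smiles": ..., "children": [...]} dict as DOT text.

    Labels are inserted verbatim, so SMILES containing backslashes or
    double quotes would need escaping first.
    """
    lines = ["digraph syntree {"]
    counter = 0

    def visit(node):
        nonlocal counter
        node_id = f"n{counter}"
        counter += 1
        lines.append(f'  {node_id} [label="{node["smiles"]}"];')
        for child in node.get("children", []):
            child_id = visit(child)
            # Edges point from reactant to product, i.e. bottom-up.
            lines.append(f"  {child_id} -> {node_id};")
        return node_id

    visit(tree)
    lines.append("}")
    return "\n".join(lines)

# Illustrative tree: ethyl acetate assembled from acetic acid and ethanol.
tree = {
    "smiles": "CC(=O)OCC",
    "children": [{"smiles": "CC(=O)O"}, {"smiles": "CCO"}],
}
dot = tree_to_dot(tree)
```

Saving the text as tree.dot and running `dot -Tpng tree.dot -o tree.png` with Graphviz installed renders it as an image.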