Instructions

This document outlines the process of training SynNet from scratch, step by step.

You can use any set of reaction templates and building blocks, but we will illustrate the process with the Hartenfeller-Button reaction templates and Enamine building blocks.

Note: This project depends on many exact filenames. For example, one script saves to a file, and the next reads that file for further processing. It is not a perfect approach, and we are open to feedback.

Let's start.

Step-by-Step

  1. Prepare reaction templates and building blocks.

    Parse and canonicalize SMILES from the .sdf file from enamine.net.

    python scripts/00-extract-smiles-from-sdf.py \
        --input-file="data/assets/building-blocks/enamine-us.sdf" \
        --output-file="data/assets/building-blocks/enamine-us.csv.gz"

    💡 If you have your own data, save the molecules as SMILES in a dataframe column "SMILES".
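As a minimal sketch of that tip, the snippet below writes a few molecules to a gzip-compressed CSV with a single "SMILES" column, using only the standard library. The helper names `save_smiles` and `load_smiles` are hypothetical, and it is an assumption that the downstream scripts accept a file of this shape.

```python
import csv
import gzip
from pathlib import Path


def save_smiles(smiles, path):
    """Write SMILES strings to a gzip-compressed CSV with a "SMILES" header."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["SMILES"])  # column name taken from the tip above
        writer.writerows([s] for s in smiles)
    return path


def load_smiles(path):
    """Read the "SMILES" column back, e.g. to sanity-check the file."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        return [row["SMILES"] for row in csv.DictReader(f)]
```

The resulting file can also be opened with `pandas.read_csv`, which infers gzip compression from the `.gz` extension.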

  2. Filter building blocks + match to reaction templates

    We filter the building blocks based on two criteria:

    1. Building block matches at least one reaction template
    2. Building block passes some heuristics.

    🔈 For the enamine-us & Hartenfeller-Button combination, 138 (0.07%) molecules do not pass the heuristics, and 17358 (8.89%) do not match any reaction template.

    In a first step, we filter all building blocks. In a second step, we save

    • all matched building blocks
    • and a collection of Reactions with their available building blocks as class attributes.

    python scripts/01-filter-building-blocks.py \
        --building-blocks-file "data/assets/building-blocks/enamine-us.csv.gz" \
        --rxn-templates-file "data/assets/reaction-templates/hb.txt" \
        --output-bblock-file "data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
        --output-rxns-collection-file "data/pre-process/building-blocks-rxns/rxns-hb-enamine-us.json.gz" --verbose

    💡 All following steps use this matched building blocks <-> reaction template data. You have to specify the correct files for every script so that it can load the right data. It can save some time to store these paths as environment variables.
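One way to keep these paths in a single place is sketched below. The environment variable names (`BUILDING_BLOCKS_FILE` and so on) and the helper functions are hypothetical; the default paths and command-line flags are the ones used in this document.

```python
import os

# Hypothetical variable names; the defaults are the paths used in step 2.
DEFAULTS = {
    "BUILDING_BLOCKS_FILE": "data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz",
    "RXN_TEMPLATES_FILE": "data/assets/reaction-templates/hb.txt",
    "RXN_COLLECTION_FILE": "data/pre-process/building-blocks-rxns/rxns-hb-enamine-us.json.gz",
}


def resolve_paths(env=None):
    """Return each path, preferring an environment variable over the default."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


def syntree_command(paths, number_syntrees=600000):
    """Assemble the step-3 command as an argument list (e.g. for subprocess.run)."""
    return [
        "python", "scripts/03-generate-syntrees.py",
        "--building-blocks-file", paths["BUILDING_BLOCKS_FILE"],
        "--rxn-templates-file", paths["RXN_TEMPLATES_FILE"],
        "--output-file", "data/pre-process/syntrees/synthetic-trees.json.gz",
        "--number-syntrees", str(number_syntrees),
    ]
```

Setting a variable in the shell before running the helper overrides the corresponding default, so switching to another building-block set only requires changing the environment.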

  3. Generate synthetic trees

    Herein we generate the data used for training the networks. The data is generated by randomly selecting building blocks, reaction templates and directives to grow a synthetic tree.

    # Generate synthetic trees
    python scripts/03-generate-syntrees.py \
        --building-blocks-file "data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
        --rxn-templates-file "data/assets/reaction-templates/hb.txt" \
        --output-file "data/pre-process/syntrees/synthetic-trees.json.gz" \
        --number-syntrees "600000"

    In a second step, we filter out some synthetic trees to make the data pharmaceutically more interesting. In the original submission, the filter applies to syntrees whose root-node molecule has a QED < 0.5, which are removed randomly with a probability of 1 - QED/0.5.

    # Filter
    python scripts/04-filter-syntrees.py \
        --input-file "data/pre-process/syntrees/synthetic-trees.json.gz" \
        --output-file "data/pre-process/syntrees/synthetic-trees-filtered.json.gz" \
        --filter "qed + random" \
        --verbose
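One way to read the "qed + random" rule is sketched below: a tree is always kept when the QED of its root molecule reaches the threshold, and otherwise it is dropped with probability 1 - QED/0.5. This is an interpretation of the description above, not the code in `scripts/04-filter-syntrees.py`. The functions `keep_tree` and `filter_by_qed` are hypothetical, and QED values are passed in as plain floats so the sketch does not depend on RDKit.

```python
import random


def keep_tree(qed, threshold=0.5, rng=random):
    """Decide whether to keep a tree given the QED of its root molecule.

    Assumed rule: keep if qed >= threshold; otherwise drop with
    probability 1 - qed / threshold (keep with probability qed / threshold).
    """
    if qed >= threshold:
        return True
    return rng.random() < qed / threshold


def filter_by_qed(qeds, threshold=0.5, seed=0):
    """Return the indices of the trees kept under the assumed rule."""
    rng = random.Random(seed)  # seeded for reproducible filtering
    return [i for i, q in enumerate(qeds) if keep_tree(q, threshold, rng)]
```

Under this reading, a tree with QED 0.25 survives about half the time, while a tree with QED 0 is always removed.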

    Each synthetic tree is serializable and so we save all trees in a compressed .json file.
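A compressed JSON file like this can be inspected with the standard library alone. The sketch below assumes nothing about the internal layout of the saved trees; it only round-trips arbitrary JSON-serializable data through gzip, and the helper names are hypothetical.

```python
import gzip
import json


def save_json_gz(obj, path):
    """Serialize obj as JSON into a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(obj, f)


def load_json_gz(path):
    """Load JSON from a gzip-compressed text file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

Calling `load_json_gz` on a `.json.gz` file returns ordinary Python dicts and lists, which makes it easy to check what a previous script wrote before passing the file on.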

  4. Train the networks

    We can construct our {train,val,test}-Dataset from the filtered syntree file. The default split ratio is 6:2:2. (See synnet/config.py)
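To illustrate what a 6:2:2 split means in practice, here is a minimal sketch that shuffles tree indices and cuts them into three parts. It is not the project's own splitting code: the function `split_indices`, the shuffling, and the seed are assumptions made for illustration.

```python
import random


def split_indices(n, ratios=(0.6, 0.2, 0.2), seed=42):
    """Shuffle range(n) and cut it into train/val/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test
```

For 10 trees this yields 6, 2, and 2 indices; any rounding leftovers end up in the test set.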

    Each Dataset takes one or more featurizers in its constructor. Change them or their parameters if you want to featurize the data for a network differently. The defaults correspond to the original submission.

    Finally, we can train each of the four networks in src/synnet/models/ separately. For example:

    python src/synnet/models/act.py \
      --data "data/pre-process/syntrees/synthetic-trees-filtered.json.gz"

After training a new model, you can use it to make predictions and construct synthetic trees for a given set of molecules.

You can also perform molecular optimization using a genetic algorithm.

Please refer to the README.md for inference instructions.

Auxiliary Scripts

Visualizing trees

To visualize trees, there is a hacky script that represents Synthetic Trees as Graphviz diagrams and exports them to *.png.

To demo it:

python src/synnet/visualize/visualizer.py

Still to be implemented: i) target molecule, ii) reaction nodes.