# Instructions

This document outlines the process of training SynNet from scratch, step by step.

> :warning: It is still a WIP.

You can use any set of reaction templates and building blocks, but we will illustrate the process with the *Hartenfeller-Button* reaction templates and *Enamine building blocks*.

*Note*: This project depends on exact filenames.
For example, one script will save a file, and the next will read that file for further processing.
It is not a perfect approach; we are open to feedback.

Let's start.

## Step-by-Step

0. Prepare reaction templates and building blocks.

Extract SMILES from the `.sdf` file from enamine.net.

```shell
python scripts/00-extract-smiles-from-sdf.py \
    --input-file="data/assets/building-blocks/enamine-us.sdf" \
    --output-file="data/assets/building-blocks/enamine-us-smiles.csv.gz"
```
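
For intuition, this is roughly what such an extraction looks like with RDKit (a minimal sketch, assuming the file fits in memory and that the output is a one-column gzipped CSV; the actual script may differ):

```python
import gzip

from rdkit import Chem

# Paths mirror the command above.
input_file = "data/assets/building-blocks/enamine-us.sdf"
output_file = "data/assets/building-blocks/enamine-us-smiles.csv.gz"

# SDMolSupplier iterates over the molecules in an .sdf file;
# unparsable entries come back as None and are skipped.
supplier = Chem.SDMolSupplier(input_file)
smiles = [Chem.MolToSmiles(mol) for mol in supplier if mol is not None]

# Assumption: a single "SMILES" column in a gzipped CSV.
with gzip.open(output_file, "wt") as f:
    f.write("SMILES\n")
    f.write("\n".join(smiles) + "\n")
```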

1. Filter building blocks.

We preprocess the building blocks to identify the applicable reactants for each reaction template.
In other words, we filter out all building blocks that do not match any reaction template.
There is no need to keep them, as they cannot act as a reactant.
In a first step, we match all building blocks against each reaction template.
In a second step, we save all matched building blocks
and a collection of `Reaction`s with their available building blocks.

```bash
python scripts/01-filter-building-blocks.py \
    --building-blocks-file "data/assets/building-blocks/enamine-us-smiles.csv.gz" \
    --rxn-templates-file "data/assets/reaction-templates/hb.txt" \
    --output-bblock-file "data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
    --output-rxns-collection-file "data/pre-process/building-blocks-rxns/rxns-hb-enamine-us.json.gz" \
    --verbose
```
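
Conceptually, the matching is a substructure check of each building block against each reactant slot of a template. A minimal sketch with RDKit (the function name and the example template are illustrative, not SynNet's API):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def matches_template(smiles: str, rxn_smarts: str) -> bool:
    """True if the molecule matches at least one reactant slot of the template."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    rxn = AllChem.ReactionFromSmarts(rxn_smarts)
    return any(
        mol.HasSubstructMatch(rxn.GetReactantTemplate(i))
        for i in range(rxn.GetNumReactantTemplates())
    )

# Illustrative amide-coupling template (two reactant slots).
template = "[C:1](=[O:2])[OH].[N;H2:3]>>[C:1](=[O:2])[N:3]"
print(matches_template("CC(=O)O", template))   # True: acetic acid fits the acid slot
print(matches_template("c1ccccc1", template))  # False: benzene fits neither slot
```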

> :bulb: All following steps use this matched building-block <-> reaction-template data. You have to specify the correct files for every script so that it can load the right data. It can save some time to store these paths as environment variables.

2. Pre-compute embeddings

We use the embedding space of the building blocks a lot.
Hence, we pre-compute and store the embeddings.

```bash
python scripts/02-compute-embeddings.py \
    --building-blocks-file "data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
    --output-file "data/pre-process/embeddings/hb-enamine-embeddings.npy" \
    --featurization-fct "fp_256"
```
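
The flag `--featurization-fct "fp_256"` suggests a 256-dimensional fingerprint. As an illustration, a 256-bit Morgan fingerprint can be computed with RDKit like this (radius 2 is an assumption; check the script for the exact featurization):

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fp_256(smiles: str) -> np.ndarray:
    """256-bit Morgan fingerprint as a dense numpy vector (radius 2 assumed)."""
    mol = Chem.MolFromSmiles(smiles)
    bv = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=256)
    arr = np.zeros((256,), dtype=np.float32)
    DataStructs.ConvertToNumpyArray(bv, arr)
    return arr

# Stack per-building-block fingerprints into one (n, 256) matrix, as in the .npy output.
embeddings = np.stack([fp_256(s) for s in ["CCO", "c1ccccc1"]])
print(embeddings.shape)  # (2, 256)
```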

3. Generate *synthetic trees*

Here we generate the data used for training the networks.
The data is generated by randomly selecting building blocks, reaction templates, and actions to grow a synthetic tree.

```bash
# Generate synthetic trees
python scripts/03-generate-syntrees.py \
    --building-blocks-file "data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
    --rxn-templates-file "data/assets/reaction-templates/hb.txt" \
    --output-file "data/pre-process/syntrees/synthetic-trees.json.gz" \
    --number-syntrees "600000"
```
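
Schematically, a tree is grown by repeatedly sampling an action and, if needed, a reaction template plus a matching building block, until an "End" action terminates it. A heavily simplified sketch (all names are illustrative; the real generator lives in the SynNet package and also validates the chemistry at every step):

```python
import random

def grow_syntree(building_blocks: list[str], rxn_templates: list[str], max_steps: int = 10) -> list:
    """Toy sketch of tree generation: sample actions until "End" (illustrative only)."""
    steps = []
    for _ in range(max_steps):
        action = random.choice(["Add", "Expand", "Extend", "End"])
        if action == "End":
            break
        template = random.choice(rxn_templates)
        # The real pipeline samples reactants only from the building blocks
        # matched to this template in step 1, and runs the reaction to obtain
        # the new intermediate molecule.
        reactant = random.choice(building_blocks)
        steps.append((action, template, reactant))
    return steps
```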

4. Filter *synthetic trees*

In a second step, we filter the synthetic trees to make the data pharmaceutically more interesting.
That is, we keep trees whose root molecule has a QED of at least 0.5, and keep trees below that threshold only with probability QED/0.5 (i.e., they are discarded with probability 1 - QED/0.5).

```bash
# Filter
python scripts/04-filter-syntrees.py \
    --input-file "data/pre-process/syntrees/synthetic-trees.json.gz" \
    --output-file "data/pre-process/syntrees/synthetic-trees-filtered.json.gz" \
    --verbose
```

Each *synthetic tree* is serializable, so we save all trees in a compressed `.json` file.
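
For illustration, the keep/discard rule can be sketched as follows (a minimal sketch using RDKit's QED implementation; the function name is illustrative):

```python
import random

from rdkit import Chem
from rdkit.Chem import QED

def keep_tree(root_smiles: str, threshold: float = 0.5) -> bool:
    """Keep trees with QED >= threshold; keep the rest with probability QED/threshold."""
    qed = QED.qed(Chem.MolFromSmiles(root_smiles))
    return qed >= threshold or random.random() < qed / threshold

print(keep_tree("CC(=O)Nc1ccc(O)cc1"))  # paracetamol as the root molecule
```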

5. Split *synthetic trees* into train/valid/test data

We load the `.json` file with all *synthetic trees* and
split it into three files: `{train,valid,test}.json`.
The default split ratio is 6:2:2.

```bash
python scripts/05-split-syntrees.py \
    --input-file "data/pre-process/syntrees/synthetic-trees-filtered.json.gz" \
    --output-dir "data/pre-process/syntrees/" \
    --verbose
```
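
The split itself is straightforward. A minimal sketch of a 6:2:2 split (illustrative, not the script's exact code):

```python
import random

def split_622(items: list) -> tuple[list, list, list]:
    """Shuffle and split into train/valid/test with a 6:2:2 ratio."""
    items = items.copy()
    random.shuffle(items)
    n_train = int(0.6 * len(items))
    n_valid = int(0.2 * len(items))
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]
    return train, valid, test

train, valid, test = split_622(list(range(10)))
print(len(train), len(valid), len(test))  # 6 2 2
```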

6. Featurization

We featurize each *synthetic tree*.
That is, we break down each tree into its iteration steps ("Add", "Expand", "Extend", "End") and featurize each step.
This results in a "state" vector and a corresponding "super step" vector.
We call it a "super step" here, as it contains the featurized data for all networks.

```bash
python scripts/06-featurize-syntrees.py \
    --input-dir "data/pre-process/syntrees/" \
    --output-dir "data/featurized/" \
    --verbose
```

This script will load the `{train,valid,test}` data, featurize it, and save it in
- `<output-dir>/{train,valid,test}_states.npz` and
- `<output-dir>/{train,valid,test}_steps.npz`.
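
To sanity-check the output, you can load the saved arrays. A sketch assuming the `.npz` files hold scipy sparse matrices (if they were written with `numpy.savez` instead, use `numpy.load`):

```python
from scipy import sparse

# Assumption: one sparse matrix per file, one row per (tree, iteration) step.
states = sparse.load_npz("data/featurized/train_states.npz")
steps = sparse.load_npz("data/featurized/train_steps.npz")
print(states.shape, steps.shape)
```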

The encoders for the molecules must be provided in the script.
A short text summary of the encoders will be saved as well.

7. Split features

Up to this point, we have worked with each (featurized) *synthetic tree* as a whole;
now we split it into "consumable" input/output data for each of the four networks.
This includes picking the right featurized data from the "super step" vector from the previous step.

```bash
python scripts/07-split-data-for-networks.py \
    --input-dir "data/featurized/"
```

This will create 24 new files (3 splits × 4 networks × 2 files for X and y).
All new files will be saved in `<input-dir>/Xy`.

8. Train the networks

Finally, we can train each of the four networks in `src/synnet/models/` separately. For example:

```bash
python src/synnet/models/act.py
```

After training a new model, you can use the trained model to make predictions and construct synthetic trees for a given list of molecules.

You can also perform molecular optimization using a genetic algorithm.

Please refer to the [README.md](./README.md) for inference instructions.

## Auxiliary Scripts

### Visualizing trees

To visualize trees, there is a hacky script that represents *Synthetic Trees* as [mermaid](https://github.com/mermaid-js/mermaid) diagrams.

To demo it:

```bash
python src/synnet/visualize/visualizer.py
```

Still to be implemented: i) target molecule, ii) "end" action.

To render the markdown file, including the diagram, directly in VS Code, install the extension [vscode-markdown-mermaid](https://github.com/mjbvz/vscode-markdown-mermaid) and use the built-in markdown preview.

*Info*: If the images of the molecules do not load, edit and re-save the markdown file, for example by adding and deleting a character with the preview open. We are not sure why this happens.