Merge pull request #17 from chrulm/master
Refactor SynNet
Rocío Mercado authored Oct 12, 2022
2 parents 658628d + cc78ab1 commit 3ccf4fc
Showing 97 changed files with 5,748 additions and 6,358 deletions.
150 changes: 93 additions & 57 deletions .gitignore
@@ -1,22 +1,38 @@
# Code TODOs
TODOs

# Certain unittest files
tests/data/states_0_train.npz
tests/data/steps_0_train.npz
tests/data/rxns_hb.json.gz
tests/data/st_data.json.gz
tests/data/X_act_train.npz
tests/data/y_act_train.npz
tests/data/X_rt1_train.npz
tests/data/y_rt1_train.npz
tests/data/X_rxn_train.npz
tests/data/y_rxn_train.npz
tests/data/X_rt2_train.npz
tests/data/y_rt2_train.npz
tests/gin_supervised_contextpred_pre_trained.pth
tests/backup/
# === custom ===

data/
figures/syntrees/
results/
checkpoints/
oracle/
logs/
tmp/
.dev/
.old/
.notes/
.aliases
*.sh

# === template ===

# Created by https://www.toptal.com/developers/gitignore/api/visualstudiocode,python,jupyternotebooks
# Edit at https://www.toptal.com/developers/gitignore?templates=visualstudiocode,python,jupyternotebooks

### JupyterNotebooks ###
# gitignore template for Jupyter Notebooks
# website: http://jupyter.org/

.ipynb_checkpoints
*/.ipynb_checkpoints/*

# IPython
profile_default/
ipython_config.py

# Remove previous ipynb_checkpoints
# git rm -r .ipynb_checkpoints/

### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
@@ -39,7 +55,6 @@ parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
@@ -69,6 +84,7 @@ coverage.xml
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
@@ -91,17 +107,17 @@ instance/
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
@@ -110,7 +126,22 @@ ipython_config.py
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
@@ -147,37 +178,42 @@ dmypy.json
# Pyre type checker
.pyre/

# Vim
*~

# Data
# data/*
.DS_Store
oracle/*
*.json*
*.npy
*logs*
*.gz
*.csv

# test Jupyter Notebook
*.ipynb

# Output files
nohup.out
*.output
*.o
*.out
*.swp
*slurm*
*.sh
*.pth
*.ckpt
*_old*
results
synth_net/params
# pytype static type analyzer
.pytype/

# Old files set to be deleted
tmp/
scripts/oracle
temp.py
# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

### VisualStudioCode ###
.vscode/*
# !.vscode/settings.json
# !.vscode/launch.json
!.vscode/tasks.json
!.vscode/extensions.json
!.vscode/*.code-snippets

# Local History for Visual Studio Code
.history/

# Built Visual Studio Code Extensions
*.vsix

### VisualStudioCode Patch ###
# Ignore all local history of files
.history
.ionide

# Support for Project snippet scope
.vscode/*.code-snippets

# Ignore code-workspaces
*.code-workspace

# End of https://www.toptal.com/developers/gitignore/api/visualstudiocode,python,jupyternotebooks
161 changes: 161 additions & 0 deletions INSTRUCTIONS.md
@@ -0,0 +1,161 @@
# Instructions

This document outlines the process of training SynNet from scratch, step by step.

> :warning: It is still a WIP.

You can use any set of reaction templates and building blocks, but we will illustrate the process with the *Hartenfeller-Button* reaction templates and *Enamine building blocks*.

*Note*: This project depends on many exact filenames.
For example, one script saves to a file, and the next reads that file for further processing.
It is not a perfect approach - we are open to feedback.

Let's start.

## Step-by-Step

0. Prepare reaction templates and building blocks.

Extract SMILES from the `.sdf` file from enamine.net.

```shell
python scripts/00-extract-smiles-from-sdf.py \
--input-file="data/assets/building-blocks/enamine-us.sdf" \
--output-file="data/assets/building-blocks/enamine-us-smiles.csv.gz"
```
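
For reference, here is a minimal sketch of what this extraction boils down to, assuming RDKit is available; the `extract_smiles` helper and the single-column CSV layout are our own illustration, not necessarily what the script does:

```python
# Hypothetical sketch only; the actual script's behavior may differ.
import gzip

from rdkit import Chem

def extract_smiles(sdf_file: str, output_file: str) -> None:
    """Read molecules from an .sdf file and write their SMILES to a gzipped CSV."""
    supplier = Chem.SDMolSupplier(sdf_file)
    with gzip.open(output_file, "wt") as f:
        f.write("SMILES\n")  # assumed column header
        for mol in supplier:
            if mol is not None:  # skip entries RDKit fails to parse
                f.write(Chem.MolToSmiles(mol) + "\n")

extract_smiles(
    "data/assets/building-blocks/enamine-us.sdf",
    "data/assets/building-blocks/enamine-us-smiles.csv.gz",
)
```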

1. Filter building blocks.

We preprocess the building blocks to identify applicable reactants for each reaction template.
In other words, we filter out all building blocks that do not match any reaction template.
There is no need to keep them, as they cannot act as a reactant.
In a first step, we match all building blocks with each reaction template.
In a second step, we save all matched building blocks
and a collection of `Reaction`s with their available building blocks.

```bash
python scripts/01-filter-building-blocks.py \
--building-blocks-file "data/assets/building-blocks/enamine-us-smiles.csv.gz" \
--rxn-templates-file "data/assets/reaction-templates/hb.txt" \
--output-bblock-file "data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
--output-rxns-collection-file "data/pre-process/building-blocks-rxns/rxns-hb-enamine-us.json.gz" --verbose
```
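
Conceptually, the matching checks every building block against each template's reactant patterns and drops the blocks that match none. A toy sketch with RDKit; the template and SMILES below are placeholders, not the Hartenfeller-Button set:

```python
# Toy matching sketch; template and building blocks are placeholders.
from rdkit import Chem
from rdkit.Chem import AllChem

templates = ["[cH1:1]>>[c:1]Br"]  # one toy bromination template
building_blocks = ["c1ccccc1", "CCO"]  # benzene matches, ethanol does not

reactions = [AllChem.ReactionFromSmarts(t) for t in templates]
for rxn in reactions:
    rxn.Initialize()  # required before querying reactant matches

matched = [
    smi
    for smi in building_blocks
    if any(rxn.IsMoleculeReactant(Chem.MolFromSmiles(smi)) for rxn in reactions)
]
print(matched)  # blocks usable as a reactant for at least one template
```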

> :bulb: All following steps use this matched building-block <-> reaction-template data. You have to specify the correct files for every script so that it can load the right data. It can save some time to store these paths as environment variables.

2. Pre-compute embeddings

We use the embedding space of the building blocks a lot.
Hence, we pre-compute and store the embeddings of all building blocks.

```bash
python scripts/02-compute-embeddings.py \
--building-blocks-file "data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
--output-file "data/pre-process/embeddings/hb-enamine-embeddings.npy" \
--featurization-fct "fp_256"
```
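
As an illustration, and assuming `fp_256` denotes a 256-bit Morgan fingerprint of radius 2 (our assumption; check the script for the exact definition), the pre-computation is conceptually:

```python
# Sketch of the embedding pre-computation; fp_256 semantics are assumed.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def fp_256(smiles: str) -> np.ndarray:
    """256-bit Morgan fingerprint (radius 2) as a float vector."""
    mol = Chem.MolFromSmiles(smiles)
    bv = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=256)
    return np.array(list(bv), dtype=np.float32)

building_blocks = ["c1ccccc1O", "CC(=O)O"]  # placeholder SMILES
embeddings = np.stack([fp_256(smi) for smi in building_blocks])
np.save("hb-enamine-embeddings.npy", embeddings)  # one row per building block
```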

3. Generate *synthetic trees*

Herein we generate the data used for training the networks.
The data is generated by randomly selecting building blocks, reaction templates, and directives to grow a synthetic tree; a toy schematic of this loop follows the command below.

```bash
# Generate synthetic trees
python scripts/03-generate-syntrees.py \
--building-blocks-file "data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
--rxn-templates-file "data/assets/reaction-templates/hb.txt" \
--output-file "data/pre-process/syntrees/synthetic-trees.json.gz" \
--number-syntrees "600000"
```
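
A heavily simplified schematic of this growth loop; the action names follow the iteration steps listed in step 6 below, but everything else is a placeholder, not an actual SynNet class:

```python
# Toy schematic of random synthetic-tree growth; not the real implementation.
import random

ACTIONS = ["add", "expand", "extend", "end"]  # see the iteration steps in step 6

def grow_random_tree(building_blocks, templates, max_steps=10):
    """Stand-in for growing one synthetic tree by random choices."""
    tree = []
    for _ in range(max_steps):
        action = random.choice(ACTIONS)
        if action == "end":
            break
        # Record which action, template, and building block were drawn.
        tree.append((action, random.choice(templates), random.choice(building_blocks)))
    return tree

random.seed(0)
print(grow_random_tree(["bb-1", "bb-2", "bb-3"], ["template-1", "template-2"]))
```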

4. Filter *synthetic trees*

In a second step, we filter the *synthetic trees* to make the data pharmaceutically more interesting.
That is, we filter out trees whose root molecule has a QED < 0.5; such a tree is only kept with probability QED/0.5, i.e., it is discarded with probability 1 - QED/0.5.

```bash
# Filter
python scripts/04-filter-syntrees.py \
--input-file "data/pre-process/syntrees/synthetic-trees.json.gz" \
--output-file "data/pre-process/syntrees/synthetic-trees-filtered.json.gz" \
--verbose
```
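
The keep/discard rule can be sketched with RDKit's QED implementation; the `keep_tree` helper is hypothetical:

```python
# Sketch of the QED-based filter described above.
import random

from rdkit import Chem
from rdkit.Chem import QED

THRESHOLD = 0.5

def keep_tree(root_smiles: str) -> bool:
    qed = QED.qed(Chem.MolFromSmiles(root_smiles))
    # Keep roots with QED >= 0.5 outright; keep the rest with probability
    # QED/0.5, i.e. discard them with probability 1 - QED/0.5.
    return qed >= THRESHOLD or random.random() < qed / THRESHOLD

print(keep_tree("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin as a toy root molecule
```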

Each *synthetic tree* is serializable and so we save all trees in a compressed `.json` file.

5. Split *synthetic trees* into train, valid, and test data

We load the `.json` file with all *synthetic trees* and
split it into three files: `{train,test,valid}.json`.
The default split ratio is 6:2:2.

```bash
python scripts/05-split-syntrees.py \
--input-file "data/pre-process/syntrees/synthetic-trees-filtered.json.gz" \
--output-dir "data/pre-process/syntrees/" --verbose
```
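
Conceptually, the split is just a shuffle followed by slicing at the 6:2:2 boundaries, e.g.:

```python
# Minimal sketch of a 6:2:2 split; the syntrees here are stand-ins.
import random

syntrees = list(range(100))  # stand-in for the loaded synthetic trees
random.seed(42)
random.shuffle(syntrees)

n = len(syntrees)
n_train, n_valid = int(0.6 * n), int(0.2 * n)
splits = {
    "train": syntrees[:n_train],
    "valid": syntrees[n_train : n_train + n_valid],
    "test": syntrees[n_train + n_valid :],
}
print({k: len(v) for k, v in splits.items()})  # {'train': 60, 'valid': 20, 'test': 20}
```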

6. Featurization

We featurize each *synthetic tree*.
That is, we break down each tree into its iteration steps ("Add", "Expand", "Extend", "End") and featurize each step.
This results in a "state" vector and a corresponding "super step" vector.
We call it a "super step" here, as it contains the featurized data for all networks.

```bash
python scripts/06-featurize-syntrees.py \
--input-dir "data/pre-process/syntrees/" \
--output-dir "data/featurized/" --verbose
```

This script will load the `{train,valid,test}` data, featurize it, and save it in
- `<output-dir>/{train,valid,test}_states.npz` and
- `<output-dir>/{train,valid,test}_steps.npz`.

The encoders for the molecules must be provided in the script.
A short text summary of the encoders will be saved as well.
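
To sanity-check the outputs, you can inspect the saved arrays. The sketch below assumes the `.npz` files hold SciPy sparse matrices; if they turn out to be plain NumPy archives, use `np.load` instead:

```python
# Assumes SciPy sparse .npz files; adjust the loader if the format differs.
from scipy import sparse

states = sparse.load_npz("data/featurized/train_states.npz")
steps = sparse.load_npz("data/featurized/train_steps.npz")
print(states.shape, steps.shape)  # one row per (tree, iteration step)
```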

7. Split features

Up to this point, we have worked with each (featurized) *synthetic tree* as a whole;
now we split it into "consumable" input/output data for each of the four networks.
This includes picking the right featurized data from the "super step" vector of the previous step.

```bash
python scripts/07-split-data-for-networks.py \
--input-dir "data/featurized/"
```

This will create 24 new files (3 splits, 4 networks, X + y).
All new files will be saved in `<input-dir>/Xy`.
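
For example, loading the action network's training data might look like the following; we assume the files follow the `X_<network>_<split>.npz` naming pattern and the same sparse format as above:

```python
# Assumed file names and format; adjust to what step 7 actually wrote.
from scipy import sparse

X_train = sparse.load_npz("data/featurized/Xy/X_act_train.npz")
y_train = sparse.load_npz("data/featurized/Xy/y_act_train.npz")
print(X_train.shape, y_train.shape)
```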

8. Train the networks

Finally, we can train each of the four networks in `src/synnet/models/` separately. For example:

```bash
python src/synnet/models/act.py
```

After training a new model, you can use it to make predictions and construct synthetic trees for a given set of molecules.

You can also perform molecular optimization using a genetic algorithm.

Please refer to the [README.md](./README.md) for inference instructions.

## Auxiliary Scripts

### Visualizing trees

To visualize trees, there is a hacky script that represents *Synthetic Trees* as [mermaid](https://github.com/mermaid-js/mermaid) diagrams.

To demo it:

```bash
python src/synnet/visualize/visualizer.py
```

Still to be implemented: i) the target molecule, ii) the "end" action.

To render the markdown file, including the diagram, directly in VS Code, install the extension [vscode-markdown-mermaid](https://github.com/mjbvz/vscode-markdown-mermaid) and use the built-in markdown preview.

*Info*: If the images of the molecules do not load, edit and save the markdown file anywhere, e.g. add and delete a character with the preview open. We are not sure why this happens.
