# Instructions

This document outlines the process of training SynNet from scratch, step by step.

> :warning: It is still a WIP.

You can use any set of reaction templates and building blocks, but we will illustrate the process with the *Hartenfeller-Button* reaction templates and *Enamine building blocks*.

*Note*: This project depends on exact filenames.
For example, one script will save a file, and the next will read that file for further processing.
It is not a perfect approach; we are open to feedback.

Let's start.

## Step-by-Step

0. Prepare reaction templates and building blocks.

Extract SMILES from the `.sdf` file from enamine.net.

```shell
python scripts/00-extract-smiles-from-sdf.py \
    --input-file="data/assets/building-blocks/enamine-us.sdf" \
    --output-file="data/assets/building-blocks/enamine-us-smiles.csv.gz"
```
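
For intuition, this is roughly what such an extraction looks like with RDKit (a minimal sketch, assuming the file fits in memory and that the output is a one-column gzipped CSV; the actual script may differ):

```python
import gzip

from rdkit import Chem

# Paths mirror the command above.
input_file = "data/assets/building-blocks/enamine-us.sdf"
output_file = "data/assets/building-blocks/enamine-us-smiles.csv.gz"

# SDMolSupplier iterates over the molecules in an .sdf file;
# unparsable entries come back as None and are skipped.
supplier = Chem.SDMolSupplier(input_file)
smiles = [Chem.MolToSmiles(mol) for mol in supplier if mol is not None]

# Assumption: a single "SMILES" column in a gzipped CSV.
with gzip.open(output_file, "wt") as f:
    f.write("SMILES\n")
    f.write("\n".join(smiles) + "\n")
```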

1. Filter building blocks.

We preprocess the building blocks to identify the applicable reactants for each reaction template.
In other words, we filter out all building blocks that do not match any reaction template.
There is no need to keep them, as they cannot act as a reactant.
In a first step, we match all building blocks against each reaction template.
In a second step, we save all matched building blocks
and a collection of `Reaction`s with their available building blocks.

```bash
python scripts/01-filter-building-blocks.py \
    --building-blocks-file "data/assets/building-blocks/enamine-us-smiles.csv.gz" \
    --rxn-templates-file "data/assets/reaction-templates/hb.txt" \
    --output-bblock-file "data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
    --output-rxns-collection-file "data/pre-process/building-blocks-rxns/rxns-hb-enamine-us.json.gz" \
    --verbose
```
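
Conceptually, the matching is a substructure check of each building block against each reactant slot of a template. A minimal sketch with RDKit (the function name and the example template are illustrative, not SynNet's API):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def matches_template(smiles: str, rxn_smarts: str) -> bool:
    """True if the molecule matches at least one reactant slot of the template."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    rxn = AllChem.ReactionFromSmarts(rxn_smarts)
    return any(
        mol.HasSubstructMatch(rxn.GetReactantTemplate(i))
        for i in range(rxn.GetNumReactantTemplates())
    )

# Illustrative amide-coupling template (two reactant slots).
template = "[C:1](=[O:2])[OH].[N;H2:3]>>[C:1](=[O:2])[N:3]"
print(matches_template("CC(=O)O", template))   # True: acetic acid fits the acid slot
print(matches_template("c1ccccc1", template))  # False: benzene fits neither slot
```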

> :bulb: All following steps use this matched building-block <-> reaction-template data. You have to specify the correct files for every script so that it can load the right data. It can save some time to store these paths as environment variables.

2. Pre-compute embeddings

We use the embedding space of the building blocks a lot.
Hence, we pre-compute and store the embeddings.

```bash
python scripts/02-compute-embeddings.py \
    --building-blocks-file "data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
    --output-file "data/pre-process/embeddings/hb-enamine-embeddings.npy" \
    --featurization-fct "fp_256"
```
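
The flag `--featurization-fct "fp_256"` suggests a 256-dimensional fingerprint. As an illustration, a 256-bit Morgan fingerprint can be computed with RDKit like this (radius 2 is an assumption; check the script for the exact featurization):

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fp_256(smiles: str) -> np.ndarray:
    """256-bit Morgan fingerprint as a dense numpy vector (radius 2 assumed)."""
    mol = Chem.MolFromSmiles(smiles)
    bv = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=256)
    arr = np.zeros((256,), dtype=np.float32)
    DataStructs.ConvertToNumpyArray(bv, arr)
    return arr

# Stack per-building-block fingerprints into one (n, 256) matrix, as in the .npy output.
embeddings = np.stack([fp_256(s) for s in ["CCO", "c1ccccc1"]])
print(embeddings.shape)  # (2, 256)
```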

3. Generate *synthetic trees*

Here we generate the data used for training the networks.
The data is generated by randomly selecting building blocks, reaction templates, and actions to grow a synthetic tree.

```bash
# Generate synthetic trees
python scripts/03-generate-syntrees.py \
    --building-blocks-file "data/pre-process/building-blocks-rxns/bblocks-enamine-us.csv.gz" \
    --rxn-templates-file "data/assets/reaction-templates/hb.txt" \
    --output-file "data/pre-process/syntrees/synthetic-trees.json.gz" \
    --number-syntrees "600000"
```
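
Schematically, a tree is grown by repeatedly sampling an action and, if needed, a reaction template plus a matching building block, until an "End" action terminates it. A heavily simplified sketch (all names are illustrative; the real generator lives in the SynNet package and also validates the chemistry at every step):

```python
import random

def grow_syntree(building_blocks: list[str], rxn_templates: list[str], max_steps: int = 10) -> list:
    """Toy sketch of tree generation: sample actions until "End" (illustrative only)."""
    steps = []
    for _ in range(max_steps):
        action = random.choice(["Add", "Expand", "Extend", "End"])
        if action == "End":
            break
        template = random.choice(rxn_templates)
        # The real pipeline samples reactants only from the building blocks
        # matched to this template in step 1, and runs the reaction to obtain
        # the new intermediate molecule.
        reactant = random.choice(building_blocks)
        steps.append((action, template, reactant))
    return steps
```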

4. Filter *synthetic trees*

In a second step, we filter the synthetic trees to make the data pharmaceutically more interesting.
That is, we keep trees whose root molecule has a QED of at least 0.5, and keep trees below that threshold only with probability QED/0.5 (i.e., they are discarded with probability 1 - QED/0.5).

```bash
# Filter
python scripts/04-filter-syntrees.py \
    --input-file "data/pre-process/syntrees/synthetic-trees.json.gz" \
    --output-file "data/pre-process/syntrees/synthetic-trees-filtered.json.gz" \
    --verbose
```

Each *synthetic tree* is serializable, so we save all trees in a compressed `.json` file.
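
For illustration, the keep/discard rule can be sketched as follows (a minimal sketch using RDKit's QED implementation; the function name is illustrative):

```python
import random

from rdkit import Chem
from rdkit.Chem import QED

def keep_tree(root_smiles: str, threshold: float = 0.5) -> bool:
    """Keep trees with QED >= threshold; keep the rest with probability QED/threshold."""
    qed = QED.qed(Chem.MolFromSmiles(root_smiles))
    return qed >= threshold or random.random() < qed / threshold

print(keep_tree("CC(=O)Nc1ccc(O)cc1"))  # paracetamol as the root molecule
```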

5. Split *synthetic trees* into train/valid/test data

We load the `.json` file with all *synthetic trees* and
split it into three files: `{train,valid,test}.json`.
The default split ratio is 6:2:2.

```bash
python scripts/05-split-syntrees.py \
    --input-file "data/pre-process/syntrees/synthetic-trees-filtered.json.gz" \
    --output-dir "data/pre-process/syntrees/" \
    --verbose
```
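
The split itself is straightforward. A minimal sketch of a 6:2:2 split (illustrative, not the script's exact code):

```python
import random

def split_622(items: list) -> tuple[list, list, list]:
    """Shuffle and split into train/valid/test with a 6:2:2 ratio."""
    items = items.copy()
    random.shuffle(items)
    n_train = int(0.6 * len(items))
    n_valid = int(0.2 * len(items))
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]
    return train, valid, test

train, valid, test = split_622(list(range(10)))
print(len(train), len(valid), len(test))  # 6 2 2
```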

6. Featurization

We featurize each *synthetic tree*.
That is, we break down each tree into its iteration steps ("Add", "Expand", "Extend", "End") and featurize each step.
This results in a "state" vector and a corresponding "super step" vector.
We call it a "super step" here, as it contains the featurized data for all networks.

```bash
python scripts/06-featurize-syntrees.py \
    --input-dir "data/pre-process/syntrees/" \
    --output-dir "data/featurized/" \
    --verbose
```

This script will load the `{train,valid,test}` data, featurize it, and save it in
- `<output-dir>/{train,valid,test}_states.npz` and
- `<output-dir>/{train,valid,test}_steps.npz`.
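
To sanity-check the output, you can load the saved arrays. A sketch assuming the `.npz` files hold scipy sparse matrices (if they were written with `numpy.savez` instead, use `numpy.load`):

```python
from scipy import sparse

# Assumption: one sparse matrix per file, one row per (tree, iteration) step.
states = sparse.load_npz("data/featurized/train_states.npz")
steps = sparse.load_npz("data/featurized/train_steps.npz")
print(states.shape, steps.shape)
```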

The encoders for the molecules must be provided in the script.
A short text summary of the encoders will be saved as well.

7. Split features

Up to this point, we have worked with each (featurized) *synthetic tree* as a whole;
now we split it into "consumable" input/output data for each of the four networks.
This includes picking the right featurized data from the "super step" vector from the previous step.

```bash
python scripts/07-split-data-for-networks.py \
    --input-dir "data/featurized/"
```

This will create 24 new files (3 splits × 4 networks × 2 files for X and y).
All new files will be saved in `<input-dir>/Xy`.

8. Train the networks

Finally, we can train each of the four networks in `src/synnet/models/` separately. For example:

```bash
python src/synnet/models/act.py
```

After training a new model, you can use the trained model to make predictions and construct synthetic trees for a given list of molecules.

You can also perform molecular optimization using a genetic algorithm.

Please refer to the [README.md](./README.md) for inference instructions.

## Auxiliary Scripts

### Visualizing trees

To visualize trees, there is a hacky script that represents *Synthetic Trees* as [mermaid](https://github.com/mermaid-js/mermaid) diagrams.

To demo it:

```bash
python src/synnet/visualize/visualizer.py
```

Still to be implemented: i) target molecule, ii) "end" action.

To render the markdown file, including the diagram, directly in VS Code, install the extension [vscode-markdown-mermaid](https://github.com/mjbvz/vscode-markdown-mermaid) and use the built-in markdown preview.

*Info*: If the images of the molecules do not load, edit and re-save the markdown file, for example by adding and deleting a character with the preview open. We are not sure why this happens.