EnzymeFlow: Generating Reaction-specific Enzyme Catalytic Pockets through Flow-Matching and Co-Evolutionary Dynamics

The paper is available on arxiv. This repository covers fine-tuning or training on catalytic pockets; pre-training can be found at link.

Requirement

python>=3.11
CUDA=12.1
torch==2.4.1 (any version >=2.0.0 should work)
torch_geometric==2.4.0

pip install mdtraj==1.10.0 (install this first: it also installs numpy and scipy, and installing it later might raise dependency issues)
pip install pytorch-warmup==0.1.1
pip install POT==0.9.4
pip install rdkit==2023.9.5
pip install biopython==1.84
pip install tmtools==0.2.0
pip install geomstats==2.7.0
pip install dm-tree==0.1.8
pip install ml_collections==0.1.1
pip install OpenMM
pip install einx
pip install einops

conda install conda-forge::pdbfixer
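Because most packages above are pinned, it can help to check the environment before training. The following is a minimal sketch and not part of the repository: it compares installed versions against the pins listed above using only the standard library. The names `PINNED` and `check_versions` are illustrative assumptions.

```python
from importlib import metadata

# Pins copied from the requirement list above (pip distribution names).
PINNED = {
    "mdtraj": "1.10.0",
    "pytorch-warmup": "0.1.1",
    "POT": "0.9.4",
    "rdkit": "2023.9.5",
    "biopython": "1.84",
    "tmtools": "0.2.0",
    "geomstats": "2.7.0",
    "dm-tree": "0.1.8",
    "ml_collections": "0.1.1",
}


def check_versions(pinned, get_version=metadata.version):
    """Return {name: (expected, found)} for every missing or mismatched package."""
    problems = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_versions(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pinned package is installed at the listed version. Unpinned packages (OpenMM, einx, einops) are deliberately left out of the check.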

Model Training

See the Data Preparation section below for how we prepare the training data.
configs.py contains all training configurations and hyperparameters.
Train the model with train_ddp.py for parallel training on multiple GPUs (we trained with 4 A40 GPUs).

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 train_ddp.py
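The command above is written for 4 GPUs. As a hedged sketch for other GPU counts, the helper below rebuilds the same command from a list of GPU ids, keeping `CUDA_VISIBLE_DEVICES` and `--nproc_per_node` consistent. The function name `build_launch` is hypothetical and not part of the repository; it only prints the command rather than starting training.

```python
import os
import sys


def build_launch(gpu_ids, script="train_ddp.py"):
    """Return (env, argv) reproducing the launch command for the given GPU ids."""
    if not gpu_ids:
        raise ValueError("at least one GPU id is required")
    env = dict(os.environ)
    # One process per visible GPU, as in the command above.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    argv = [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nproc_per_node={len(gpu_ids)}",
        script,
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = build_launch([0, 1, 2, 3])
    print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], " ".join(argv))
```

To actually launch, the returned pair can be passed to `subprocess.run(argv, env=env, check=True)`.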

Training loads the pre-trained model by default. To train from scratch instead, set ckpt_from_pretrain=False and pretrain_ckpt_path=None in configs.py.

Model Weights

A mini-EnzymeFlow checkpoint is available on Google Drive. After downloading it, place it in the ./checkpoint folder.

Model Inference

An EnzymeFlow inference demo is provided in the Jupyter notebook enzymeflow_demo.ipynb.

Baseline Experiments

1. RFDiff-AA

For RFDiffAA and LigandMPNN, please refer to RFDiffAA-official and LigandMPNN-official. For each enzyme-reaction pair in the evaluation data, we use RFDiffAA with default parameters to generate 100 catalytic pockets (32 residues each) for each unique substrate. We then use LigandMPNN to perform post-hoc sequence prediction (inverse folding) on the generated catalytic pockets.
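The per-substrate bookkeeping described above can be sketched as follows. This is an illustration only, not the authors' pipeline: it deduplicates substrates across enzyme-reaction pairs and records one generation job (100 samples, 32 residues) per unique substrate. The tuple layout `(enzyme_id, substrate)` and the name `plan_generation` are hypothetical, and the sketch does not call RFDiffAA itself.

```python
from collections import OrderedDict

N_SAMPLES = 100      # pockets generated per unique substrate (from the text above)
POCKET_LENGTH = 32   # residues per generated pocket (from the text above)


def plan_generation(pairs, n_samples=N_SAMPLES, pocket_length=POCKET_LENGTH):
    """Map each unique substrate to a single generation job.

    `pairs` is an iterable of (enzyme_id, substrate) tuples; both fields are
    hypothetical stand-ins for the columns of the evaluation data.
    """
    jobs = OrderedDict()
    for enzyme_id, substrate in pairs:
        job = jobs.setdefault(
            substrate,
            {"substrate": substrate, "n_samples": n_samples,
             "pocket_length": pocket_length, "enzymes": []},
        )
        # Several enzymes may share a substrate; they reuse the same job.
        job["enzymes"].append(enzyme_id)
    return list(jobs.values())
```

Keying jobs by substrate means two enzymes acting on the same substrate share one set of generated pockets instead of triggering duplicate runs.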

We provide some RFDiffAA-generated samples in ./data/rfdiffaa_generated folder at link.

We provide LigandMPNN-predicted sequences for RFDiffAA-generated pockets at file.

We provide CLEAN-predicted EC-Class for LigandMPNN-predicted pocket sequences at file.

2. Enzyme Commission Classification

Baselines such as RFDiffAA do not predict an EC class for the catalytic pockets they design. We therefore use CLEAN to infer the EC class from the sequence representations of these pockets. For CLEAN, please refer to CLEAN-official or CLEAN-webserver. We use CLEAN with the greedy max-separation approach for EC-class inference.
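To give an intuition for separation-based EC assignment, the sketch below keeps the EC labels whose distances stand apart from the rest by cutting at the largest gap in the sorted distances. This is a deliberately simplified illustration and not CLEAN's actual procedure, which applies its own rules; refer to CLEAN-official for the real implementation. The function name, the `max_candidates` parameter, and the example distances are assumptions.

```python
def max_separation_select(distances, max_candidates=10):
    """Pick the labels whose distances are separated from the remaining ones.

    `distances` maps an EC label to a distance (smaller means a closer match).
    The closest `max_candidates` labels are sorted, the largest gap between
    consecutive distances is located, and every label before that gap is kept.
    """
    ranked = sorted(distances.items(), key=lambda kv: kv[1])[:max_candidates]
    if len(ranked) < 2:
        return [label for label, _ in ranked]
    gaps = [ranked[i + 1][1] - ranked[i][1] for i in range(len(ranked) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)  # index of the largest gap
    return [label for label, _ in ranked[: cut + 1]]


if __name__ == "__main__":
    example = {"1.1.1.1": 0.10, "1.1.1.2": 0.12, "2.7.1.1": 0.80, "3.1.1.1": 0.85}
    print(max_separation_select(example))
```

In the example, the two closest labels are returned because the jump from 0.12 to 0.80 is the largest gap.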

3. ESM3

For ESM3, please refer to ESM3-official. For the sequence representation of each generated catalytic pocket, we use ESM3 to recover the full enzyme sequence (that is, we extend the 32 pocket residues into a protein sequence of 200 residues). We can then perform enzyme retrieval on both (1) pocket sequences and (2) full enzyme sequences. The ESM3 prompting is at link.

4. Pocket-specific Enzyme CLIP
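As a rough illustration of extending a 32-residue pocket into a 200-residue sequence, the sketch below builds a masked prompt string with the pocket residues fixed at chosen positions and every other position masked. It is an assumption-laden sketch: the mask character `_`, the function name, and especially the choice of positions are hypothetical here, and the authors' actual prompting is the one referenced above. The sketch does not call ESM3.

```python
MASK = "_"  # mask character; assumed here, check the ESM3 documentation


def build_masked_prompt(pocket_residues, positions, full_length=200):
    """Place pocket residues at `positions` (0-based) in an otherwise masked sequence."""
    if len(pocket_residues) != len(positions):
        raise ValueError("one position is required per pocket residue")
    if len(set(positions)) != len(positions):
        raise ValueError("positions must be unique")
    prompt = [MASK] * full_length
    for aa, pos in zip(pocket_residues, positions):
        if not 0 <= pos < full_length:
            raise ValueError(f"position {pos} is outside the sequence")
        prompt[pos] = aa  # fixed residue that the model should keep
    return "".join(prompt)
```

A generative model would then be asked to fill the masked positions, so the output keeps the pocket residues and completes the remaining ones.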

For ranking-based retrieval evaluation, please refer to RectZyme-paper. We train a pocket-specific enzyme CLIP model with enzyme-pocket features computed by the latest ESM3 and reaction features computed by MAT-2D. The training data are the 60%-homology set (~50,000 positive samples), and the evaluation data are the unique, non-repeated samples. As in ClipZyme, training negatives are training enzymes that are not annotated to catalyze a given reaction. Evaluation does not use negative data.
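For readers unfamiliar with ranking-based retrieval, the sketch below shows how a rank and two common summary metrics can be computed from embedding similarities. It is a generic, standard-library illustration: the use of cosine similarity, the metric choice (top-k accuracy and mean reciprocal rank), and all function names are assumptions, and the metrics actually reported are defined in the referenced paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_of_target(query, candidates, target_index):
    """1-based rank of candidates[target_index] by similarity to `query`."""
    scores = [cosine(query, c) for c in candidates]
    target = scores[target_index]
    # Rank = 1 + number of candidates scoring strictly higher than the target.
    return 1 + sum(s > target for s in scores)


def top_k_accuracy(ranks, k):
    """Fraction of queries whose true match is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```

Here `query` would be a reaction embedding and `candidates` the enzyme (pocket or full-sequence) embeddings, or the reverse, depending on the retrieval direction being evaluated.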

Data Preparation

1. Enzyme Pocket, Substrate Molecule, Product Molecule Rawdata

$~~~~$ (a) The molecule_structures folder in ./data contains all substrate and product molecules and can be downloaded at link.

$~~~~$ (b) The pocket_fixed_residues/pdb_10A folder in ./data contains all enzyme pockets and can be downloaded at link.

$~~~~$ (c) We provide rawdata-40%homology and metadata-40%homology (40% homology) in the ./data folder. More rawdata (50%, 60%, 80%, and 90% homology) can be downloaded at link.

2. Co-evolution and MSA

$~~~~$ (a) rxn_to_smiles_msa.pkl in ./data contains the reaction MSAs.

$~~~~$ (b) uid_to_protein_msa.pkl in ./data contains the enzyme MSAs and can be downloaded at link.

$~~~~$ (c) vocab.txt in ./data is the co-evolution vocabulary.

When the raw data (enzyme pockets, molecules, and co-evolution) are ready and stored in the right folders, we proceed to process them into metadata.
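Before processing, it may be worth confirming that the raw inputs listed in steps 1 and 2 are in place. The following is a small sketch under the assumption that everything lives under ./data with the names given above; the helper name `missing_inputs` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Relative paths taken from steps 1 and 2 above.
EXPECTED = [
    "molecule_structures",
    "pocket_fixed_residues/pdb_10A",
    "rxn_to_smiles_msa.pkl",
    "uid_to_protein_msa.pkl",
    "vocab.txt",
]


def missing_inputs(data_dir="./data", expected=EXPECTED):
    """Return the expected files or folders that are absent under `data_dir`."""
    root = Path(data_dir)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_inputs()
    print("all raw inputs found" if not missing else f"missing: {missing}")
```

The check only tests for existence; it does not validate the contents of the folders or pickle files.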

3. Process rawdata into metadata by running `process_data.py`.

$~~~~$ (a) Remember to set --rawdata_file_name, e.g., python process_data.py --rawdata_file_name rawdata_cutoff-0.4.csv. Warning: metadata.csv contains absolute paths, so you might need to change them to match your setup.
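One way to handle the absolute paths mentioned in the warning is to rewrite their common prefix. The sketch below is an assumption-based helper, not part of the repository: it makes no assumption about column names and replaces the old prefix in any cell that starts with it. The function name and the example prefixes are hypothetical.

```python
import csv
import io


def rewrite_prefix(csv_text, old_prefix, new_prefix):
    """Replace `old_prefix` with `new_prefix` in every cell that starts with it."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow(
            [new_prefix + cell[len(old_prefix):] if cell.startswith(old_prefix) else cell
             for cell in row]
        )
    return out.getvalue()


if __name__ == "__main__":
    sample = "id,path\n1,/old/root/data/processed/x\n"
    print(rewrite_prefix(sample, "/old/root/", "/new/root/"), end="")
```

To apply it to a file, read the text with `open(path, newline="")`, pass it through `rewrite_prefix`, and write the result to a new file so the original metadata stays untouched.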

4. Processed Metadata.

$~~~~$ (a) Processed metadata is saved to the ./data/processed folder, which includes:

$~~~~$ (b) processed enzymes in the ./data/processed/protein folder.

$~~~~$ (c) processed substrates in the ./data/processed/ligand folder.

$~~~~$ (d) processed co-evolution in the ./data/processed/msa folder.

$~~~~$ (e) processed products in the ./data/processed/product folder.

$~~~~$ (f) A toy example is provided.

5. Evaluation Sample.

$~~~~$ (a) We provide eval-rawdata and eval-metadata in the ./data folder. Warning: metadata.csv contains absolute paths, so you might need to change them to match your setup.

$~~~~$ (b) We provide unprocessed-eval-data in ./data/raw_eval_data folder.

$~~~~$ (c) We provide processed-eval-data in ./data/processed_eval folder.

$~~~~$ (d) You can also process the evaluation data yourself by running process_data.py. Remember to change the configs, e.g., python process_data.py --rawdata_file_name eval-data_cutoff-0.1_unique-subs-enz_100.csv --metadata_file_name metadata_eval.csv.
