This repository contains the implementation of the Vendi Bayesian Optimization (VBO), a Bayesian optimization algorithm that seeks to find diverse solutions to a black-box optimization problem, applied to a metal-organic framework (MOF) design space to optimize for MOF properties. We do this by designing a MOF-specific kernel that accounts for a multitude of informative MOF properties, and using the Vendi Score to encourage the candidate MOFs to be inspected to be diverse from one another.
The workflow of our VBO framework, where diverse MOFs are iteratively selected to optimize NH₃ adsorption capabilities.
For more information, please see our paper, Diversity-driven, efficient exploration of a MOF design space to optimize MOF properties: application to NH₃ adsorption.
You can clone this repository and install the necessary dependencies using the following commands:
conda create --name vbo
conda activate vbo
conda install python==3.11.3
pip install -r requirements.txt
Download the data necessary to run our code at this link.
The downloaded data
folder should be placed in the root directory of this repository, replacing the default data
folder.
- The spreadsheet
DB_10042.csv
contains our entire MOF database, where each row corresponds to a specific MOF. - The spreadsheet
Label1000.csv
contains our randomly selected 1000 MOFs that we labeled to validate our model and methods.
Various quantities can be precomputed once and loaded by our model on demand during the Bayesian optimization loop.
- The NumPy array
DB_10042_node_sims2.npy
contains the kernel matrix based on node SMILES similarities of all data points in the database (denoted as$K_\text{node}$ in our paper). - The NumPy array
DB_10042_linker_sims2.npy
contains the kernel matrix based on linker SMILES similarities of all data points in the database (denoted as$K_\text{linker}$ in our paper). - The NumPy array
DB_10042_distributional_sims.npy
contains the kernel matrix based on similarities of pore size distribution of all data points in the database (denoted as$K_\text{PSD}$ in our paper). - The NumPy array
DB_10042_fixed_covariance_matrix.npy
contains the kernel matrix that combines the four individual base kernels according to Eq. (6) in our paper, where each weight$w_i = 0.25$ (see Section 3.3 of our paper).
The simulated_runs
folder contains results from our trial experiments that make up Figure 6 of our paper.
See the next section for how to reproduce these results.
The script mof_search/utils.py
contains the implementation of our Gaussian process model and other helper functions, which are used in mof_search/run_bo.py
to facilitate an optimization run.
To start an optimization run, run mof_search/run_bo.py
with the arguments of your choice.
- The
method
argument specifies the optimization method to run, supportingVBO
(our method),BO
(the traditional Bayesian optimization baseline), andrandom
(random search). - The
target
argument specifies the metric to be optimized, supportingM_Storage
,M_DBD
, andM_safety
; for more details on these metrics, refer to Section 2.3 of our paper.
An example command is shown below:
cd mof_search
python run_bo.py --method VBO --target M_Storage --seed 0
You can run the three methods with seeds 0–9 to replicate our experiment results.
You can further run the corresponding Jupyter notebooks in notebooks
to follow our data-processing procedure (in subfolder notebooks/Preprocessing
) and recreate the figures in our paper (in subfloder notebooks/Recreate figures
).
@article{liu2024diversity,
title={{Diversity-driven, efficient exploration of a MOF design space to optimize MOF properties: application to NH_3 adsorption}},
author={Liu, Tsung-Wei and Nguyen, Quan and Dieng, Adji Bousso and Gomez-Gualdron, Diego},
journal={ChemRxiv preprint},
year={2024}
}