Skip to content

Latest commit

 

History

History
104 lines (76 loc) · 4.78 KB

README.md

File metadata and controls

104 lines (76 loc) · 4.78 KB

MOFgraph2vec

A metal-organic framework (MOF) recommendation system based on Doc2Vec.

Installation

We suggest you install the package into a seperate conda environment with Python 3.8 or higher. Follow the steps below:

$ conda create -n mof2vec python=3.8 -y
$ conda activate mof2vec
$ git clone https://github.com/XiaoqZhang/mofgraph2vec.git
$ cd mofgraph2vec
$ python -m pip install -e .

Getting started

Train the model

  • To prepare the data, you need a folder that contains a folder with all MOF CIFs and a .csv file with geometry and topology information. Examples are shown in example_data.
  • Change the configurations in conf/. See more details in Configuration parameters.
  • Navigate to the folder experiments/.
  • Run the model by $ python train.py. The first time of featurization may take some time.

Load the pre-trained models

  • The pre-trained models for ARC-MOF and QMOF databases are attached in Releases.
  • Example of loading the pre-trained models is provided in dev/example.ipynb.

Similarity analysis

  • Examples are given in dev/example.ipynb.

Configuration parameters

You can easily tune the model parameters in conf folder.

config.yaml

  • mof2vec_data: Configuration specific to the MOF embedding data.
  • mof2vec_model: Configuration related the MOF embedding model.
  • doc2label_data: Configuration for the data for training the downstream regression model.
  • doc2label_model: Configuration for the downstream regression model.
  • sweep: Wandb hyperparameter sweeping configuration.

logger

This section contains setting for wandb logging and tracking experiments.

seed

Fix random seed, ensuring reproducibility in experiments.

sweep

A boolean indicating whether wandb hyperparameter sweeping is enabled or not.

mode

  • mof2vec: Only run MOF embedding.
  • doc2lable: Only run downstream regression model. In this mode, pretraining should be set to False and MOF embeddings should be provided in embedding_path in the configuration file doc2label_data/default.yaml.
  • workflow: Run the MOF embedding and use the embedding as features for the downstream regression model.

mof2vec_data

  • cif_path: The path to the folder that contains all the .cif files to embed.
  • embed_label: A boolean indicating whether to parsing geometric properties of MOFs.
  • label_path: If embed_label is set to True. The .csv file that contains the geometry information should be provided.
  • descriptors_to_embed: The numeric columns to parse in the .csv file.
  • category_to_embed: The category columns to parse in the .csv file.
  • id_column: The column that contains the cif names.
  • wl_step: The number of steps in extracting the rooted substructures.

mof2vec_model

  • vector_size: Dimensionality of the embeddings.
  • window: The maximum distance between the current and predicted word within a MOF document.
  • min_count: Ignores all words with total frequency lower than this.
  • dm: Defines the training algorithm. If dm=1, ‘distributed memory’ (PV-DM) is used. Otherwise, distributed bag of words (PV-DBOW) is employed.
  • sample: The threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).
  • workers: Use these many worker threads to train the model (=faster training with multicore machines).
  • alpha: The initial learning rate.
  • epochs: Number of iterations (epochs) over the corpus. More parameters can be added. Find the details in gensim.models.doc2vec.Doc2Vec

doc2label_data

  • pretraining: If set True, the MOF embedding model is trained from scratch. If set to False, load MOF embeddings from pre-trained models. In this case, either embedding_model_path or embedding_path should be provided.
  • embedding_model_path: Pretrained embedding model path.
  • embedding_path: MOF embeddings from pre-trained models.
  • label_path: The path to the .csv file that contains the downstream regression data.
  • task: The task column for the downstream regression model.
  • MOF_id: The column that contains MOF names.
  • train_frac: The training data size.
  • test_frac: The test data size.

doc2label_model

grid_search

Specify whether to turn on a Grid search or not. If True, all the values in params should be provided as list. The grid search is performed with a 5-fold cross-validation. If False, all the values should be float or int.

params

The hyperparameters for the supervised XGBoost model.

n_jobs

Number of jobs to run in parallel.

License

This project is licensed under the MIT License. See the LICENSE file for more information.