pKalculator

pKalculator is a fully automated quantum chemistry (QM)-based workflow that computes the C-H pKa values of molecules. The QM workflow uses GFN2-xTB with ORCA on top. pKalculator also includes an atom-based machine learning (ML) model to predict the C-H pKa values. The ML model (a LightGBM regression model) is based on CM5 atomic charges that are computed using semiempirical tight binding (GFN1-xTB).

For more, see pKalculator: A pKa predictor for C-H bonds

Installation

We recommend using conda to get the required dependencies

conda env create -f environment.yml && conda activate pkalculator

We recommend downloading the precompiled binaries for the latest version of xTB (v. 6.7.0)

mkdir dep; cd dep; wget https://github.com/grimme-lab/xtb/releases/download/v6.7.0/xtb-6.7.0-linux-x86_64.tar.xz; tar -xvf ./xtb-6.7.0-linux-x86_64.tar.xz; mv xtb-dist xtb-6.7.0; cd ..

If this does not work for your system, xTB can also be installed through conda:

conda install -c conda-forge xtb

For more information, please see: https://xtb-docs.readthedocs.io/en/latest/setup.html

In addition, ORCA (v. 5.0.4) is required for the QM workflow. Installation instructions can be found at https://www.orcasoftware.de/tutorials_orca/first_steps/install.html and https://sites.google.com/site/orcainputlibrary/setting-up-orca

ORCA requires a specific path for our QM workflow to work. Therefore, change the paths in the function run_orca_calculation in qm_pkalculator/run_orca.py to match your ORCA installation.

Usage

Both our QM workflow and ML workflow are accessible through the command line in the terminal.

QM workflow

QM calculations

Below is an example of how to start the QM workflow:

python qm_pkalculator/qm_pkalculator.py -f CAM-B3LYP -b def2-TZVPPD -s DMSO -d -o -q

This will start the QM workflow with the test.csv file located at data/qm_calculations/test.csv

The arguments for qm_pkalculator.py are explained below:

Arguments	Description
`-cpus`	Number of cpus per job. Defaults to 4 cpus
`-mem`	Amount of memory per job in GB. Defaults to 8 GB
`-p`	Set the SLURM partition to be used at your HPC. Defaults to kemi1
`-csv`	Path to the CSV file. The file must be comma-separated and contain a 'names' column and a 'smiles' column. Defaults to "data/qm_calculations/test.csv"
`-calc`	Path for saving calculations. Defaults to "data/qm_calculations/calc_test"
`-submit`	Path for saving results from submitit. Defaults to "data/qm_calculations/submit_test"
`-f`	The functional to be used. Defaults to 'CAM-B3LYP'
`-b`	The basis set to be used. Defaults to 'def2-TZVPPD'
`-s`	The solvent to be used. Defaults to 'DMSO'
`-d`	Set if D4 dispersion correction. This is recommended.
`-o`	Set if optimization is needed. This is recommended.
`-q`	Set if frequency computations are required. This is recommended.
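The input file passed with `-csv` only needs the two columns described above. As a minimal sketch (the helper name `write_input_csv`, the file name, and the two example compounds are illustrative assumptions, not part of the repository), such a file could be written with the Python standard library:

```python
import csv
import tempfile
from pathlib import Path


def write_input_csv(path, molecules):
    """Write (name, SMILES) pairs to a comma-separated file
    with a 'names' column and a 'smiles' column."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["names", "smiles"])  # header expected by -csv
        writer.writerows(molecules)
    return path


# Hypothetical compounds, used only for illustration.
molecules = [("comp1", "CC(=O)C"), ("comp2", "CC(=O)Cc1ccccc1")]

# Written to a temporary directory here; any writable path works.
csv_path = write_input_csv(Path(tempfile.gettempdir()) / "my_molecules.csv", molecules)
print(csv_path.read_text())
```

The resulting path can then be given to qm_pkalculator.py through the `-csv` argument.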

If needed, SLURM commands can be updated to work at your HPC.

`timeout_min`	The total time allowed for each SLURM job before it times out.
`slurm_array_parallelism`	Maximum number of SLURM jobs to run simultaneously.
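The SLURM settings above can be collected in one place before they are handed to submitit. The sketch below is an assumption about how this could be organised, not the repository's actual code: the helper `slurm_parameters` is hypothetical, and the values for `timeout_min` and `array_parallelism` are placeholders. The defaults for partition, cpus and memory are the ones documented for `-p`, `-cpus` and `-mem`.

```python
def slurm_parameters(partition="kemi1", cpus=4, mem_gb=8,
                     timeout_min=60, array_parallelism=10):
    """Collect SLURM settings as keyword arguments using submitit's
    parameter names. timeout_min and array_parallelism defaults are
    placeholder values, not taken from pKalculator."""
    return {
        "slurm_partition": partition,
        "cpus_per_task": cpus,
        "mem_gb": mem_gb,
        "timeout_min": timeout_min,
        "slurm_array_parallelism": array_parallelism,
    }


# Example: allow each job two hours and run at most 20 jobs at once.
params = slurm_parameters(timeout_min=120, array_parallelism=20)
print(params)
```

With submitit installed, a dictionary like this could be passed to a `submitit.AutoExecutor` via `executor.update_parameters(**params)`.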

Producing the dataframe with results

The QM workflow produces a preliminary dataframe containing both the neutral and the deprotonated SMILES, which is needed to determine the QM-computed C-H pKa values. By default, it is saved to data/qm_calculations/.

After the QM calculations are completed, please run:

python qm_pkalculator/etl.py

This produces the resulting dataframe with the QM calculations. By default, it is saved to data/qm_calculations/.

The arguments for etl.py are explained below:

Arguments	Description
`-calc`	Path for saving calculations. Defaults to "data/qm_calculations/calc_test"
`-submit`	Path for saving results from submitit. Defaults to "data/qm_calculations/submit_test"
`-prelim`	Path where the preliminary dataframe is located. Defaults to "data/qm_calculations/df_prelim_calc_test.pkl"
`-result`	Path where the resulting dataframe is placed. Defaults to "data/qm_calculations/df_results_calc_test.pkl"

ML workflow

Below is an example of how to use the ML workflow:

python ml_pkalculator/ml_pkalculator.py -s CC(=O)Cc1ccccc1 -n comp2 -m models/reg_model_all_data_dart.txt

The arguments for the ML workflow are explained below:

Arguments	Description
`-s`	SMILES string. Defaults to 'CC(=O)Cc1ccccc1'
`-n`	Name of the compound. Defaults to 'comp2'
`-m`	Which model to be used. Defaults to the full regression model
`-e`	Identify the possible site of reaction within {number} pKa units of the lowest pKa value. Defaults to 0.0.

The workflow then returns a list of tuples: [(0, 23.14), (3, 18.78), (5, 42.42), (6, 42.9), (7, 43.27)]

The first element in each tuple is the atom index and the second element in each tuple is the ML predicted pKa value for that atom index.

The workflow also produces a .png or .svg image of the molecule with its atom indices for easy comparison. In the image, a teal circle highlights the site with the lowest pKa value, together with any sites within {number} pKa units of the lowest pKa value (set with `-e`). The .png or .svg image is saved to data/ml_predictions/ by default.
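To make the selection of highlighted sites concrete, the following minimal sketch filters a list of (atom index, pKa) tuples to those within a given window of the lowest pKa value. The function `sites_within` is a hypothetical helper written for illustration; it is not the repository's implementation.

```python
def sites_within(predictions, window=0.0):
    """Return the (atom_index, pKa) tuples whose pKa lies within
    `window` pKa units of the lowest predicted pKa, most acidic first."""
    if not predictions:
        return []
    lowest = min(pka for _, pka in predictions)
    hits = [(idx, pka) for idx, pka in predictions if pka - lowest <= window]
    return sorted(hits, key=lambda pair: pair[1])


# Predictions from the example above.
predictions = [(0, 23.14), (3, 18.78), (5, 42.42), (6, 42.9), (7, 43.27)]

print(sites_within(predictions))       # [(3, 18.78)]
print(sites_within(predictions, 5.0))  # [(3, 18.78), (0, 23.14)]
```

With a window of 0.0 only atom 3 is selected; widening the window to 5.0 pKa units also includes atom 0.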

Data

Computed data for CM5 charges

Both the QM workflow and the ML workflow use GFN1-xTB to produce CM5 charges. The data from the xTB calculations are saved to data/calc_smi2gcs.

Additional data

All additional data can be found here

Here the data is split into three folders: datasets, qm_data and ml_data. The description for each folder is found below:

Folder	Description
`datasets`	Includes all datasets. Each `.pkl` file contains a pandas DataFrame that can be loaded using the following command: `pd.read_pickle("datasets/{dataset name}", compression={"method": "gzip"})`.
`qm_data/calculations`	Includes all QM calculations, including .xyz files and .log files.
`ml_data/models`	Includes ML models trained on either all data or the training set (80 % of the data).
`ml_data/validation`	Includes data from the cross-validation of the ML models. The .log files give an overview of the performance metrics.

Citation

Beilstein J. Org. Chem. 2024, 20, 1614-1622
