Joss paper #17
New file: GitHub Actions workflow for building the JOSS paper draft

```yaml
name: paper

on:
  push:
    branches:
      - "joss-paper"
    paths:
      - "paper.md"

jobs:
  paper:
    runs-on: ubuntu-latest
    name: Paper Draft
    steps:
      - name: Checkout
        uses: actions/checkout@v2

      - name: Build draft PDF
        uses: openjournals/openjournals-draft-action@master
        with:
          journal: joss
          paper-path: paper.md

      - name: Upload
        uses: actions/upload-artifact@v1
        with:
          name: paper
          # This is the output path where Pandoc will write the compiled
          # PDF. Note, this should be the same directory as the input
          # paper.md
          path: paper.pdf
```

New file: `paper.md`

---
title: 'pyscreener: A Python Wrapper for Computational Docking Software'
tags:
  - Python
  - drug discovery
  - distributed computing
  - molecular docking
authors:
  - name: David E. Graff
    orcid: 0000-0003-1250-3329
    affiliation: "1, 2"
  - name: Connor W. Coley^[corresponding author]
    orcid: 0000-0002-8271-8723
    affiliation: "2, 3"
affiliations:
  - name: Department of Chemistry and Chemical Biology, Harvard University
    index: 1
  - name: Department of Chemical Engineering, Massachusetts Institute of Technology
    index: 2
  - name: Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology
    index: 3
date: XX November 2021
bibliography: refs.bib
---

# Summary

`pyscreener` is a Python library that seeks to alleviate the challenges of large-scale structure-based design using computational docking. It provides a simple and uniform interface, agnostic to the backend docking engine, with which to calculate the docking score of a given molecule in a specified active site. Additionally, `pyscreener` features first-class support for task distribution, allowing users to seamlessly scale their code from a local, multi-core setup to a large, heterogeneous resource allocation.

# Statement of Need

Computational docking is an important technique in structure-based drug design that enables the rapid approximation of binding affinity for a candidate ligand in a matter of CPU seconds. With the growing popularity of ultra-large ligand libraries, docking is increasingly used to sift through hundreds of millions of compounds to try to identify novel and potent binders for a variety of protein targets [@gorgulla_open-source_2020; @lyu_ultra-large_2019]. There are many choices of docking software, and certain programs are better suited to specific protein-ligand contexts (e.g., flexible protein side chains or sugar-like ligand molecules). Switching between these programs is often not trivial, as the input preparation, simulation, and output parsing pipelines differ for each one.

In addition, many of these programs exist only as command-line applications and lack Python bindings.
This presents an additional challenge for their integration into molecular optimization workflows, such as reinforcement learning or genetic algorithms.
Molecular optimization objectives have largely been limited to benchmark tasks, such as penalized logP, QED, JNK3 or GSK3$\beta$ inhibitor classification [@li_multi-objective_2018], and others contained in the GuacaMol library [@brown_guacamol_2019]. These benchmarks are useful for comparing molecular design techniques, but they are not representative of true drug discovery tasks in terms of complexity; computational docking is at least one step in the right direction.

While many molecular optimization techniques propose new molecules in the form of SMILES strings [@elton_deep_2019], most docking programs accept input in the form of molecular supply files with predefined 3D geometry (e.g., Mol2 or PDBQT format). Using the docking score of a molecule as a design objective thus requires an ad hoc implementation for which no standardized approach exists.
The `vina` library [@eberhardt_autodock_2021] is currently the only library capable of performing molecular docking within Python code, but it is limited to docking molecules using solely AutoDock Vina as the backend docking engine.
Moreover, the object model of the `vina` library accepts input ligands only as PDBQT files or strings and still does not address the need to quickly calculate the docking score of a molecule from its SMILES string.
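
To make the contrast concrete, docking a single ligand with the `vina` Python bindings looks roughly like the sketch below (file names and box dimensions are placeholders, and the exact method signatures should be checked against the AutoDock Vina documentation):

```python
# Sketch: docking one pre-prepared PDBQT ligand with the `vina` bindings.
# Note that both receptor and ligand must already exist as PDBQT files;
# there is no direct path from a SMILES string to a docking score.
from vina import Vina

v = Vina(sf_name="vina")
v.set_receptor("receptor.pdbqt")
v.set_ligand_from_file("ligand.pdbqt")
v.compute_vina_maps(center=[-18.2, 14.4, -16.1], box_size=[30.8, 27.8, 29.0])
v.dock(exhaustiveness=8, n_poses=20)
print(v.energies(n_poses=1))  # energy terms of the best-scoring pose
```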

In our work on the MolPAL software [@graff_accelerating_2021], we required a library that is able to accept molecular inputs as SMILES strings and output their corresponding docking scores for a given receptor and docking box. Our use-case also called for docking large batches of molecules across large and distributed hardware setups. Lastly, we desired that our library be flexible with respect to the underlying docking engine, allowing us to use a variety of backend docking software (e.g., Vina [@trott_autodock_2010], Smina [@koes_lessons_2013], QVina [@alhossary_fast_2015], or DOCK6 [@allen_dock_2015]) with minimal changes to client code. To that end, we developed `pyscreener`, a Python library that is flexible with respect to both molecular input format and docking engine and that transparently handles the distribution of docking simulations across large resource allocations.

# Implementation and Performance

The primary design goals with `pyscreener` were to (1) provide a simple interface with which to calculate the docking score of an input small molecule and (2) transparently distribute the corresponding docking simulations across a large resource allocation. The object model of `pyscreener` relies on four classes: `CalculationData`, `CalculationMetadata`, `DockingRunner`, and `DockingVirtualScreen`. A docking simulation in `pyscreener` is fully described by a `CalculationData` and an associated `CalculationMetadata`. High-level information about the simulation that is common to all docking software (e.g., target protein, docking box, name of the ligand, the paths under which inputs and outputs will be stored) is stored in the `CalculationData` object. A `CalculationMetadata` contains the set of software-specific arguments for the simulation, such as `exhaustiveness` for the AutoDock Vina family of software or parameters for SPH file preparation for DOCK6.
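
Conceptually, the data/metadata split looks something like the following minimal sketch. These are illustrative dataclasses, not the actual `pyscreener` definitions, and every field name here is an assumption:

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional, Tuple

@dataclass
class CalculationData:
    """Software-agnostic description of a single docking simulation (sketch)."""
    smi: str                            # ligand as a SMILES string
    receptor: Path                      # target protein structure
    center: Tuple[float, float, float]  # docking box center
    size: Tuple[float, float, float]    # docking box dimensions
    name: str = "ligand"                # used to name input/output files
    in_path: Path = Path("inputs")      # where prepared inputs are written
    out_path: Path = Path("outputs")    # where raw outputs are written
    score: Optional[float] = None       # populated after the simulation runs

@dataclass
class VinaMetadata:
    """Vina-specific arguments (sketch of one CalculationMetadata flavor)."""
    software: str = "vina"
    exhaustiveness: int = 8
    extra_args: list = field(default_factory=list)
```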

The `pyscreener` object model separates data from behavior by placing the responsibility of actually preparing, running, and parsing simulations inside the `DockingRunner` class. This stateless class defines methods to prepare simulation inputs, perform the simulation of the corresponding inputs, and parse the resulting output for a given `CalculationData` and `CalculationMetadata` pair. By placing this logic inside static methods rather than attaching them to the `CalculationData` object, `pyscreener` limits network data transfer overhead during task distribution. The separation also allows for the straightforward addition of new backend docking engines to `pyscreener`, as this entails only the specification of the corresponding `CalculationMetadata` and `DockingRunner` subclasses.
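
Building on the sketch above, the stateless runner pattern might look roughly like this (class, method names, and signatures are again assumptions made for illustration, not the library's actual code):

```python
class VinaRunner:
    """Stateless runner: all behavior lives in static methods (sketch)."""

    @staticmethod
    def prepare(data: CalculationData, meta: VinaMetadata) -> CalculationData:
        # convert the SMILES string into a prepared input file (e.g., PDBQT)
        return data

    @staticmethod
    def run(data: CalculationData, meta: VinaMetadata) -> CalculationData:
        # launch the docking program as a subprocess on the prepared inputs
        return data

    @staticmethod
    def parse(data: CalculationData, meta: VinaMetadata) -> CalculationData:
        # read the raw output and attach the best docking score to `data`
        return data
```

Because the runner holds no state of its own, only the small `CalculationData`/`CalculationMetadata` objects need to cross the network when a task is shipped to a remote worker.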

`pyscreener` also contains the `DockingVirtualScreen` class, which contains a template `CalculationData` and `CalculationMetadata` with which to dock each input molecule, i.e., a virtual screening protocol. The class defines a `__call__()` method which takes SMILES strings or chemical files in any format supported by OpenBabel [@oboyle_open_2011] as input and distributes the corresponding docking simulations across the resources in the given hardware allocation, returning a docking score for each input molecule.

![Wall-time of the computational docking of all 1,615 FDA-approved drugs against 5WIU using QVina2 over six CPU cores for a single-node setup with the specified number of CPU cores. (Left) calculated speedup. (Right) wall time in minutes. Bars reflect mean $\pm$ standard deviation over three runs.\label{fig:local}](figures/timing-local.png)

To handle task distribution, `pyscreener` relies on the `ray` library [@moritz_ray_2018] for distributed computation. For multithreaded docking software, `pyscreener` allows a user to specify how many CPU cores to run each individual docking simulation over, running as many docking simulations in parallel as possible for a given number of total CPU cores in the `ray` cluster. To examine the scaling behavior of `pyscreener`, we docked all 1,615 FDA-approved drugs into the active site of the D4 dopamine receptor (PDB ID 5WIU [@wang_d4_2017]) with QVina2 running over 6 CPU cores. We tested both single-node hardware setups, scaling the total number of CPU cores on one machine, and multi-node setups, scaling the total number of machines. In the single-node case, `pyscreener` exhibited essentially perfect scaling (\autoref{fig:local}) as we scaled the size of the `ray` cluster from 6 to 48 CPU cores while running QVina2 over 6 CPU cores per simulation.
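
The per-simulation CPU reservation is the standard `ray` remote-task idiom; the sketch below shows the general shape (the `dock_one` function is a placeholder, not a `pyscreener` internal):

```python
import ray

ray.init()  # start (or connect to) a ray cluster

@ray.remote(num_cpus=6)          # reserve 6 CPU cores for each docking task
def dock_one(smi: str) -> float:
    # placeholder: prepare inputs, run the docking program, parse the score
    return 0.0

smis = ["c1ccccc1", "CCO", "CC(=O)Oc1ccccc1C(=O)O"]
refs = [dock_one.remote(smi) for smi in smis]  # ray packs tasks onto free cores
scores = ray.get(refs)                         # blocks until all tasks finish
```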

![Wall-time of the computational docking of all 1,615 FDA-approved drugs against 5WIU using QVina2 over six CPU cores for setups using multiple 48-core nodes with the total number of specified CPU cores. (Left) calculated speedup. (Right) wall time in minutes. Bars reflect mean $\pm$ standard deviation over three runs.\label{fig:dist}](figures/timing-dist.png)

In contrast, the multi-node setup exhibits less ideal scaling (\autoref{fig:dist}), with a measured speedup approximately 55% that of perfect scaling. We attribute this behavior to hardware-dependent network communication overhead. Distributing a `sleep(5)` function allocated 6 CPU cores per task (to mimic a fairly quick docking simulation) in parallel over differing hardware setups led to an approximately 2.5% increase in wall-time relative to the single-node setup each time the number of nodes was doubled while the total number of CPU cores was held constant. Such a trend is consistent with network communication being detrimental to scaling behavior. This test also communicated the absolute minimum amount of data over the network, as there were no function arguments or return values. When communicating `CalculationData` objects (approximately 600 bytes in serialized form) over the network, as in `pyscreener`, the drop increased to 6% for each doubling of the total number of nodes. Minimizing the total size of `CalculationData` objects was therefore an explicit implementation goal. Future development will seek to further reduce network communication overhead costs to bring `pyscreener` scaling closer to ideal.
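
The overhead probe described above can be reproduced with a few lines of `ray`; this is a sketch of the measurement rather than the exact script used:

```python
import time
import ray

ray.init(address="auto")  # attach to an existing multi-node ray cluster

@ray.remote(num_cpus=6)
def mock_simulation():
    time.sleep(5)  # stand-in for a fairly quick docking simulation

# submit several rounds of tasks, one task per 6-core slot in the cluster
n_slots = int(ray.cluster_resources()["CPU"]) // 6
start = time.time()
ray.get([mock_simulation.remote() for _ in range(10 * n_slots)])
print(f"wall time: {time.time() - start:.1f} s")
```

Comparing this wall time across allocations with the same total core count but different node counts isolates the network-communication component of the overhead.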

> **Reviewer:** This is quite an interesting observation. Could you also add the configuration of the cluster?
>
> **Author:** https://www.rc.fas.harvard.edu/about/cluster-architecture/ contains the cluster architecture information. Because I'm not completely familiar with the question, what would be the necessary details to include here?

# Examples

To illustrate `pyscreener`, we consider docking benzene (SMILES string `"c1ccccc1"`) against 5WIU with a docking box centered at (-18.2, 14.4, -16.1) with x-, y-, and z-radii of (15.4, 13.9, 14.5). We may perform this docking using AutoDock Vina over 6 CPU cores via `pyscreener` like so:

```python
import pyscreener as ps

metadata = ps.build_metadata("vina")
virtual_screen = ps.virtual_screen(
    "vina",
    receptors=["5WIU.pdb"],
    center=(-18.2, 14.4, -16.1),
    size=(15.4, 13.9, 14.5),
    metadata_template=metadata,
    ncpu=6
)
scores = virtual_screen("c1ccccc1")
print(scores)
# [-4.4]
```

Alternatively, we may dock many molecules by passing a `List` of SMILES strings to the `DockingVirtualScreen`:

```python
smis = [
    "c1ccccc1",
    "O=C(Cc1ccccc1)NC1C(=O)N2C1SC(C2C(=O)O)(C)C",
    "C=CCN1CCC23C4C(=O)CCC2(C1CC5=C3C(=C(C=C5)O)O4)O"
]
scores = virtual_screen(smis)
print(scores.shape)
# (3,)
```

By default, AutoDock Vina docks molecules using an `--exhaustiveness` value of 8, but we may specify a higher value in the `metadata`:

```python
metadata = ps.build_metadata("vina", dict(exhaustiveness=32))
```

We may also utilize other docking engines in the AutoDock Vina family by specifying the `software` for Vina-type metadata. Here, we use the accelerated optimization routine of QVina for faster docking. Note that we also support `software` values of `"smina"` [@koes_lessons_2013] and `"psovina"` [@ng_psovina_2015] in addition to `"vina"` [@trott_autodock_2010] and `"qvina"` [@alhossary_fast_2015].

```python
metadata = ps.build_metadata("vina", dict(software="qvina"))
```

It is also possible to dock molecules using DOCK6 [@allen_dock_2015] in `pyscreener`. To do this, we must first construct DOCK6 metadata and specify that we are creating a DOCK6 virtual screen (note that DOCK6 is not multithreaded and thus does not benefit from being assigned multiple CPU cores per task):

```python
metadata = ps.build_metadata("dock")
virtual_screen = ps.virtual_screen(
    "dock",
    receptors=["5WIU.pdb"],
    center=(-18.2, 14.4, -16.1),
    size=(15.4, 13.9, 14.5),
    metadata_template=metadata
)
scores = virtual_screen("c1ccccc1")
print(scores)
# [-12.35]
```

# Acknowledgements

The authors thank Keir Adams and Wenhao Gao for providing feedback on the preparation of this paper and the `pyscreener` code. The computations in this paper were run on the FASRC Cannon cluster supported by the FAS Division of Science Research Computing Group at Harvard University. The authors also acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for providing HPC and consultation resources that have contributed to the research results reported within this paper. This work was funded by the MIT-IBM Watson AI Lab.

# References

> **Reviewer:** I'm sure that I'm doing something wrong here (maybe using the wrong flags or backend?), but on a cluster when I run `pyscreener --config integration-tests/configs/test_vina.ini -nc 32` it takes 0h 36m 5.93s, and when I run `pyscreener --config integration-tests/configs/test_vina.ini -nc 64` it takes 1h 12m 9.79s. This is with the vina backend, and my hunch is that I'm using the flags incorrectly: instead of parallelizing jobs across the CPUs, I'm giving vina N CPUs for each job. I've attached the full output here: https://gist.github.com/mikemhenry/e0075ee01466145c116f2b8ebeda35b7
>
> cc openjournals/joss-reviews#3950

> **Author:** What does the timing look like for vina with 64 vs. 32 CPUs on your computer? I'm unfamiliar with any firm vina scaling benchmarks, but in my experience I get severely diminishing returns beyond 8 CPUs. If you get no benefit from doubling the CPUs per job from 32 to 64, then all that would serve to do is limit the number of parallel jobs you're able to run (2 simultaneous jobs in the 32 CPUs/job case vs. only 1 in the 64 CPUs/job case).

> **Reviewer:** Okay! I was able to verify the performance claim with a small test: