Merge pull request #8 from ramanathanlab/develop

Showing 4 changed files with 146 additions and 36 deletions.
@@ -1,22 +1,24 @@
-input_shape: [1, 28, 28]
-filters: [16, 16, 16, 16]
-kernels: [3, 3, 3, 3]
-strides: [1, 1, 1, 2]
-affine_widths: [128]
-affine_dropouts: [0.5]
-latent_dim: 3
-lambda_rec: 1.0
-num_data_workers: 4
-prefetch_factor: 2
-batch_size: 64
-device: cuda
-optimizer_name: RMSprop
-optimizer_hparams:
-  lr: 0.001
-  weight_decay: 0.00001
-epochs: 20
-checkpoint_log_every: 20
-plot_log_every: 20
-plot_n_samples: 5000
-plot_method: raw
+# Parameters for a convolutional autoencoder as implemented in the mdlearn package.
+# For additional documentation on the input parameters, see here:
+# https://mdlearn.readthedocs.io/en/latest/pages/_autosummary/mdlearn.nn.models.vae.symmetric_conv2d_vae.html#mdlearn.nn.models.vae.symmetric_conv2d_vae.SymmetricConv2dVAETrainer
+input_shape: [1, 28, 28] # Contact matrix shape, in this case the number of residues in BBA.
+filters: [16, 16, 16, 16] # The convolution filters to use (should be the same number as kernels and strides)
+kernels: [3, 3, 3, 3] # The convolution kernels to use
+strides: [1, 1, 1, 2] # The convolution strides to use
+affine_widths: [128] # The number of neurons in the linear layers (should be the same number as affine_dropouts)
+affine_dropouts: [0.5] # The dropout to use in the linear layers
+latent_dim: 3 # The latent dimension of the autoencoder
+lambda_rec: 1.0 # How much to weight the reconstruction loss vs. the KL divergence
+num_data_workers: 4 # The number of parallel data workers for loading data (performance tuning)
+prefetch_factor: 2 # How many batches each data worker should prefetch (performance tuning)
+batch_size: 64 # The batch size to use during training
+device: cuda # The device to train/infer with (cuda or cpu)
+optimizer_name: RMSprop # The optimizer used to train the model
+optimizer_hparams: # See the torch documentation for the above optimizer for details: https://pytorch.org/docs/stable/optim.html
+  lr: 0.001 # Learning rate for the optimizer
+  weight_decay: 0.00001 # Weight decay for the optimizer
+epochs: 20 # The number of epochs to train for; smaller systems generally need fewer epochs
+checkpoint_log_every: 20 # How often to log a model weight checkpoint file (we only use the last one logged, so set to the number of epochs)
+plot_log_every: 20 # How often to log a plot of the autoencoder latent space (helpful for debugging the model -- clustering should be visually apparent)
+plot_n_samples: 5000 # The number of samples to plot
+plot_method: raw # Plot the "raw" latent coordinates in 3D, the "PCA" of the embeddings, the "TSNE" of the embeddings, etc. See https://mdlearn.readthedocs.io/en/latest/pages/_autosummary/mdlearn.visualize.html
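
For orientation, here is a minimal sketch of how a settings file like this can be consumed in Python. It assumes, as the comments above suggest, that the YAML keys map one-to-one onto keyword arguments of mdlearn's SymmetricConv2dVAETrainer; see the mdlearn documentation linked in the file for the authoritative signature, and note that the file path is taken from the workflow config below.

    import yaml
    from mdlearn.nn.models.vae.symmetric_conv2d_vae import SymmetricConv2dVAETrainer

    # Load the settings shown above (path assumed from the workflow config below).
    with open("examples/bba-folding-workstation/cvae-prod-settings.yaml") as f:
        settings = yaml.safe_load(f)

    # Unpack the YAML keys directly as trainer keyword arguments, e.g.
    # input_shape=[1, 28, 28], latent_dim=3, epochs=20, ...
    trainer = SymmetricConv2dVAETrainer(**settings)

    # trainer.fit(...) would then train on an array of contact maps of shape
    # (n_frames, 1, 28, 28); consult the mdlearn docs for the exact fit/predict API.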
@@ -1,22 +1,104 @@
# This is an example YAML configuration file for running DeepDriveMD on
# a small workstation to fold the 1FME fast-folding protein. All the data
# for this workflow is self-contained within the repository, including
# folded and unfolded structures. This is the best example to debug with,
# as you can scale the number of GPUs, simulation length, and other settings
# using this small biomolecular system (28 residues) instead of a larger,
# compute-intensive system. This workflow configuration takes approximately 8
# hours to run to convergence on 4 V100 GPUs.

# NOTE: There are more parameters that can be configured than are listed
# here. Please refer to deepdrivemd/api.py:DeepDriveMDSettings for details.

# NOTE: simulation_settings, train_settings, and inference_settings encapsulate
# application-specific parameters suited to your biomolecular system and your
# machine learning training and inference algorithms. This is meant as an
# illustrative example of best practices for configuring your experiments
# and exposing a convenient YAML interface to the input parameters you would like
# to tune. You may find that this (or a different) deep learning model or simulation
# script is suited to multiple problems, and DeepDriveMD is flexible enough to let you
# add your own custom solutions. This workflow is geared towards simulating a system
# from a starting state to some target, given as a PDB file via simulation_settings:rmsd_reference_pdb.
# In this case, we are using it to fold the 1FME protein by minimizing the RMSD to
# the native state. To start your modelling, we recommend using the convolutional
# variational autoencoder as configured below as a first step. You may need to adjust
# the inference application if your task cannot be cast as an RMSD minimization problem.

# The simulation input directory. It should contain subfolders with PDB
# files (and optional topology files).
simulation_input_dir: data/1fme
# The number of workers to use for all tasks (3 will be used for simulation,
# 1 will be shared between the train/infer tasks)
num_workers: 4
# The number of simulations to run between training jobs (all the data produced
# throughout the duration of the workflow is used for training)
simulations_per_train: 6
# The number of simulations to run between inference jobs (inference is fast, and
# we want to select outliers as quickly as possible)
simulations_per_inference: 1
# The total number of simulations to run before the workflow stops (1000 is
# essentially infinite and requires stopping the workflow by hand once
# convergence is confirmed)
num_total_simulations: 1000

# Compute settings can be configured by referring to deepdrivemd/parsl.py.
# The `name` field specifies what type of system to run on, and the subsequent
# arguments are conditional on the name field (e.g., a cluster may have a different
# configuration than a workstation).
compute_settings:
  # Specify that we want the workstation Parsl configuration
  name: workstation
  # Identify which GPUs to assign tasks to. It's generally recommended to first check
  # nvidia-smi to see which GPUs are available. The numbers below are analogous to
  # setting CUDA_VISIBLE_DEVICES=0,1,2,3
  available_accelerators: ["0", "1", "2", "3"]

# The simulation settings as exposed in deepdrivemd/apps/openmm_simulation.
# This application uses OpenMM as a simulation backend and can be changed
# to suit your modelling needs. To see the full list of tunable parameters,
# see deepdrivemd/apps/openmm_simulation/__init__.py:MDSimulationSettings
simulation_settings:
  # The number of nanoseconds to run each simulation for
  simulation_length_ns: 10
  # How often to write a coordinate frame to a DCD file
  report_interval_ps: 10
  # The temperature to simulate at
  temperature_kelvin: 300
  # The reference PDB used to compute the RMSD of each reported frame
  rmsd_reference_pdb: data/1fme/1FME-folded.pdb

# The training settings for the convolutional variational autoencoder (CVAE).
# Full documentation, and the paper citation which describes the application of
# the CVAE to contact maps, can be found here: https://mdlearn.readthedocs.io/en/latest/pages/_autosummary/mdlearn.nn.models.vae.symmetric_conv2d_vae.html#module-mdlearn.nn.models.vae.symmetric_conv2d_vae
train_settings:
  # Here we pass a YAML file containing all the CVAE parameters (documentation included).
  # This avoids needing to copy and paste parameters in both train_settings and inference_settings.
  cvae_settings_yaml: examples/bba-folding-workstation/cvae-prod-settings.yaml

# The inference settings. For this workflow, the CVAE is periodically retrained
# on all the reported frames of the simulations. The most recent CVAE model weights
# are always used during inference. The inference application is responsible for analyzing
# the reported simulation frames and selecting a small subset of frames that are
# deemed biologically "interesting", which are then used to restart the subsequent simulations.
# The algorithm employed in this application is as follows (a Python sketch of
# steps 1-4 is given after this file):
# 1. Encode all the contact maps into the latent space learned by the CVAE.
# 2. Run the Local Outlier Factor (LOF) on the latent embeddings: https://scikit-learn.org/stable/auto_examples/neighbors/plot_lof_outlier_detection.html
# 3. Take the top `num_outliers` outliers, which correspond to the most negative LOF scores.
# 4. From the top outliers, re-sort them according to their RMSD to simulation_settings:rmsd_reference_pdb.
# 5. Repeat this on each call to the inference function, analyzing more and more data from the simulations.
#
# Following this procedure, each time a simulation finishes, the workflow submits a new simulation
# job using the frame corresponding to the next best outlier with minimal RMSD to the target state.
# As the workflow progresses, the simulations begin to sample conformers that are closer to the target reference state.
# To read the inference application logic, please see: deepdrivemd/apps/cvae_inference
inference_settings:
  # The same CVAE parameter file as in train_settings
  cvae_settings_yaml: examples/bba-folding-workstation/cvae-prod-settings.yaml
  # The number of latent space outliers to consider when picking the minimal RMSD structures
  num_outliers: 100

# After reading this example and trying out a few configuration changes, you should
# be able to consider whether your system of interest can be cast as an RMSD
# minimization problem or whether you need to make a small adjustment to the inference
# script to change which frames should be preferred during simulation restarts.
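
To make the inference procedure concrete, here is an illustrative scikit-learn sketch of steps 1-4 above. It is not the code in deepdrivemd/apps/cvae_inference: the latent embeddings and RMSD values below are random stand-ins for the CVAE encodings of the contact maps and the per-frame RMSDs to rmsd_reference_pdb reported by the simulation app.

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(0)
    z = rng.normal(size=(5000, 3))            # stand-in for latent embeddings (latent_dim: 3)
    rmsd = rng.uniform(1.0, 12.0, size=5000)  # stand-in for per-frame RMSD to the reference PDB
    num_outliers = 100                        # inference_settings:num_outliers

    # Step 2: run LOF on the latent embeddings; fitting populates
    # negative_outlier_factor_, which is more negative for stronger outliers.
    lof = LocalOutlierFactor(n_neighbors=20)
    lof.fit(z)

    # Step 3: keep the num_outliers frames with the most negative LOF scores.
    outlier_idx = np.argsort(lof.negative_outlier_factor_)[:num_outliers]

    # Step 4: re-sort those outliers by RMSD to the folded reference, so the next
    # simulations restart from the outlier frames closest to the target state.
    restart_order = outlier_idx[np.argsort(rmsd[outlier_idx])]
    print(restart_order[:5])  # frame indices that would seed the next restarts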