
New Tip 1 #241

Merged
merged 23 commits into master from new-tip-1 on Oct 13, 2020
Changes from 11 commits
Commits
23 commits
3e50e14
Move tip 1 content around
Benjamin-Lee Oct 7, 2020
038382f
Merge branch 'master' into new-tip-1
Benjamin-Lee Oct 7, 2020
f024a55
Merge @ajlee21 changes which I didn't resolve properly
Benjamin-Lee Oct 7, 2020
8cd4486
Delete an extra line break
Benjamin-Lee Oct 7, 2020
11f7831
Remove heritability example paragraph per #242
Benjamin-Lee Oct 9, 2020
c7803e8
Move some content into the new first tip
Benjamin-Lee Oct 9, 2020
23cc9db
Clarify that DL is not suited to every sitaution IRL
Benjamin-Lee Oct 9, 2020
7848122
Mention methods for increasing training data
Benjamin-Lee Oct 9, 2020
06c8dbe
Add a paragraph about computing resources required
Benjamin-Lee Oct 9, 2020
9e5f4e4
Add a paragraph on expertise required
Benjamin-Lee Oct 10, 2020
a8c90cd
Add a paragraph on when to use DL
Benjamin-Lee Oct 11, 2020
06322b1
Merge @chevrm suggestions for the opening paragraph
Benjamin-Lee Oct 11, 2020
c378eb6
Reword second sentence
Benjamin-Lee Oct 11, 2020
9a5df55
Rename the file to reflect the new content
Benjamin-Lee Oct 12, 2020
937cc08
Rename tip 1
Benjamin-Lee Oct 12, 2020
6c4f4d1
indutry -> industrial applications (@siminab)
Benjamin-Lee Oct 12, 2020
4f53912
Apply suggestions from @ajlee21 for tip 1
Benjamin-Lee Oct 12, 2020
9313f88
ML -> machine learning and DL -> deep learning
Benjamin-Lee Oct 12, 2020
64da0d0
Add mention of preprocessing after splitting (closes #265)
Benjamin-Lee Oct 13, 2020
fbd10a0
Mention that DL generally requires more than can be done on a single …
Benjamin-Lee Oct 13, 2020
a3dd034
Switch sentence for overfitting from preprocessing the dataset together
Benjamin-Lee Oct 13, 2020
e41ac26
Rephrase and add citation about the minimum data required for DL
Benjamin-Lee Oct 13, 2020
df735df
Merge branch 'master' into new-tip-1
Benjamin-Lee Oct 13, 2020
6 changes: 5 additions & 1 deletion content/02.intro.md
@@ -9,7 +9,11 @@ Since DL is an active and specialized research area, detailed resources are rapi
To address this issue, we solicited input from a community of researchers with varied biological and deep learning interests to write this manuscript collaboratively using the GitHub version control platform [@url:https://github.com/Benjamin-Lee/deep-rules] and Manubot [@doi:10.1371/journal.pcbi.1007128].

Through the course of our discussions, several themes became clear: the importance of understanding and applying ML fundamentals [@doi:10.1186/s13040-017-0155-3] as a baseline for utilizing DL, the necessity for extensive model comparisons with careful evaluation, and the need for critical thought in interpreting results generated by means of DL, among others.
Both DL models and the datasets to which they are applied impact prediction results, so it is important to consider both when generating biological or clinical insights from these methods.
The major similarities between deep learning and traditional computational methods also became apparent.
Although deep learning is a distinct subfield of machine learning, it is still a subfield.
It is subject to the many limitations inherent to machine learning, and many best practices for machine learning also apply to deep learning.
In addition, as with all computational methods, deep learning should be applied in a systematic manner that is reproducible and rigorously tested.

Ultimately, the tips we collate range from high-level guidance to the implementation of best practices.
It is our hope that they will provide actionable, DL-specific advice for both new and experienced DL practitioners alike who would like to employ DL in biological research.
By increasing the accessibility of DL for applications in biological research, we aim to improve the overall quality and reporting of DL in the literature, enabling more researchers to utilize these state-of-the art modeling techniques.
70 changes: 47 additions & 23 deletions content/03.ml-concepts.md
@@ -1,27 +1,51 @@
## Tip 1: Concepts that apply to machine learning also apply to deep learning {#concepts}
## Tip 1: Decide whether your problem is appropriate for deep learning {#appropriate}

Deep learning is a distinct subfield of machine learning, but it is still a subfield.
DL has proven to be an extremely powerful paradigm capable of outperforming “traditional” machine learning approaches in certain contexts, but it is not immune to the many limitations inherent to machine learning.
Many best practices for machine learning also apply to deep learning.
Like all computational methods, deep learning should be applied in a systematic manner that is reproducible and rigorously tested.
Given the impressive accomplishments of DL in recent years and the meteoric rise in publications which rely upon it, it may appear that DL is capable of anything.
Indeed, it is, at least theoretically.
Neural networks are universal function approximators, meaning that they are in principle capable of learning any function [@doi:10.1007/BF02551274; @tag:hornik-approximation].
If DL is so powerful and popular, why would one ever not choose to use it?

Those developing deep learning models should select datasets to train and test model performance that are relevant to the problem at hand; non-salient data can hamper performance or lead to spurious conclusions.
For example, supervised deep learning for phenotype prediction should be applied to datasets that contain large numbers of representative samples from all phenotypes to be predicted.
Biases in testing data can also unduly influence measures of model performance, and it may be difficult to directly identify confounders from the model.
Investigators should consider the extent to which the outcome of interest is likely to be predictable from the input data and begin by thoroughly inspecting the input data.
Suppose that there are robust heritability estimates for a phenotype that suggest that the genetic contribution is modest but a deep learning model predicts the phenotype with very high accuracy.
The model may be capturing a signal unrelated to the genetic mechanisms underlying the phenotype.
In this case, a possible explanation is that people with similar genetic markers may have shared exposures.
This is something that researchers should probe before reporting unrealistic accuracy measures.
A similar situation can arise with tasks for which inter-rater reliability is modest but deep learning models produce very high accuracies.
When coupled with imprudence, datasets that are confounded, biased, skewed, or of low quality will produce models of dubious performance and limited generalizability.
Data exploration with unsupervised learning and data visualization can reveal the biases and technical artifacts in these datasets, providing a critical first step to assessing data quality before any deep learning model is applied.
In some cases, these analyses can identify biases from known technical artifacts or sample processing, which can be corrected through preprocessing techniques to support more accurate application of deep learning models for subsequent prediction or feature identification problems from those datasets.
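To illustrate this kind of exploratory check, the following sketch uses principal component analysis on a synthetic dataset (the "expression matrix," batch sizes, and offset are invented for demonstration); a first principal component that cleanly separates processing batches signals a technical artifact dominating the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy expression matrix: two processing batches, with batch 2 shifted by a
# technical artifact that would dominate any biological signal.
batch1 = rng.normal(0, 1, (50, 20))
batch2 = rng.normal(0, 1, (50, 20)) + 3.0
X = np.vstack([batch1, batch2])
batch = np.array([0] * 50 + [1] * 50)

# Project onto the top principal components before any modeling.
pcs = PCA(n_components=2).fit_transform(X)

# If PC1 separates the batches cleanly, a technical artifact (not biology)
# is the dominant source of variance and should be corrected first.
gap = abs(pcs[batch == 0, 0].mean() - pcs[batch == 1, 0].mean())
print(f"PC1 batch separation: {gap:.1f}")
```

On real data, coloring such a projection by known covariates (batch, site, instrument) is a quick way to surface confounders before training.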
The reason is simple: DL is not suited to every situation in reality.
Training DL models requires a significant amount of data, computing power, and expertise.
In some areas of biology where data collection is thoroughly automated, such as DNA sequencing, large amounts of quality data may be available.
For other areas which rely on manual data collection, there may not be enough data to effectively train models.
Though there are methods to increase the amount of training data such as data augmentation (in which existing data is slightly manipulated to yield "new" samples) and weak supervision (in which simple labeling heuristics are combined to produce noisy, probabilistic labels) [@arxiv:1605.07723v3], these methods cannot overcome a complete shortage of data.
As a rule of thumb, DL should only be considered for datasets with at least one thousand samples, though it is best suited to cases when datasets contain orders of magnitude more samples.
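As a minimal sketch of the data-augmentation idea mentioned above (the "images" here are random arrays standing in for a small imaging dataset; the flip-and-jitter transforms are illustrative, not prescriptive):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a small imaging dataset: 50 samples of 8x8 "images".
images = rng.random((50, 8, 8))

def augment(batch, rng):
    """Return the batch plus slightly manipulated copies of each image:
    horizontal mirrors and small additive pixel noise."""
    flipped = batch[:, :, ::-1]                        # mirror each image
    noisy = batch + rng.normal(0, 0.01, batch.shape)   # jitter pixel values
    return np.concatenate([batch, flipped, noisy])

augmented = augment(images, rng)
print(augmented.shape)  # three views per original sample: (150, 8, 8)
```

Which transforms are safe depends on the domain; a horizontal flip is harmless for many natural images but could destroy meaning in, say, sequence logos or chirality-sensitive data.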

To evaluate deep supervised learning models, they should be trained, tuned, and tested on non-overlapping datasets.
The data used for testing should be locked and only used one-time for evaluating the final model after all tuning steps are completed.
Using a test set more than once will lead to biased estimates of the generalization performance [@arxiv:1811.12808; @doi:10.1162/089976698300017197].
Additionally, many conventional metrics for classification (e.g. area under the receiver operating characteristic curve or AUROC) have limited utility in cases of extreme class imbalance [@pmid:25738806].
Model performance should be evaluated with a carefully picked panel of relevant metrics that make minimal assumptions about the composition of the testing data [@doi:10.1021/acs.molpharmaceut.7b00578], with particular consideration given to metrics that are most directly applicable to the task at hand.
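The split-and-evaluate workflow above can be sketched as follows, using scikit-learn on a synthetic imbalanced dataset (the sizes and class weights are invented for demonstration); note the single, final use of the locked test set and the imbalance-aware metric reported alongside AUROC:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced toy dataset (roughly 5% positives).
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)

# Three non-overlapping sets: train for fitting, validation for tuning,
# and a locked test set used exactly once after all tuning is complete.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

# With extreme class imbalance, report metrics beyond AUROC, e.g. average
# precision (area under the precision-recall curve).
print(f"AUROC: {roc_auc_score(y_test, scores):.2f}")
print(f"Average precision: {average_precision_score(y_test, scores):.2f}")
```

The same pattern applies to DL models; only the estimator changes.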
Furthermore, training DL models can be very demanding, often requiring extensive computing infrastructure and patience to achieve state-of-the-art performance [@doi:10.1109/JPROC.2017.2761740].
In some DL contexts, such as generating human-like text, state-of-the-art models have over one hundred billion parameters [@arxiv:2005.14165].
Training such large models from scratch can be a costly and time-consuming undertaking [@arxiv:1906.02243].
Luckily, most DL research in biology will not require nearly as much computation, and there are methods for reducing the amount of training required in some cases (described in [Tip 5](#architecture)).
Specialized hardware such as discrete graphics processing units (GPUs) or custom DL accelerators can dramatically reduce the time and cost required to train models, but this hardware is not universally accessible.
Currently, both GPU- and DL-optimized accelerator-equipped servers can be rented from cloud providers, though working with these servers adds additional cost and complexity.
As DL becomes more popular, DL-optimized accelerators are likely to be more broadly available (for example, recent-generation iPhones already have such hardware).
In contrast, traditional ML training can often be done on a laptop (or even a \$5 computer [@arxiv:1809.00238]) in seconds to minutes.
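As a rough demonstration of that last point (the dataset size and model are arbitrary, chosen only to be "modestly sized tabular data"), a traditional ML baseline trains on commodity CPU hardware in seconds:

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A modestly sized tabular dataset, the kind where traditional ML shines.
X, y = make_classification(n_samples=5000, n_features=50, random_state=0)

start = time.perf_counter()
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
elapsed = time.perf_counter() - start

print(f"Trained a 100-tree random forest on a CPU in {elapsed:.1f} s")
```

Training even a small DL model to convergence on comparable data typically takes orders of magnitude longer without accelerator hardware.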

In summary, if you are not familiar with machine learning, review a general machine learning guide such as [@doi:10.1186/s13040-017-0155-3] before diving right into deep learning.
Beyond the necessity for greater data and computational capacity in DL, building and training DL models generally requires more expertise than traditional ML models.
Currently, there are several competing programming frameworks for DL such as TensorFlow [@arxiv:1603.04467] and PyTorch [@arxiv:1912.01703].
These frameworks allow users to create and deploy entirely novel model architectures and are widely used in DL research as well as in industry.
This flexibility combined with the rapid development of the DL field has resulted in large, complex frameworks that can be daunting to new users.
For readers new to software development but experienced in biology, gaining computational skills while interfacing with such complex industrial-grade tools can be a challenge.
An advantage of ML over DL is that currently there are more tools capable of automating the model selection and training process.
Automated ML (AutoML) tools such as TPOT [@doi:10.1007/978-3-319-31204-0_9], which is capable of using genetic programming to optimize ML pipelines, and Turi Create [@https://github.com/apple/turicreate], a task-oriented ML and DL framework which automatically tests multiple ML models when training, allow users to achieve competitive performance with only a few lines of code.
Luckily, there are efforts underway to reduce the expertise required to build and use DL models.
Indeed, both TPOT and Turi Create, as well as other tools such as AutoKeras [@arxiv:1806.10282], are capable of abstracting away much of the programming required for "standard" DL tasks.
Projects such as Keras [@https://keras.io], a high-level interface for TensorFlow, make it relatively straightforward to design and test custom DL architectures.
In the future, projects such as these are likely to bring DL experimentation within reach to even more researchers.
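To give a flavor of the "few lines of code" these tools aim for, here is a sketch using scikit-learn's built-in `MLPClassifier` (Keras and the AutoML tools above are not shown; this substitute is chosen only because it is self-contained, and the dataset is scikit-learn's bundled digits task):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Small image-classification task bundled with scikit-learn.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A two-hidden-layer neural network, defined and trained in two lines.
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
clf.fit(X_train, y_train)

print(f"Held-out accuracy: {clf.score(X_test, y_test):.2f}")
```

High-level DL interfaces such as Keras follow the same define-fit-evaluate shape while exposing far more architectural flexibility.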

There are some types of problems in which using DL is strongly indicated over ML.
Assuming a sufficient quantity of quality data is available, applications such as computer vision and natural language processing are likely to benefit from DL.
Indeed, these areas were the first to see significant breakthroughs through the application of DL [@doi:10.1145/3065386] during the recent "DL revolution."
For example, Ferreira et al. used DL to recognize individual birds from images [@doi:10.1111/2041-210X.13436].
This problem was historically difficult but, by combining automatic data collection using RFID tags with data augmentation and transfer learning (explained in [Tip 5](#architecture)), the authors were able to use DL to achieve 90% accuracy in several species.
Other areas that can benefit include generative models, in which new samples can be created based on the training data, and reinforcement learning, in which agents are trained to interact with their environments.
In general, before using DL, investigate whether similar problems (including analogous ones in other domains) have been solved successfully using DL.

Depending on the amount and the nature of the available data, as well as the task to be performed, DL may not always be able to outperform conventional methods.
As an illustration, Rajkomar et al. [@doi:10.1038/s41746-018-0029-1] found that simpler baseline models achieved performance comparable with that of DL in a number of clinical prediction tasks using electronic health records, which may be a surprise to many.
Another example is provided by Koutsoukas et al., who benchmarked several traditional machine learning approaches against deep neural networks for modeling bioactivity data on moderately sized datasets [@doi:10.1186/s13321-017-0226-y].
The researchers found that while well tuned DL approaches generally tend to outperform conventional classifiers, simple methods such as Naive Bayes classification tend to outperform DL as the noise in the dataset increases.
Similarly, Chen et al. [@doi:10.1038/s41746-019-0122-0] tested DL and a variety of traditional ML methods such as logistic regression and random forests on five different clinical datasets, finding that the non-DL methods matched or exceeded the accuracy of the DL model in all cases while requiring an order of magnitude less training time.
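The noise effect reported by Koutsoukas et al. can be reproduced in miniature with synthetic data (the dataset, noise levels, and models below are invented for illustration and make no claim about any particular benchmark):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for noise in (0.0, 0.3):
    y_noisy = y.copy()
    flip = rng.random(len(y)) < noise   # randomly flip a fraction of labels
    y_noisy[flip] = 1 - y_noisy[flip]
    nb = cross_val_score(GaussianNB(), X, y_noisy, cv=5).mean()
    nn = cross_val_score(
        MLPClassifier(max_iter=500, random_state=0), X, y_noisy, cv=5).mean()
    print(f"label noise {noise:.0%}: naive Bayes {nb:.2f}, neural net {nn:.2f}")
```

Running such a head-to-head comparison on your own data, before committing to DL, is exactly the kind of evidence this tip calls for.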

DL is a tool and, like any other tool, must be used after consideration of its strengths and weaknesses for the problem at hand.
Having settled upon DL as a potential solution, practitioners should follow the scientific method and compare its performance to traditional methods, as we will see next.
5 changes: 0 additions & 5 deletions content/04.baselines.md
@@ -9,11 +9,6 @@ Furthermore, in some cases, it can also be useful to combine simple baseline mod
Such hybrid models that combine DL and simpler models can improve generalization performance, model interpretability, and confidence estimation [@arxiv:1803.04765; @arxiv:1805.11783].
In addition, be sure to tune and compare current state-of-the-art tools (_e.g._ bioinformatics pipelines or image analysis workflows), regardless of whether they use ML, in order to gauge the relative effectiveness of your baseline and DL models.

Depending on the amount and the nature of the available data, as well as the task to be performed, deep learning may not always be able to outperform conventional methods.
As an illustration, Rajkomar et al. [@doi:10.1038/s41746-018-0029-1] found that simpler baseline models achieved performance comparable with that of DL in a number of clinical prediction tasks using electronic health records, which may be a surprise to many.
Another example is provided by Koutsoukas et al., who benchmarked several traditional machine learning approaches against deep neural networks for modeling bioactivity data on moderately sized datasets [@doi:10.1186/s13321-017-0226-y].
The researchers found that while well tuned deep learning approaches generally tend to outperform conventional classifiers, simple methods such as Naive Bayes classification tend to outperform deep learning as the noise in the dataset increases.

It is worth noting that conventional off-the-shelf machine learning algorithms (e.g., support vector machines and random forests) are also likely to benefit from hyperparameter tuning.
It can be tempting to train baseline models with these conventional methods using default settings, which may provide acceptable but not stellar performance, but then tune the settings for DL algorithms to further optimize performance.
Hu and Greene [@doi:10.1142/9789813279827_0033] discuss a "Continental Breakfast Included" effect by which unequal hyperparameter tuning for different learning algorithms skews the evaluation of these methods, especially when the performance of an algorithm varies substantially with modest changes to its hyperparameters.
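One way to guard against this effect is to tune the conventional baseline with the same machinery used for the DL model; a minimal sketch with scikit-learn's `GridSearchCV` (the dataset and parameter grid are invented for demonstration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Tune the conventional baseline with the same care given to the DL model,
# rather than accepting its default settings.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 200], "max_depth": [None, 5]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, f"cross-validated accuracy: {grid.best_score_:.2f}")
```

Matching the tuning budget (number of configurations tried) across all compared methods keeps the evaluation fair.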
5 changes: 2 additions & 3 deletions content/05.dl-complexities.md
@@ -3,15 +3,14 @@
Correctly training deep neural networks is a non-trivial process.
There are many different options and potential pitfalls at every stage.
To get good results, you must expect to train many networks with a range of different parameter and hyperparameter settings.
Deep learning can be very demanding, often requiring extensive computing infrastructure and patience to achieve state-of-the-art performance [@doi:10.1109/JPROC.2017.2761740].
The experimentation inherent to DL is often noisy (requiring repetition) and represents a significant organizational challenge.
All code, random seeds, parameters, and results must be carefully corralled using general good coding practices (for example, version control [@doi:10.1371/journal.pcbi.1004947], continuous integration, etc.) in order to be effective and interpretable.
This organization is also key to being able to efficiently share and reproduce your work [@doi:10.1371/journal.pcbi.1003285; @arxiv:1810.08055] as well as to update your model as new data becomes available.

One specific reproducibility pitfall that is often missed in deep learning applications is the default use of non-deterministic algorithms by CUDA/CuDNN backends when using GPUs.
Making this process reproducible is distinct from setting random seeds, which will primarily affect pseudorandom deterministic procedures such as shuffling and initialization, and requires explicitly specifying the use of deterministic algorithms in your DL library [@url:https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#reproducibility].

Similar to [Tip 4](#baselines), try to start with a relatively small network and increase the size and complexity as needed to prevent wasting time and resources.
Beware of the seemingly trivial choices being made implicitly by default settings in your framework of choice, e.g., the choice of optimization algorithm (adaptive methods often lead to faster convergence during training but may lead to worse generalization performance on independent datasets [@url:https://papers.nips.cc/paper/7003-the-marginal-value-of-adaptive-gradient-methods-in-machine-learning]).
These need to be carefully considered and their impacts evaluated (see [Tip 6](#hyperparameters)).

Expand Down
Loading