From 3e50e146ba55d5c1200f90a9c94193f2606fb109 Mon Sep 17 00:00:00 2001 From: Benjamin Lee Date: Tue, 6 Oct 2020 21:48:49 -0400 Subject: [PATCH 01/21] Move tip 1 content around --- content/02.intro.md | 5 +++++ content/03.ml-concepts.md | 28 +--------------------------- content/06.know-your-problem.md | 9 ++++++++- content/09.overfitting.md | 9 +++++++-- content/11.interpretation.md | 2 +- 5 files changed, 22 insertions(+), 31 deletions(-) diff --git a/content/02.intro.md b/content/02.intro.md index 77d7d91c..d14b0ced 100644 --- a/content/02.intro.md +++ b/content/02.intro.md @@ -8,5 +8,10 @@ As DL is an active and specialized research area, detailed resources are rapidly To address this issue, we solicited input from a community of researchers with varied biological and deep learning interests, who wrote this manuscript collaboratively using the GitHub version control platform [@url:https://github.com/Benjamin-Lee/deep-rules] and Manubot [@url:https://greenelab.github.io/meta-review/]. In the course of our discussions, several themes became clear: the importance of understanding and applying ML fundamentals [@doi:10.1186/s13040-017-0155-3] as a baseline for utilizing DL, the necessity for extensive model comparisons with careful evaluation, and the need for critical thought in interpreting results generated by means of DL, among others. +The major similarities between deep learning and traditional computational methods also became apparent. +Although deep learning is a distinct subfield of machine learning, it is still a subfield. +It is subject to the many limitations inherent to machine learning, and many best practices for machine learning also apply to deep learning. +In addition, as with all computational methods, deep learning should be applied in a systematic manner that is reproducible and rigorously tested. + Ultimately, the tips we collate range from high-level guidance to the implementation of best practices, and it is our hope that they will provide actionable, DL-specific advice for both new and experienced DL practitioners alike who would like to employ DL in biological research. By increasing the accessibility of DL for applications in biological research, we aim to improve the overall quality and reporting of DL in the literature, enabling more researchers to utilize these state-of-the art modeling techniques. diff --git a/content/03.ml-concepts.md b/content/03.ml-concepts.md index 1ece80e0..53e53a68 100644 --- a/content/03.ml-concepts.md +++ b/content/03.ml-concepts.md @@ -1,27 +1 @@ -## Tip 1: Concepts that apply to machine learning also apply to deep learning {#concepts} - -Deep learning is a distinct subfield of machine learning, but it is still a subfield. -DL has proven to be an extremely powerful paradigm capable of outperforming “traditional” machine learning approaches in certain contexts, but it is not immune to the many limitations inherent to machine learning. -Many best practices for machine learning also apply to deep learning. -Like all computational methods, deep learning should be applied in a systematic manner that is reproducible and rigorously tested. - -Those developing deep learning models should select datasets to train and test model performance that are relevant to the problem at hand; non-salient data can hamper performance or lead to spurious conclusions. -For example, supervised deep learning for phenotype prediction should be applied to datasets that contain large numbers of representative samples from all phenotypes to be predicted. 
-Biases in testing data can also unduly influence measures of model performance, and it may be difficult to directly identify confounders from the model. -Investigators should consider the extent to which the outcome of interest is likely to be predictable from the input data and begin by thoroughly inspecting the input data. -Suppose that there are robust heritability estimates for a phenotype that suggest that the genetic contribution is modest but a deep learning model predicts the phenotype with very high accuracy. -The model may be capturing signal unrelated to genetic mechanisms underlying the phenotype. -In this case, a possible explanation is that people with similar genetic markers may have shared exposures. -This is something that researchers should probe before reporting unrealistic accuracy measures. -A similar situation can arise with tasks for which inter-rater reliability is modest but deep learning models produce very high accuracies. -When coupled with imprudence, datasets that are confounded, biased, skewed, or of low quality will produce models of dubious performance and limited generalizability. -Data exploration with unsupervised learning and data visualization can reveal the biases and technical artifacts in these datasets, providing a critical first step to assessing data quality before any deep learning model is applied. -In some cases, these analyses can identify biases from known technical artifacts or sample processing which can be corrected through preprocessing techniques to support more accurate application of deep leaning models for subsequent prediction or feature identification problems from those datasets. - -Using a test set more than once will lead to biased estimates of the generalization performance [@arxiv:1811.12808; @doi:10.1162/089976698300017197]. -Deep supervised learning models should be trained, tuned, and tested on non-overlapping datasets. -The data used for testing should be locked and only used one-time for evaluating the final model after all tuning steps are completed. -Also, many conventional metrics for classification (e.g. area under the receiver operating characteristic curve or AUROC) have limited utility in cases of extreme class imbalance [@pmid:25738806]. -Model performance should be evaluated with a carefully picked panel of relevant metrics that make minimal assumptions about the composition of the testing data [@doi:10.1021/acs.molpharmaceut.7b00578], with particular consideration given to metrics that are most directly applicable to the task at hand. - -In summary, if you are not familiar with machine learning, review a general machine learning guide such as [@doi:10.1186/s13040-017-0155-3] before diving right into deep learning. +## Tip 1: Decide whether your problem is appropriate for deep learning {#appropriate} diff --git a/content/06.know-your-problem.md b/content/06.know-your-problem.md index ed9a8b13..fd6618e1 100644 --- a/content/06.know-your-problem.md +++ b/content/06.know-your-problem.md @@ -19,13 +19,20 @@ Study designs will often have different assumptions and caveats, and these canno Many datasets are now passively collected or do not have a specific design, but even in this case it is important to know how individuals or samples were treated. Samples originating from the same study site, oversampling of ethnic groups or zip codes, and sample processing differences are all sources of variation that need to be accounted for. 
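One lightweight way to screen for such entanglement before any modeling is a quick association test between a suspected nuisance variable and the outcome; a minimal sketch, assuming a hypothetical metadata table with `site` and `phenotype` columns:

```python
# Sketch: test whether a nuisance variable (study site) is associated with
# the labels. The table and column names are hypothetical placeholders.
import pandas as pd
from scipy.stats import chi2_contingency

metadata = pd.DataFrame({
    "site": ["A", "A", "B", "B", "A", "B"],
    "phenotype": ["case", "case", "control", "control", "case", "control"],
})  # stand-in for a real sample metadata table

# Cross-tabulate study site against phenotype and test for association.
table = pd.crosstab(metadata["site"], metadata["phenotype"])
chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3g}")

# A tiny p-value is a warning sign: a model might "predict" the phenotype
# simply by recognizing which site a sample came from.
```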
+In all cases, investigators should consider the extent to which the outcome of interest is likely to be predictable from the input data and begin by thoroughly inspecting the input data.
+Data exploration with unsupervised learning and data visualization can reveal the biases and technical artifacts in these datasets, providing a critical first step to assessing data quality before any deep learning model is applied.
+In some cases, these analyses can identify biases from known technical artifacts or sample processing which can be corrected through preprocessing techniques to support more accurate application of deep learning models for subsequent prediction or feature identification problems from those datasets.
+
 Systematic biases, which can be induced by confounding variables, for example, can lead to artifacts or so-called "batch effects."
 As a consequence, models may learn to rely on correlations that are irrelevant in the scientific context of the study and may result in misguided predictions and misleading conclusions [@doi:10.1038/nrg2825].
 Other study design considerations that should not be overlooked include knowing whether a study involves biological or technical replicates or both.
 For example, are some samples collected from the same individuals at different time points?
 Are those time points before and after some treatment?
 If one assumes that all the samples are independent but that is in fact not the case, a variety of issues may arise, including having a lower effective sample size than expected.
-As described in [Tip 1](#concepts), unsupervised learning and other exploratory analyses can be identify such biases in these datasets prior to applying the deep learning model.
+For example, suppose that there are robust heritability estimates for a phenotype that suggest that the genetic contribution is modest but a deep learning model predicts the phenotype with very high accuracy.
+The model may be capturing a signal unrelated to the genetic mechanisms underlying the phenotype.
+In this case, a possible explanation is that people with similar genetic markers may have shared exposures.
+Researchers should probe this possibility before reporting unrealistic accuracy measures.
 In general, deep learning has an increased tendency for overfitting, compared to classical methods, due to the large number of parameters being estimated, making issues of adequate sample size even more important (see [Tip 7](#overfitting)).
 For a large dataset, overfitting may not be a concern, but the modeling power of deep learning may lead to more spurious correlations and thus incorrect interpretation of results (see [Tip 9](#interpretation)).
diff --git a/content/09.overfitting.md b/content/09.overfitting.md
index 483be2a3..e3c3678f 100644
--- a/content/09.overfitting.md
+++ b/content/09.overfitting.md
@@ -8,6 +8,12 @@ To continue the student analogy, a smarter student has greater potential for mem

 ![A visual example of overfitting and failure to generalize. While a high-degree polynomial gets high accuracy on its training data, it performs poorly on data unlike that which it has seen before. In contrast, a simple linear regression works well on both datasets. The greater representational capacity of the polynomial is analogous to using a larger or deeper neural network.](images/overfitting.png){#fig:overfitting-fig}

+To evaluate deep supervised learning models fairly, practitioners should train, tune, and test them on non-overlapping datasets.
+The data used for testing should be locked and only used once for evaluating the final model after all tuning steps are completed.
+Using a test set more than once will lead to biased estimates of the generalization performance [@arxiv:1811.12808; @doi:10.1162/089976698300017197].
+Additionally, many conventional metrics for classification (e.g. area under the receiver operating characteristic curve or AUROC) have limited utility in cases of extreme class imbalance [@pmid:25738806].
+Model performance should be evaluated with a carefully selected panel of relevant metrics that make minimal assumptions about the composition of the testing data [@doi:10.1021/acs.molpharmaceut.7b00578], with particular consideration given to metrics that are most directly applicable to the task at hand.
+
 The simplest way to combat overfitting is to detect it.
 This can be done by splitting the dataset into three parts: a training set, a tuning set (also commonly called a validation set in the machine learning literature), and a test set.
 By exposing the model solely to the training data during fitting, a researcher can use the model's performance on the unseen test data to measure the amount of overfitting.
@@ -22,6 +28,5 @@ Similarly, when dealing with sequence data, holding out data that are evolutiona
 In these cases, simply holding out test data selected from a random partition of the training data is insufficient.
 The best remedy for confounding variables is to [know your data](#know-your-problem) and to test your model on truly independent data.
-
 In essence, practitioners should split data into training, tuning, and single-use testing sets to assess the performance of the model on data that can provide a reliable estimate of its generalization performance.
-Futhermore, be cognizant of the danger of skewed or biased data artificially inflating accuracy.
\ No newline at end of file
+Furthermore, be cognizant of the danger of skewed or biased data artificially inflating accuracy.
diff --git a/content/11.interpretation.md b/content/11.interpretation.md
index 2eb9c25a..ddec8b4a 100644
--- a/content/11.interpretation.md
+++ b/content/11.interpretation.md
@@ -2,7 +2,7 @@

 Once we have trained an accurate deep model, we often want to use it to deduce scientific findings.
 In doing so, we need to take care to correctly interpret the model's predictions.
-We know that the basic tenets of machine learning also apply to deep learning ([Tip 1](#concepts)), but because deep models can be difficult to interpret intuitively, there is a temptation to anthropomorphize the models.
+Because deep models can be difficult to interpret intuitively, there is a temptation to anthropomorphize the models.
 We must resist this temptation.

 A common saying in statistics classes is "correlation doesn't imply causality".

From f024a557c68cf4f4a74ad87188bf6b3ac243cc60 Mon Sep 17 00:00:00 2001
From: Benjamin Lee
Date: Tue, 6 Oct 2020 21:53:44 -0400
Subject: [PATCH 02/21] Merge @ajlee21 changes which I didn't resolve properly

---
 content/02.intro.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/content/02.intro.md b/content/02.intro.md
index 92c9c880..8babf657 100644
--- a/content/02.intro.md
+++ b/content/02.intro.md
@@ -1,6 +1,7 @@
 ## Introduction {#intro}

 Deep learning (DL) is a subfield of machine learning (ML) focusing on artificial neural networks with many layers.
+
 These methods are increasingly being used for the analysis of biological data [@doi:10.1098/rsif.2017.0387].
In many cases, novel biological insights have been revealed through careful evaluation of DL methods ranging from predicting protein-drug binding kinetics [@doi:10.1038/s41467-017-02388-1] to identifying the lab-of-origin of synthetic DNA [@doi:10.1038/s41467-018-05378-z]. However, for researchers and students entirely new to this area and those experienced in using classical ML methods (_e.g._ linear regression), using DL correctly can be a daunting task. @@ -8,11 +9,12 @@ Furthermore, the lack of concise recommendations for biological applications of Since DL is an active and specialized research area, detailed resources are rapidly rendered obsolete, and only a few resources articulate general DL best practices to the scientific community broadly and the biological community specifically. To address this issue, we solicited input from a community of researchers with varied biological and deep learning interests to write this manuscript collaboratively using the GitHub version control platform [@url:https://github.com/Benjamin-Lee/deep-rules] and Manubot [@doi:10.1371/journal.pcbi.1007128]. -In the course of our discussions, several themes became clear: the importance of understanding and applying ML fundamentals [@doi:10.1186/s13040-017-0155-3] as a baseline for utilizing DL, the necessity for extensive model comparisons with careful evaluation, and the need for critical thought in interpreting results generated by means of DL, among others. +Through the course of our discussions, several themes became clear: the importance of understanding and applying ML fundamentals [@doi:10.1186/s13040-017-0155-3] as a baseline for utilizing DL, the necessity for extensive model comparisons with careful evaluation, and the need for critical thought in interpreting results generated by means of DL, among others. The major similarities between deep learning and traditional computational methods also became apparent. Although deep learning is a distinct subfield of machine learning, it is still a subfield. It is subject to the many limitations inherent to machine learning, and many best practices for machine learning also apply to deep learning. In addition, as with all computational methods, deep learning should be applied in a systematic manner that is reproducible and rigorously tested. -Ultimately, the tips we collate range from high-level guidance to the implementation of best practices, and it is our hope that they will provide actionable, DL-specific advice for both new and experienced DL practitioners alike who would like to employ DL in biological research. +Ultimately, the tips we collate range from high-level guidance to the implementation of best practices. +It is our hope that they will provide actionable, DL-specific advice for both new and experienced DL practitioners alike who would like to employ DL in biological research. By increasing the accessibility of DL for applications in biological research, we aim to improve the overall quality and reporting of DL in the literature, enabling more researchers to utilize these state-of-the art modeling techniques. 
From 8cd4486f35d50eb638062016c243f1e0b30042e8 Mon Sep 17 00:00:00 2001
From: Benjamin Lee
Date: Tue, 6 Oct 2020 21:54:16 -0400
Subject: [PATCH 03/21] Delete an extra line break

---
 content/02.intro.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/content/02.intro.md b/content/02.intro.md
index 8babf657..682e3db5 100644
--- a/content/02.intro.md
+++ b/content/02.intro.md
@@ -1,7 +1,6 @@
 ## Introduction {#intro}

 Deep learning (DL) is a subfield of machine learning (ML) focusing on artificial neural networks with many layers.
-
 These methods are increasingly being used for the analysis of biological data [@doi:10.1098/rsif.2017.0387].
 In many cases, novel biological insights have been revealed through careful evaluation of DL methods ranging from predicting protein-drug binding kinetics [@doi:10.1038/s41467-017-02388-1] to identifying the lab-of-origin of synthetic DNA [@doi:10.1038/s41467-018-05378-z].
 However, for researchers and students entirely new to this area and those experienced in using classical ML methods (_e.g._ linear regression), using DL correctly can be a daunting task.

From 11f78319dc3e6e59c6855ec228071bebaad53090 Mon Sep 17 00:00:00 2001
From: Benjamin Lee
Date: Thu, 8 Oct 2020 20:01:54 -0400
Subject: [PATCH 04/21] Remove heritability example paragraph per #242

---
 content/06.know-your-problem.md | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/content/06.know-your-problem.md b/content/06.know-your-problem.md
index fd6618e1..a51cebee 100644
--- a/content/06.know-your-problem.md
+++ b/content/06.know-your-problem.md
@@ -29,10 +29,6 @@ Other study design considerations that should not be overlooked include knowing
 For example, are some samples collected from the same individuals at different time points?
 Are those time points before and after some treatment?
 If one assumes that all the samples are independent but that is in fact not the case, a variety of issues may arise, including having a lower effective sample size than expected.
-For example, suppose that there are robust heritability estimates for a phenotype that suggest that the genetic contribution is modest but a deep learning model predicts the phenotype with very high accuracy.
-The model may be capturing a signal unrelated to the genetic mechanisms underlying the phenotype.
-In this case, a possible explanation is that people with similar genetic markers may have shared exposures.
-Researchers should probe this possibility before reporting unrealistic accuracy measures.
 In general, deep learning has an increased tendency for overfitting, compared to classical methods, due to the large number of parameters being estimated, making issues of adequate sample size even more important (see [Tip 7](#overfitting)).
 For a large dataset, overfitting may not be a concern, but the modeling power of deep learning may lead to more spurious correlations and thus incorrect interpretation of results (see [Tip 9](#interpretation)).
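As a concrete sketch of the unsupervised exploration recommended above, samples can be projected into a low-dimensional space and colored by a technical covariate; the matrix `X` and the `batch` labels below are random stand-ins for real data:

```python
# Sketch: look for batch effects by projecting samples with PCA and coloring
# the points by processing batch. X and batch are synthetic placeholders for
# a real samples-by-features matrix and its per-sample batch labels.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))        # placeholder expression matrix
batch = rng.integers(0, 2, size=100)  # placeholder batch label per sample

# Standardize features, then reduce to two principal components.
pcs = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

for b in np.unique(batch):
    mask = batch == b
    plt.scatter(pcs[mask, 0], pcs[mask, 1], alpha=0.7, label=f"batch {b}")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()

# Clusters that track the batch variable suggest technical structure that a
# deep model could exploit instead of the biology of interest.
```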
From c7803e82706ba7a13a4529d0adb63c186fc75033 Mon Sep 17 00:00:00 2001
From: Benjamin Lee
Date: Thu, 8 Oct 2020 22:08:19 -0400
Subject: [PATCH 05/21] Move some content into the new first tip

---
 content/03.ml-concepts.md | 19 ++++++++++++++++++-
 content/04.baselines.md   |  5 -----
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/content/03.ml-concepts.md b/content/03.ml-concepts.md
index 84e3a463..fbb9abcf 100644
--- a/content/03.ml-concepts.md
+++ b/content/03.ml-concepts.md
@@ -1 +1,18 @@
-## Tip 1: Decide whether your problem is appropriate for deep learning {#appropriate}
\ No newline at end of file
+## Tip 1: Decide whether your problem is appropriate for deep learning {#appropriate}
+
+Given the impressive accomplishments of DL in recent years and the meteoric rise in publications which rely upon it may appear that DL is capable of anything.
+Indeed, it is, at least theoretically.
+Neural networks are universal function approximators, meaning that they are in principle capable of learning any function [@doi:10.1007/BF02551274; @tag:hornik-approximation].
+If DL is so powerful and popular, why would one ever not choose to use it?
+
+The reason is simple: DL is not suited to every situation.
+Training DL models requires a significant amount of data, computing power, and expertise.
+In some areas of biology where data collection is thoroughly automated, such as DNA sequencing, large amounts of quality data may be available.
+For other areas which rely on manual data collection, there may not be enough data to effectively train models.
+As a rule of thumb, DL should only be considered for datasets with at least ten thousand samples, though it is best suited to cases when datasets contain orders of magnitude more samples.
+
+Depending on the amount and the nature of the available data, as well as the task to be performed, deep learning may not always be able to outperform conventional methods.
+As an illustration, Rajkomar et al. [@doi:10.1038/s41746-018-0029-1] found that simpler baseline models achieved performance comparable with that of DL in a number of clinical prediction tasks using electronic health records, which may be a surprise to many.
+Another example is provided by Koutsoukas et al., who benchmarked several traditional machine learning approaches against deep neural networks for modeling bioactivity data on moderately sized datasets [@doi:10.1186/s13321-017-0226-y].
+The researchers found that while well tuned deep learning approaches generally tend to outperform conventional classifiers, simple methods such as Naive Bayes classification tend to outperform deep learning as the noise in the dataset increases.
+Similarly, Chen et al. [@doi:10.1038/s41746-019-0122-0] tested deep learning and a variety of traditional ML methods such as logistic regression and random forests on five different clinical datasets, finding that the non-DL methods matched or exceeded the accuracy of the DL model in all cases while requiring an order of magnitude less training time.
diff --git a/content/04.baselines.md b/content/04.baselines.md
index 23f71f8f..470dbf97 100644
--- a/content/04.baselines.md
+++ b/content/04.baselines.md
@@ -9,11 +9,6 @@ Furthermore, in some cases, it can also be useful to combine simple baseline mod
 Such hybrid models that combine DL and simpler models can improve generalization performance, model interpretability, and confidence estimation [@arxiv:1803.04765; @arxiv:1805.11783].
In addition, be sure to tune and compare current state-of-the-art tools (_e.g._ bioinformatics pipelines or image analysis workflows), regardless of whether they use ML, in order to gauge the relative effectiveness of your baseline and DL models.
-Depending on the amount and the nature of the available data, as well as the task to be performed, deep learning may not always be able to outperform conventional methods.
-As an illustration, Rajkomar et al. [@doi:10.1038/s41746-018-0029-1] found that simpler baseline models achieved performance comparable with that of DL in a number of clinical prediction tasks using electronic health records, which may be a surprise to many.
-Another example is provided by Koutsoukas et al., who benchmarked several traditional machine learning approaches against deep neural networks for modeling bioactivity data on moderately sized datasets [@doi:10.1186/s13321-017-0226-y].
-The researchers found that while well tuned deep learning approaches generally tend to outperform conventional classifiers, simple methods such as Naive Bayes classification tend to outperform deep learning as the noise in the dataset increases.
-
 It is worth noting that conventional off-the-shelf machine learning algorithms (e.g., support vector machines and random forests) are also likely to benefit from hyperparameter tuning.
 It can be tempting to train baseline models with these conventional methods using default settings, which may provide acceptable but not stellar performance, but then tune the settings for DL algorithms to further optimize performance.
 Hu and Greene [@doi:10.1142/9789813279827_0033] discuss a "Continental Breakfast Included" effect by which unequal hyperparameter tuning for different learning algorithms skews the evaluation of these methods, especially when the performance of an algorithm varies substantially with modest changes to its hyperparameters.

From 23cc9dbb9c21b274776346da44351a2999f66965 Mon Sep 17 00:00:00 2001
From: Benjamin Lee
Date: Fri, 9 Oct 2020 14:30:59 -0400
Subject: [PATCH 06/21] Clarify that DL is not suited to every situation IRL

---
 content/03.ml-concepts.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/03.ml-concepts.md b/content/03.ml-concepts.md
index fbb9abcf..6cb2a059 100644
--- a/content/03.ml-concepts.md
+++ b/content/03.ml-concepts.md
@@ -5,7 +5,7 @@ Indeed, it is, at least theoretically.
 Neural networks are universal function approximators, meaning that they are in principle capable of learning any function [@doi:10.1007/BF02551274; @tag:hornik-approximation].
 If DL is so powerful and popular, why would one ever not choose to use it?

-The reason is simple: DL is not suited to every situation.
+The reason is simple: DL is not suited to every situation in reality.
 Training DL models requires a significant amount of data, computing power, and expertise.
 In some areas of biology where data collection is thoroughly automated, such as DNA sequencing, large amounts of quality data may be available.
 For other areas which rely on manual data collection, there may not be enough data to effectively train models.
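To guard against this unequal-tuning effect, the conventional baseline deserves the same hyperparameter search budget as the deep model; a minimal sketch with scikit-learn, where the parameter grid and synthetic dataset are purely illustrative:

```python
# Sketch: tune a random forest baseline with cross-validated grid search so
# that its comparison against a tuned deep model is fair.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    cv=5,
)
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)
print("held-out accuracy:", search.score(X_test, y_test))
```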
From 7848122216300887a2042702dbdb58e2f7acca77 Mon Sep 17 00:00:00 2001 From: Benjamin Lee Date: Fri, 9 Oct 2020 15:10:50 -0400 Subject: [PATCH 07/21] Mention methods for increasing training data --- content/03.ml-concepts.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/content/03.ml-concepts.md b/content/03.ml-concepts.md index 6cb2a059..57c5638d 100644 --- a/content/03.ml-concepts.md +++ b/content/03.ml-concepts.md @@ -9,7 +9,8 @@ The reason is simple: DL is not suited to every situation in reality. Training DL models requires a significant amount of data, computing power, and expertise. In some areas of biology where data collection is thoroughly automated, such as DNA sequencing, large amounts of quality data may be available. For other areas which rely on manual data collection, there may not be enough data to effectively train models. -As a rule of thumb, DL should only be considered for datasets with at least ten thousand samples, though it is best suited to cases when datasets contain orders of magnitude more samples. +Though there are methods to increase the amount of training data such as data augmentation (in which existing data is slightly manipulated to yield "new" samples) and weak supervision (in which simple labeling heuristics are combined to produce noisy, probabilistic labels) [@arxiv:1605.07723v3], these methods cannot overcome a complete shortage of data. +As a rule of thumb, DL should only be considered for datasets with at least one thousand samples, though it is best suited to cases when datasets contain orders of magnitude more samples. Depending on the amount and the nature of the available data, as well as the task to be performed, deep learning may not always be able to outperform conventional methods. As an illustration, Rajkomar et al. [@doi:10.1038/s41746-018-0029-1] found that simpler baseline models achieved performance comparable with that of DL in a number of clinical prediction tasks using electronic health records, which may be a surprise to many. From 06c8dbe103adacacb7d2c4e7adf2526015aa3174 Mon Sep 17 00:00:00 2001 From: Benjamin Lee Date: Fri, 9 Oct 2020 17:16:17 -0400 Subject: [PATCH 08/21] Add a paragraph about computing resources required --- content/03.ml-concepts.md | 9 +++++++++ content/05.dl-complexities.md | 5 ++--- 2 files changed, 11 insertions(+), 3 deletions(-) diff --git a/content/03.ml-concepts.md b/content/03.ml-concepts.md index 57c5638d..522d38a4 100644 --- a/content/03.ml-concepts.md +++ b/content/03.ml-concepts.md @@ -12,6 +12,15 @@ For other areas which rely on manual data collection, there may not be enough da Though there are methods to increase the amount of training data such as data augmentation (in which existing data is slightly manipulated to yield "new" samples) and weak supervision (in which simple labeling heuristics are combined to produce noisy, probabilistic labels) [@arxiv:1605.07723v3], these methods cannot overcome a complete shortage of data. As a rule of thumb, DL should only be considered for datasets with at least one thousand samples, though it is best suited to cases when datasets contain orders of magnitude more samples. +Furthermore, training DL models can be very demanding, often requiring extensive computing infrastructure and patience to achieve state-of-the-art performance [@doi:10.1109/JPROC.2017.2761740]. +In some DL contexts, such as generating human-like text, state-of-the-art models have over one hundred billion parameters [@arxiv:2005.14165]. 
+Training such large models from scratch can be a costly and time-consuming undertaking [@arxiv:1906.02243]. +Luckily, most DL research in biology will not require nearly as much computation, and there are methods for reducing the amount of training required in some cases (described in [Tip 5](#architecture)). +Specialized hardware such as discrete graphics processing units (GPUs) or custom DL accelerators can dramatically reduce the time and cost required to train models, but this hardware is not universally accessible. +Currently, both GPU- and DL-optimized accelerator-equipped servers can be rented from cloud providers, though working with these servers adds additional cost and complexity. +As DL becomes more popular, DL-optimized accelerators are likely to be more broadly available (for example, recent-generation iPhones already have such hardware). +In contrast, traditional ML training can often be done on a laptop (or even a \$5 computer [@arxiv:1809.00238]) in seconds to minutes. + Depending on the amount and the nature of the available data, as well as the task to be performed, deep learning may not always be able to outperform conventional methods. As an illustration, Rajkomar et al. [@doi:10.1038/s41746-018-0029-1] found that simpler baseline models achieved performance comparable with that of DL in a number of clinical prediction tasks using electronic health records, which may be a surprise to many. Another example is provided by Koutsoukas et al., who benchmarked several traditional machine learning approaches against deep neural networks for modeling bioactivity data on moderately sized datasets [@doi:10.1186/s13321-017-0226-y]. diff --git a/content/05.dl-complexities.md b/content/05.dl-complexities.md index cf495e20..2dc3230f 100644 --- a/content/05.dl-complexities.md +++ b/content/05.dl-complexities.md @@ -3,15 +3,14 @@ Correctly training deep neural networks is a non-trivial process. There are many different options and potential pitfalls at every stage. To get good results, you must expect to train many networks with a range of different parameter and hyperparameter settings. -Deep learning can be very demanding, often requiring extensive computing infrastructure and patience to achieve state-of-the-art performance [@doi:10.1109/JPROC.2017.2761740]. The experimentation inherent to DL is often noisy (requiring repetition) and represents a significant organizational challenge. All code, random seeds, parameters, and results must be carefully corralled using general good coding practices (for example, version control [@doi:10.1371/journal.pcbi.1004947], continuous integration etc.) in order to be effective and interpretable. This organization is also key to being able to efficiently share and reproduce your work [@doi:10.1371/journal.pcbi.1003285; @arxiv:1810.08055] as well as to update your model as new data becomes available. One specific reproducibility pitfall that is often missed in deep learning applications is the default use of non-deterministic algorithms by CUDA/CuDNN backends when using GPUs. -Making this process reproducible is distinct from setting random seeds, which will primarily affect pseudorandom deterministic procedures such as shuffling and initialization, and requires explicitly specifying the use of deterministic algorithms in your DL library [@url:https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#reproducibility]. 
+Making this process reproducible is distinct from setting random seeds, which will primarily affect pseudorandom deterministic procedures such as shuffling and initialization, and requires explicitly specifying the use of deterministic algorithms in your DL library [@url:https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#reproducibility]. -Similar to [Tip 4](#baselines), try to start with a relatively small network and increase the size and complexity as needed to prevent wasting time and resources. +Similar to [Tip 4](#baselines), try to start with a relatively small network and increase the size and complexity as needed to prevent wasting time and resources. Beware of the seemingly trivial choices that are being made implicitly by default settings in your framework of choice e.g. choice of optimization algorithm (adaptive methods often lead to faster convergence during training but may lead to worse generalization performance on independent datasets [@url:https://papers.nips.cc/paper/7003-the-marginal-value-of-adaptive-gradient-methods-in-machine-learning]). These need to be carefully considered and their impacts evaluated (see [Tip 6](#hyperparameters)). From 9e5f4e430b8cf46ef3689c6264a3935be73454de Mon Sep 17 00:00:00 2001 From: Benjamin Lee Date: Sat, 10 Oct 2020 17:39:12 -0400 Subject: [PATCH 09/21] Add a paragraph on expertise required --- content/03.ml-concepts.md | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/content/03.ml-concepts.md b/content/03.ml-concepts.md index 522d38a4..639559ea 100644 --- a/content/03.ml-concepts.md +++ b/content/03.ml-concepts.md @@ -21,6 +21,18 @@ Currently, both GPU- and DL-optimized accelerator-equipped servers can be rented As DL becomes more popular, DL-optimized accelerators are likely to be more broadly available (for example, recent-generation iPhones already have such hardware). In contrast, traditional ML training can often be done on a laptop (or even a \$5 computer [@arxiv:1809.00238]) in seconds to minutes. +Beyond the necessity for greater data and computational capacity in DL, building and training DL models generally requires more expertise than traditional ML models. +Currently, there are several competing programming frameworks for DL such as Tensorflow [@arxiv:1603.04467] and PyTorch [@arxiv:1912.01703]. +These frameworks allow users to create and deploy entirely novel model architectures and are widely used in DL research as well as in indutry. +This flexibility combined with the rapid development of the DL field has resulted in large, complex frameworks that can be daunting to new users. +For readers new to software development but experienced in biology, gaining computational skills while interfacing with such complex industrial-grade tools can be a challenge. +An advantage of ML over DL is that currently there are more tools capable of automating the model selection and training process. +Automated ML (AutoML) tools such as TPOT [@doi:10.1007/978-3-319-31204-0_9], which is capable of using genetic programming to optimize ML pipelines, and Turi Create [@https://github.com/apple/turicreate], a task-oriented ML and DL framework which automatically tests multiple ML models when training, allow users to achieve competitive performance with only a few lines of code. +Luckily, there are efforts underway to reduce the expertise required to build and use DL models. 
+Indeed, both TPOT and Turi Create, as well as other tools such as AutoKeras [@arxiv:1806.10282], are capable of abstracting away much of the programming required for "standard" DL tasks. +Projects such as Keras [@https://keras.io], a high-level interface for TensorFlow, make it relatively straightforward to design and test custom DL architectures. +In the future, projects such as these are likely to bring DL experimentation within reach to even more researchers. + Depending on the amount and the nature of the available data, as well as the task to be performed, deep learning may not always be able to outperform conventional methods. As an illustration, Rajkomar et al. [@doi:10.1038/s41746-018-0029-1] found that simpler baseline models achieved performance comparable with that of DL in a number of clinical prediction tasks using electronic health records, which may be a surprise to many. Another example is provided by Koutsoukas et al., who benchmarked several traditional machine learning approaches against deep neural networks for modeling bioactivity data on moderately sized datasets [@doi:10.1186/s13321-017-0226-y]. From a8c90cd268a0b7eb977f13ba85f16b41da51ac23 Mon Sep 17 00:00:00 2001 From: Benjamin Lee Date: Sat, 10 Oct 2020 20:43:35 -0400 Subject: [PATCH 10/21] Add a paragraph on when to use DL --- content/03.ml-concepts.md | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/content/03.ml-concepts.md b/content/03.ml-concepts.md index 639559ea..4baf93db 100644 --- a/content/03.ml-concepts.md +++ b/content/03.ml-concepts.md @@ -33,8 +33,19 @@ Indeed, both TPOT and Turi Create, as well as other tools such as AutoKeras [@ar Projects such as Keras [@https://keras.io], a high-level interface for TensorFlow, make it relatively straightforward to design and test custom DL architectures. In the future, projects such as these are likely to bring DL experimentation within reach to even more researchers. -Depending on the amount and the nature of the available data, as well as the task to be performed, deep learning may not always be able to outperform conventional methods. +There are some types of problems in which using DL is strongly indicated over ML. +Assuming a sufficient quantity of quality data is available, applications such as computer vision and natural language processing are likely to benefit from DL. +Indeed, these areas were the first to see significant breakthroughs through the application of DL [@doi:10.1145/3065386] during the recent "DL revolution." +For example, Ferreira et al. used DL to recognize individual birds from images [@doi:10.1111/2041-210X.13436]. +This problem was historically difficult but, by combining automatic data collection using RFID tags with data augmentation and transfer learning (explained in [Tip 5](#architecture), the authors were able to use DL achieve 90% accuracy in several species. +Other areas include generative models, in which new samples are able to be created based on the training data, and reinforcement learning, in which agents are trained to interact with their environments. +In general, before using DL, investigate whether similar problems (including analogous ones in other domains) have been solved successfully using DL. + +Depending on the amount and the nature of the available data, as well as the task to be performed, DL may not always be able to outperform conventional methods. As an illustration, Rajkomar et al. 
[@doi:10.1038/s41746-018-0029-1] found that simpler baseline models achieved performance comparable with that of DL in a number of clinical prediction tasks using electronic health records, which may be a surprise to many.
Another example is provided by Koutsoukas et al., who benchmarked several traditional machine learning approaches against deep neural networks for modeling bioactivity data on moderately sized datasets [@doi:10.1186/s13321-017-0226-y].
The researchers found that while well tuned DL approaches generally tend to outperform conventional classifiers, simple methods such as Naive Bayes classification tend to outperform DL as the noise in the dataset increases.
Similarly, Chen et al. [@doi:10.1038/s41746-019-0122-0] tested DL and a variety of traditional ML methods such as logistic regression and random forests on five different clinical datasets, finding that the non-DL methods matched or exceeded the accuracy of the DL model in all cases while requiring an order of magnitude less training time.

DL is a tool and, like any other tool, must be used after consideration of its strengths and weaknesses for the problem at hand.
Once practitioners have settled upon DL as a potential solution, they should follow the scientific method and compare its performance to traditional methods, as we will see next.

From 06322b119c28d47638caab052c8d5812637aa98 Mon Sep 17 00:00:00 2001
From: Benjamin Lee
Date: Sat, 10 Oct 2020 22:00:16 -0400
Subject: [PATCH 11/21] Merge @chevrm suggestions for the opening paragraph

---
 content/03.ml-concepts.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/content/03.ml-concepts.md b/content/03.ml-concepts.md
index 4baf93db..f7cd0d3d 100644
--- a/content/03.ml-concepts.md
+++ b/content/03.ml-concepts.md
@@ -1,8 +1,8 @@
 ## Tip 1: Decide whether your problem is appropriate for deep learning {#appropriate}

-Given the impressive accomplishments of DL in recent years and the meteoric rise in publications which rely upon it may appear that DL is capable of anything.
-Indeed, it is, at least theoretically.
-Neural networks are universal function approximators, meaning that they are in principle capable of learning any function [@doi:10.1007/BF02551274; @tag:hornik-approximation].
+In recent years, the number of publications implementing DL in biology have risen tremendously.
+Given DL's usefulness across a range of scientific questions and data modalities, it may appear that it is capable of anything.
+Indeed, neural networks are universal function approximators, meaning that they are in principle capable of learning any function [@doi:10.1007/BF02551274; @tag:hornik-approximation].
 If DL is so powerful and popular, why would one ever not choose to use it?
@@ -35,7 +35,7 @@ In the future, projects such as these are likely to bring DL experimentation wit There are some types of problems in which using DL is strongly indicated over ML. Assuming a sufficient quantity of quality data is available, applications such as computer vision and natural language processing are likely to benefit from DL. -Indeed, these areas were the first to see significant breakthroughs through the application of DL [@doi:10.1145/3065386] during the recent "DL revolution." +In fact, these areas were the first to see significant breakthroughs through the application of DL [@doi:10.1145/3065386] during the recent "DL revolution." For example, Ferreira et al. used DL to recognize individual birds from images [@doi:10.1111/2041-210X.13436]. This problem was historically difficult but, by combining automatic data collection using RFID tags with data augmentation and transfer learning (explained in [Tip 5](#architecture), the authors were able to use DL achieve 90% accuracy in several species. Other areas include generative models, in which new samples are able to be created based on the training data, and reinforcement learning, in which agents are trained to interact with their environments. From c378eb6cce70b0daf8903504e1479474f831f244 Mon Sep 17 00:00:00 2001 From: Benjamin Lee Date: Sat, 10 Oct 2020 22:20:48 -0400 Subject: [PATCH 12/21] Reword second sentence --- content/03.ml-concepts.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/03.ml-concepts.md b/content/03.ml-concepts.md index f7cd0d3d..cdc4aee7 100644 --- a/content/03.ml-concepts.md +++ b/content/03.ml-concepts.md @@ -1,7 +1,7 @@ ## Tip 1: Decide whether your problem is appropriate for deep learning {#appropriate} In recent years, the number of publications implementing DL in biology have risen tremendously. -Given DL's usefulness across a range of scientific questions and data modalities, it may appear that it is capable of anything. +Given DL's usefulness across a range of scientific questions and data modalities, it may appear that it is a panacea for modeling problems. Indeed, neural networks are universal function approximators, meaning that they are in principle capable of learning any function [@doi:10.1007/BF02551274; @tag:hornik-approximation]. If DL is so powerful and popular, why would one ever not choose to use it? 
From 9a5df55ea7571f32e7b6722c29047ff05ba8f137 Mon Sep 17 00:00:00 2001 From: Benjamin Lee Date: Sun, 11 Oct 2020 22:34:26 -0400 Subject: [PATCH 13/21] Rename the file to reflect the new content --- content/{03.ml-concepts.md => 03.when-to-use-dl.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename content/{03.ml-concepts.md => 03.when-to-use-dl.md} (100%) diff --git a/content/03.ml-concepts.md b/content/03.when-to-use-dl.md similarity index 100% rename from content/03.ml-concepts.md rename to content/03.when-to-use-dl.md From 937cc0860ce0f479666ccd1a2e9eb8d8815f6492 Mon Sep 17 00:00:00 2001 From: Benjamin Lee Date: Mon, 12 Oct 2020 14:58:03 -0400 Subject: [PATCH 14/21] Rename tip 1 --- content/03.when-to-use-dl.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/03.when-to-use-dl.md b/content/03.when-to-use-dl.md index cdc4aee7..81694ac0 100644 --- a/content/03.when-to-use-dl.md +++ b/content/03.when-to-use-dl.md @@ -1,4 +1,4 @@ -## Tip 1: Decide whether your problem is appropriate for deep learning {#appropriate} +## Tip 1: Decide whether deep learning is appropriate for your problem {#appropriate} In recent years, the number of publications implementing DL in biology have risen tremendously. Given DL's usefulness across a range of scientific questions and data modalities, it may appear that it is a panacea for modeling problems. From 6c4f4d162c865a129fb69da1f057b25299946d8e Mon Sep 17 00:00:00 2001 From: Benjamin Lee Date: Mon, 12 Oct 2020 17:07:24 -0400 Subject: [PATCH 15/21] indutry -> industrial applications (@siminab) --- content/03.when-to-use-dl.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/03.when-to-use-dl.md b/content/03.when-to-use-dl.md index 81694ac0..81ed1f22 100644 --- a/content/03.when-to-use-dl.md +++ b/content/03.when-to-use-dl.md @@ -23,7 +23,7 @@ In contrast, traditional ML training can often be done on a laptop (or even a \$ Beyond the necessity for greater data and computational capacity in DL, building and training DL models generally requires more expertise than traditional ML models. Currently, there are several competing programming frameworks for DL such as Tensorflow [@arxiv:1603.04467] and PyTorch [@arxiv:1912.01703]. -These frameworks allow users to create and deploy entirely novel model architectures and are widely used in DL research as well as in indutry. +These frameworks allow users to create and deploy entirely novel model architectures and are widely used in DL research as well as in industrial applications. This flexibility combined with the rapid development of the DL field has resulted in large, complex frameworks that can be daunting to new users. For readers new to software development but experienced in biology, gaining computational skills while interfacing with such complex industrial-grade tools can be a challenge. An advantage of ML over DL is that currently there are more tools capable of automating the model selection and training process. 
From 4f53912d38afab893f893798b4c8fb4849844ddc Mon Sep 17 00:00:00 2001 From: Benjamin Lee Date: Mon, 12 Oct 2020 19:27:45 -0400 Subject: [PATCH 16/21] Apply suggestions from @ajlee21 for tip 1 Co-authored-by: Alexandra Lee --- content/03.when-to-use-dl.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/content/03.when-to-use-dl.md b/content/03.when-to-use-dl.md index 81ed1f22..b272badd 100644 --- a/content/03.when-to-use-dl.md +++ b/content/03.when-to-use-dl.md @@ -9,7 +9,7 @@ The reason is simple: DL is not suited to every situation in reality. Training DL models requires a significant amount of data, computing power, and expertise. In some areas of biology where data collection is thoroughly automated, such as DNA sequencing, large amounts of quality data may be available. For other areas which rely on manual data collection, there may not be enough data to effectively train models. -Though there are methods to increase the amount of training data such as data augmentation (in which existing data is slightly manipulated to yield "new" samples) and weak supervision (in which simple labeling heuristics are combined to produce noisy, probabilistic labels) [@arxiv:1605.07723v3], these methods cannot overcome a complete shortage of data. +Though there are methods to increase the amount of training data, such as data augmentation (in which existing data is slightly manipulated to yield "new" samples) and weak supervision (in which simple labeling heuristics are combined to produce noisy, probabilistic labels) [@arxiv:1605.07723v3], these methods cannot overcome a complete shortage of data. As a rule of thumb, DL should only be considered for datasets with at least one thousand samples, though it is best suited to cases when datasets contain orders of magnitude more samples. Furthermore, training DL models can be very demanding, often requiring extensive computing infrastructure and patience to achieve state-of-the-art performance [@doi:10.1109/JPROC.2017.2761740]. @@ -37,7 +37,7 @@ There are some types of problems in which using DL is strongly indicated over ML Assuming a sufficient quantity of quality data is available, applications such as computer vision and natural language processing are likely to benefit from DL. In fact, these areas were the first to see significant breakthroughs through the application of DL [@doi:10.1145/3065386] during the recent "DL revolution." For example, Ferreira et al. used DL to recognize individual birds from images [@doi:10.1111/2041-210X.13436]. -This problem was historically difficult but, by combining automatic data collection using RFID tags with data augmentation and transfer learning (explained in [Tip 5](#architecture), the authors were able to use DL achieve 90% accuracy in several species. +This problem was historically difficult but, by combining automatic data collection using RFID tags with data augmentation and transfer learning (explained in [Tip 5](#architecture)), the authors were able to use DL achieve 90% accuracy in several species. Other areas include generative models, in which new samples are able to be created based on the training data, and reinforcement learning, in which agents are trained to interact with their environments. In general, before using DL, investigate whether similar problems (including analogous ones in other domains) have been solved successfully using DL. 
@@ -47,5 +47,5 @@ Another example is provided by Koutsoukas et al., who benchmarked several tradit
 The researchers found that while well tuned DL approaches generally tend to outperform conventional classifiers, simple methods such as Naive Bayes classification tend to outperform DL as the noise in the dataset increases.
 Similarly, Chen et al. [@doi:10.1038/s41746-019-0122-0] tested DL and a variety of traditional ML methods such as logistic regression and random forests on five different clinical datasets, finding that the non-DL methods matched or exceeded the accuracy of the DL model in all cases while requiring an order of magnitude less training time.

-DL is a tool and, like any other tool, must be used after consideration of its strengths and weaknesses for the problem at hand.
+In conclusion, deep learning is a tool and, like any other tool, must be used after consideration of its strengths and weaknesses for the problem at hand.
 Once practitioners have settled upon DL as a potential solution, they should follow the scientific method and compare its performance to traditional methods, as we will see next.

From 9313f88d5f4698677ec81bb596779e1895a8f90e Mon Sep 17 00:00:00 2001
From: Benjamin Lee
Date: Mon, 12 Oct 2020 19:42:24 -0400
Subject: [PATCH 17/21] ML -> machine learning and DL -> deep learning

---
 content/03.when-to-use-dl.md | 68 ++++++++++++++++++------------------
 1 file changed, 34 insertions(+), 34 deletions(-)

diff --git a/content/03.when-to-use-dl.md b/content/03.when-to-use-dl.md
index b272badd..292351d6 100644
--- a/content/03.when-to-use-dl.md
+++ b/content/03.when-to-use-dl.md
@@ -1,51 +1,51 @@
## Tip 1: Decide whether deep learning is appropriate for your problem {#appropriate}

In recent years, the number of publications implementing deep learning in biology has risen tremendously.
Given deep learning's usefulness across a range of scientific questions and data modalities, it may appear that it is a panacea for modeling problems.
Indeed, neural networks are universal function approximators, meaning that they are in principle capable of learning any function [@doi:10.1007/BF02551274; @tag:hornik-approximation].
If deep learning is so powerful and popular, why would one ever not choose to use it?

The reason is simple: deep learning is not suited to every situation in reality.
Training deep learning models requires a significant amount of data, computing power, and expertise.
In some areas of biology where data collection is thoroughly automated, such as DNA sequencing, large amounts of quality data may be available.
For other areas which rely on manual data collection, there may not be enough data to effectively train models.
Though there are methods to increase the amount of training data, such as data augmentation (in which existing data is slightly manipulated to yield "new" samples) and weak supervision (in which simple labeling heuristics are combined to produce noisy, probabilistic labels) [@arxiv:1605.07723v3], these methods cannot overcome a complete shortage of data.
As a rule of thumb, deep learning should only be considered for datasets with at least one thousand samples, though it is best suited to cases when datasets contain orders of magnitude more samples.

Furthermore, training deep learning models can be very demanding, often requiring extensive computing infrastructure and patience to achieve state-of-the-art performance [@doi:10.1109/JPROC.2017.2761740].
In some deep learning contexts, such as generating human-like text, state-of-the-art models have over one hundred billion parameters [@arxiv:2005.14165].
Training such large models from scratch can be a costly and time-consuming undertaking [@arxiv:1906.02243].
Luckily, most deep learning research in biology will not require nearly as much computation, and there are methods for reducing the amount of training required in some cases (described in [Tip 5](#architecture)).
Specialized hardware such as discrete graphics processing units (GPUs) or custom deep learning accelerators can dramatically reduce the time and cost required to train models, but this hardware is not universally accessible.
Currently, both GPU- and deep learning-optimized accelerator-equipped servers can be rented from cloud providers, though working with these servers adds cost and complexity.
As deep learning becomes more popular, these accelerators are likely to be more broadly available (for example, recent-generation iPhones already have such hardware).
In contrast, traditional machine learning training can often be done on a laptop (or even a \$5 computer [@arxiv:1809.00238]) in seconds to minutes.

Beyond the necessity for greater data and computational capacity in deep learning, building and training deep learning models generally requires more expertise than working with traditional machine learning models.
Currently, there are several competing programming frameworks for deep learning such as TensorFlow [@arxiv:1603.04467] and PyTorch [@arxiv:1912.01703].
These frameworks allow users to create and deploy entirely novel model architectures and are widely used in deep learning research as well as in industrial applications.
-This flexibility combined with the rapid development of the DL field has resulted in large, complex frameworks that can be daunting to new users.
+Beyond the necessity for greater data and computational capacity in deep learning, building and training deep learning models generally requires more expertise than traditional machine learning models do.
+Currently, there are several competing programming frameworks for deep learning such as TensorFlow [@arxiv:1603.04467] and PyTorch [@arxiv:1912.01703].
+These frameworks allow users to create and deploy entirely novel model architectures and are widely used in deep learning research as well as in industrial applications.
+This flexibility, combined with the rapid development of the deep learning field, has resulted in large, complex frameworks that can be daunting to new users.
 For readers new to software development but experienced in biology, gaining computational skills while interfacing with such complex industrial-grade tools can be a challenge.

-An advantage of ML over DL is that currently there are more tools capable of automating the model selection and training process.
-Automated ML (AutoML) tools such as TPOT [@doi:10.1007/978-3-319-31204-0_9], which is capable of using genetic programming to optimize ML pipelines, and Turi Create [@https://github.com/apple/turicreate], a task-oriented ML and DL framework which automatically tests multiple ML models when training, allow users to achieve competitive performance with only a few lines of code.
-Luckily, there are efforts underway to reduce the expertise required to build and use DL models.
-Indeed, both TPOT and Turi Create, as well as other tools such as AutoKeras [@arxiv:1806.10282], are capable of abstracting away much of the programming required for "standard" DL tasks.
-Projects such as Keras [@https://keras.io], a high-level interface for TensorFlow, make it relatively straightforward to design and test custom DL architectures.
-In the future, projects such as these are likely to bring DL experimentation within reach to even more researchers.
+An advantage of traditional machine learning over deep learning is that currently there are more tools capable of automating the model selection and training process.
+Automated machine learning (AutoML) tools such as TPOT [@doi:10.1007/978-3-319-31204-0_9], which is capable of using genetic programming to optimize machine learning pipelines, and Turi Create [@https://github.com/apple/turicreate], a task-oriented machine learning and deep learning framework that automatically tests multiple machine learning models when training, allow users to achieve competitive performance with only a few lines of code.
+Luckily, there are efforts underway to reduce the expertise required to build and use deep learning models.
+Indeed, both TPOT and Turi Create, as well as other tools such as AutoKeras [@arxiv:1806.10282], are capable of abstracting away much of the programming required for "standard" deep learning tasks.
+Projects such as Keras [@https://keras.io], a high-level interface for TensorFlow, make it relatively straightforward to design and test custom deep learning architectures.
+In the future, projects such as these are likely to bring deep learning experimentation within reach of even more researchers.

-There are some types of problems in which using DL is strongly indicated over ML.
-Assuming a sufficient quantity of quality data is available, applications such as computer vision and natural language processing are likely to benefit from DL.
-In fact, these areas were the first to see significant breakthroughs through the application of DL [@doi:10.1145/3065386] during the recent "DL revolution."
-For example, Ferreira et al. used DL to recognize individual birds from images [@doi:10.1111/2041-210X.13436].
-This problem was historically difficult but, by combining automatic data collection using RFID tags with data augmentation and transfer learning (explained in [Tip 5](#architecture)), the authors were able to use DL achieve 90% accuracy in several species.
+There are some types of problems in which using deep learning is strongly indicated over traditional machine learning.
+Assuming a sufficient quantity of quality data is available, applications such as computer vision and natural language processing are likely to benefit from deep learning.
+In fact, these areas were the first to see significant breakthroughs through the application of deep learning [@doi:10.1145/3065386] during the recent "deep learning revolution."
+For example, Ferreira et al. used deep learning to recognize individual birds from images [@doi:10.1111/2041-210X.13436].
+This problem was historically difficult but, by combining automatic data collection using RFID tags with data augmentation and transfer learning (explained in [Tip 5](#architecture)), the authors were able to use deep learning to achieve 90% accuracy in several species.
 Other areas include generative models, in which new samples are able to be created based on the training data, and reinforcement learning, in which agents are trained to interact with their environments.
-In general, before using DL, investigate whether similar problems (including analogous ones in other domains) have been solved successfully using DL.
+In general, before using deep learning, investigate whether similar problems (including analogous ones in other domains) have been solved successfully using deep learning.

-Depending on the amount and the nature of the available data, as well as the task to be performed, DL may not always be able to outperform conventional methods.
-As an illustration, Rajkomar et al. [@doi:10.1038/s41746-018-0029-1] found that simpler baseline models achieved performance comparable with that of DL in a number of clinical prediction tasks using electronic health records, which may be a surprise to many.
+Depending on the amount and the nature of the available data, as well as the task to be performed, deep learning may not always be able to outperform conventional methods.
+As an illustration, Rajkomar et al. [@doi:10.1038/s41746-018-0029-1] found that simpler baseline models achieved performance comparable with that of deep learning in a number of clinical prediction tasks using electronic health records, which may be a surprise to many.
 Another example is provided by Koutsoukas et al., who benchmarked several traditional machine learning approaches against deep neural networks for modeling bioactivity data on moderately sized datasets [@doi:10.1186/s13321-017-0226-y].
-The researchers found that while well tuned DL approaches generally tend to outperform conventional classifiers, simple methods such as Naive Bayes classification tend to outperform DL as the noise in the dataset increases.
-Similarly, Chen et al. 
[@doi:s41746-019-0122-0] tested DL and a variety of traditional ML methods such as logistic regression and random forests on five different clinical datasets, finding that the non-DL methods matched or exceeded the accuracy of the DL model in all cases while requiring an order of magnitude less training time.
+The researchers found that while well-tuned deep learning approaches generally tend to outperform conventional classifiers, simple methods such as Naive Bayes classification tend to outperform deep learning as the noise in the dataset increases.
+Similarly, Chen et al. [@doi:10.1038/s41746-019-0122-0] tested deep learning and a variety of traditional machine learning methods such as logistic regression and random forests on five different clinical datasets, finding that the non-deep-learning methods matched or exceeded the accuracy of the deep learning model in all cases while requiring an order of magnitude less training time.
 In conclusion, deep learning is a tool and, like any other tool, must be used after consideration of its strengths and weaknesses for the problem at hand.
-Once settled upon DL as a potential solution, practitioners should follow the scientific method and compare its performance to traditional methods, as we will see next.
+Once practitioners have settled upon deep learning as a potential solution, they should follow the scientific method and compare its performance to traditional methods, as we will see next.

From 64da0d01c62c80359814e2cbf54170d15eaa6468 Mon Sep 17 00:00:00 2001
From: Benjamin Lee
Date: Mon, 12 Oct 2020 20:50:21 -0400
Subject: [PATCH 18/21] Add mention of preprocessing after splitting (closes #265)

---
 content/09.overfitting.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/content/09.overfitting.md b/content/09.overfitting.md
index e3c3678f..e91eecb4 100644
--- a/content/09.overfitting.md
+++ b/content/09.overfitting.md
@@ -11,6 +11,7 @@ To continue the student analogy, a smarter student has greater potential for mem
 To evaluate deep supervised learning models, they should be trained, tuned, and tested on non-overlapping datasets.
 The data used for testing should be locked and only used one-time for evaluating the final model after all tuning steps are completed.
 Using a test set more than once will lead to biased estimates of the generalization performance [@arxiv:1811.12808; @doi:10.1162/089976698300017197].
+When dataset-dependent preprocessing methods such as quantile normalization (a common approach when analyzing gene-expression data) or standard scaling (in which each feature is set to have a mean of zero and a variance of one) are applied, they must be done after splitting the data or the resulting datasets may not be truly independent.
 Additionally, many conventional metrics for classification (e.g. area under the receiver operating characteristic curve or AUROC) have limited utility in cases of extreme class imbalance [@pmid:25738806].
 Model performance should be evaluated with a carefully picked panel of relevant metrics that make minimal assumptions about the composition of the testing data [@doi:10.1021/acs.molpharmaceut.7b00578], with particular consideration given to metrics that are most directly applicable to the task at hand.
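As a minimal sketch of the preprocessing-after-splitting workflow that the patch above describes (the built-in dataset and the choice of standard scaling are illustrative assumptions, not part of the manuscript):

```python
# Derive dataset-dependent preprocessing parameters from the training split only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# A built-in dataset stands in for a real biological dataset.
X, y = load_breast_cancer(return_X_y=True)

# Split first, before any dataset-dependent preprocessing is applied.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # mean and variance estimated from the training data only
X_test = scaler.transform(X_test)        # the same parameters are reused; the test set never influences them
```

A tuning (validation) split, if used, would be transformed with the same training-derived parameters.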
From fbd10a0c1be7d4acb8f1d071ec6bcbe5f84f2b83 Mon Sep 17 00:00:00 2001
From: Benjamin Lee
Date: Mon, 12 Oct 2020 22:58:42 -0400
Subject: [PATCH 19/21] Mention that DL generally requires more than can be done on a single computer

---
 content/03.when-to-use-dl.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/03.when-to-use-dl.md b/content/03.when-to-use-dl.md
index 292351d6..74951592 100644
--- a/content/03.when-to-use-dl.md
+++ b/content/03.when-to-use-dl.md
@@ -15,7 +15,7 @@ As a rule of thumb, deep learning should only be considered for datasets with at
 Furthermore, training deep learning models can be very demanding, often requiring extensive computing infrastructure and patience to achieve state-of-the-art performance [@doi:10.1109/JPROC.2017.2761740].
 In some deep learning contexts, such as generating human-like text, state-of-the-art models have over one hundred billion parameters [@arxiv:2005.14165].
 Training such large models from scratch can be a costly and time-consuming undertaking [@arxiv:1906.02243].
-Luckily, most deep learning research in biology will not require nearly as much computation, and there are methods for reducing the amount of training required in some cases (described in [Tip 5](#architecture)).
+Luckily, most deep learning research in biology will not require nearly as much computation, though it usually requires more than can feasibly be done on an individual consumer-grade device.
 Specialized hardware such as discrete graphics processing units (GPUs) or custom deep learning accelerators can dramatically reduce the time and cost required to train models, but this hardware is not universally accessible.
 Currently, both GPU- and deep learning-optimized accelerator-equipped servers can be rented from cloud providers, though working with these servers introduces additional cost and complexity.
 As deep learning becomes more popular, these accelerators are likely to be more broadly available (for example, recent-generation iPhones already have such hardware).

From a3dd03417cea01b8509d02abda07afc03e40bb73 Mon Sep 17 00:00:00 2001
From: Benjamin Lee
Date: Tue, 13 Oct 2020 12:41:27 -0400
Subject: [PATCH 20/21] Switch sentence for overfitting from preprocessing the dataset together

---
 content/09.overfitting.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/09.overfitting.md b/content/09.overfitting.md
index e91eecb4..c29014c4 100644
--- a/content/09.overfitting.md
+++ b/content/09.overfitting.md
@@ -11,7 +11,7 @@ To continue the student analogy, a smarter student has greater potential for mem
 To evaluate deep supervised learning models, they should be trained, tuned, and tested on non-overlapping datasets.
 The data used for testing should be locked and only used one-time for evaluating the final model after all tuning steps are completed.
 Using a test set more than once will lead to biased estimates of the generalization performance [@arxiv:1811.12808; @doi:10.1162/089976698300017197].
-When dataset-dependent preprocessing methods such as quantile normalization (a common approach when analyzing gene-expression data) or standard scaling (in which each feature is set to have a mean of zero and a variance of one) are applied, they must be done after splitting the data or the resulting datasets may not be truly independent.
+While transformation and normalization procedures need to be applied equally to all datasets, the parameters required for such procedures (for example, quantile normalization, a common standardization method when analyzing gene-expression data) should be derived only from the training data, not from the tuning and test data, to keep the latter two independent.
 Additionally, many conventional metrics for classification (e.g. area under the receiver operating characteristic curve or AUROC) have limited utility in cases of extreme class imbalance [@pmid:25738806].
 Model performance should be evaluated with a carefully picked panel of relevant metrics that make minimal assumptions about the composition of the testing data [@doi:10.1021/acs.molpharmaceut.7b00578], with particular consideration given to metrics that are most directly applicable to the task at hand.

From e41ac26488d98c3a1d5a3eb8abd87fb0d4f96597 Mon Sep 17 00:00:00 2001
From: Benjamin Lee
Date: Tue, 13 Oct 2020 14:16:33 -0400
Subject: [PATCH 21/21] Rephrase and add citation about the minimum data required for DL

---
 content/03.when-to-use-dl.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/03.when-to-use-dl.md b/content/03.when-to-use-dl.md
index 74951592..5cd4be3c 100644
--- a/content/03.when-to-use-dl.md
+++ b/content/03.when-to-use-dl.md
@@ -10,7 +10,7 @@ Training deep learning models requires a significant amount of data, computing p
 In some areas of biology where data collection is thoroughly automated, such as DNA sequencing, large amounts of quality data may be available.
 For other areas which rely on manual data collection, there may not be enough data to effectively train models.
 Though there are methods to increase the amount of training data, such as data augmentation (in which existing data is slightly manipulated to yield "new" samples) and weak supervision (in which simple labeling heuristics are combined to produce noisy, probabilistic labels) [@arxiv:1605.07723v3], these methods cannot overcome a complete shortage of data.
-As a rule of thumb, deep learning should only be considered for datasets with at least one thousand samples, though it is best suited to cases when datasets contain orders of magnitude more samples.
+In the context of supervised classification, deep learning should be considered for datasets with at least one hundred samples per class [@arxiv:1511.06348] as a rule of thumb, though in all cases it is best suited to datasets that contain orders of magnitude more samples.

 Furthermore, training deep learning models can be very demanding, often requiring extensive computing infrastructure and patience to achieve state-of-the-art performance [@doi:10.1109/JPROC.2017.2761740].
 In some deep learning contexts, such as generating human-like text, state-of-the-art models have over one hundred billion parameters [@arxiv:2005.14165].
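To make concrete the earlier claim that AutoML tools such as TPOT allow users to achieve competitive performance with only a few lines of code, here is a minimal sketch; the built-in digits dataset and the small search budget are illustrative assumptions, not part of the manuscript:

```python
# Minimal AutoML sketch with TPOT: genetic programming searches over
# scikit-learn pipelines and keeps the best-performing one.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

# A built-in dataset stands in for a real biological dataset.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=20, random_state=42, verbosity=2)
tpot.fit(X_train, y_train)         # evolves and cross-validates candidate pipelines
print(tpot.score(X_test, y_test))  # accuracy of the best pipeline on held-out data
tpot.export("best_pipeline.py")    # writes the winning pipeline as standalone Python code
```

Larger values of `generations` and `population_size` search more thoroughly at the cost of longer runtimes.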