
Commit

Close #202
Benjamin-Lee committed Sep 16, 2020
1 parent 9d6e87d commit 1ff7162
Showing 1 changed file with 5 additions and 3 deletions.
8 changes: 5 additions & 3 deletions content/09.overfitting.md
@@ -16,9 +16,11 @@ Additionally, there are a variety of techniques to reduce overfitting during tra
Another way, as described by Chuang and Keiser, is to identify the baseline level of memorization of the network by training on the data with the labels randomly shuffled and to see if the model performs better on the actual data [@doi:10.1021/acschembio.8b00881].
If the model performs no better on the real data than on the randomly scrambled data, then its performance can be attributed to overfitting.
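
A minimal sketch of this comparison, assuming scikit-learn; the `MLPClassifier`, the synthetic `make_classification` data, and five-fold cross-validation are illustrative placeholders rather than the exact setup used by Chuang and Keiser:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Hypothetical stand-in data; in practice, use your own features and labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)

# Held-out performance with the real labels.
real_score = cross_val_score(model, X, y, cv=5).mean()

# Baseline level of memorization: identical setup, labels randomly shuffled.
rng = np.random.default_rng(0)
y_shuffled = rng.permutation(y)
shuffled_score = cross_val_score(model, X, y_shuffled, cv=5).mean()

print(f"real labels:     {real_score:.3f}")
print(f"shuffled labels: {shuffled_score:.3f}")
# If the shuffled-label score sits well above chance, or the two scores are
# comparable, the apparent performance likely reflects memorization or
# leakage rather than genuine signal.
```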

Additionally, one must be sure that their dataset is not skewed or biased, such as by having confounding and scientifically irrelevant variables that the model can pick up on [@doi:10.1371/journal.pmed.1002683].
In this case, simply holding out test data is insufficient.
Additionally, one must make sure that the dataset is not skewed or biased, such as by having confounding, scientifically irrelevant variables that the model can pick up on.
For example, a DL model for pneumonia detection in chest X-rays performed well but failed to generalize to outside hospitals because it had learned to detect which hospital an image came from and to adjust its predictions accordingly [@doi:10.1371/journal.pmed.1002683].
Similarly, when dealing with sequence data, holding out test data that are evolutionarily related to or share structural homology with the training data can mask overfitting.
In these cases, simply holding out test data is insufficient.
The best remedy for confounding variables is to [know your data](#know-your-problem) and to test your model on truly independent data.
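
For the sequence case, one way to construct such an independent test set is to split by homology cluster rather than by individual sequence, so that related sequences never straddle the train/test boundary. The sketch below is illustrative only: it assumes scikit-learn and a hypothetical `cluster_ids` array standing in for the output of an external clustering tool such as CD-HIT or MMseqs2.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical inputs: one row of features and one label per sequence, plus a
# cluster assignment per sequence from an external homology-clustering step
# (e.g., CD-HIT or MMseqs2).
rng = np.random.default_rng(0)
n_sequences = 1000
X = rng.normal(size=(n_sequences, 64))                 # sequence-derived features
y = rng.integers(0, 2, size=n_sequences)               # labels
cluster_ids = rng.integers(0, 120, size=n_sequences)   # homology clusters

# Keep every member of a cluster on the same side of the split, so the test
# set is not populated with near-duplicates of training sequences.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=cluster_ids))

assert set(cluster_ids[train_idx]).isdisjoint(cluster_ids[test_idx])
print(f"{len(train_idx)} training sequences, {len(test_idx)} test sequences")
```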

In essence, split your data into training, tuning, and single-use testing sets to assess the performance of the model on data it truly has not seen before.
Additionally, be cognizant of the danger of skewed or biased data artificially inflating accuracy.
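
A minimal sketch of such a three-way split, assuming scikit-learn and generic arrays `X` and `y`; the roughly 70/15/15 proportions are arbitrary placeholders, and the test set should be evaluated exactly once, after all modeling choices are final.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: replace with your own feature matrix and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))
y = rng.integers(0, 2, size=1000)

# First carve off the single-use test set (15% here, an arbitrary choice).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0, stratify=y
)

# Then split the remainder into training and tuning (validation) sets.
X_train, X_tune, y_train, y_tune = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, random_state=0, stratify=y_rest
)

# Train on the training set, pick hyperparameters on the tuning set, and
# touch the test set only once, at the very end.
print(len(X_train), len(X_tune), len(X_test))
```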
