From 1ff716279b2ae8156de6a573041169a632f122df Mon Sep 17 00:00:00 2001
From: Benjamin Lee
Date: Tue, 15 Sep 2020 21:47:50 -0400
Subject: [PATCH] Close #202

---
 content/09.overfitting.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/content/09.overfitting.md b/content/09.overfitting.md
index a86dd138..6de24e92 100644
--- a/content/09.overfitting.md
+++ b/content/09.overfitting.md
@@ -16,9 +16,11 @@ Additionally, there are a variety of techniques to reduce overfitting during tra
 Another way, as described by Chuang and Keiser, is to identify the baseline level of memorization of the network by training on the data with the labels randomly shuffled and to see if the model performs better on the actual data [@doi:10.1021/acschembio.8b00881].
 If the model performs no better on real data than randomly scrambled data, then the performance of the model can be attributed to overfitting.
-Additionally, one must be sure that their dataset is not skewed or biased, such as by having confounding and scientifically irrelevant variables that the model can pick up on [@doi:10.1371/journal.pmed.1002683].
-In this case, simply holding out test data is insufficient.
+Additionally, one must be sure that their dataset is not skewed or biased, such as by having confounding and scientifically irrelevant variables that the model can pick up on.
+For example, a DL model for pneumonia detection in chest X-rays performed well but failed to generalize to outside hospitals because it was able to detect which hospital the image came from and adjust its predictions accordingly [@doi:10.1371/journal.pmed.1002683].
+Similarly, when dealing with sequence data, holding out data that are evolutionarily related or share structural homology to the training data can result in overfitting.
+In these cases, simply holding out test data is insufficient.
 The best remedy for confounding variables is to [know your data](#know-your-problem) and to test your model on truly independent data.
-In essence, split your data into training, tuning, and single-use testing sets to assess the performance of the model on data it truly has not seen before.
+In essence, split your data into training, tuning, and single-use testing sets to assess the performance of the model on data it truly has not seen before. Additionally, be cognizant of the danger of skewed or biased data artificially inflating accuracy.
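
As a minimal sketch of the shuffled-label memorization baseline discussed in the patched text (not part of the patch or the cited papers), the comparison could look like the following in Python. The synthetic `X`/`y` data, the `MLPClassifier` architecture, and the hyperparameters are all placeholder assumptions chosen only for illustration; in practice you would substitute your own model and dataset.

```python
# Sketch: estimate a model's memorization baseline by training the same
# architecture once on the true labels and once on randomly shuffled labels.
# Data, model, and hyperparameters below are placeholders, not from the manuscript.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Placeholder feature matrix and binary labels standing in for real data.
X = rng.normal(size=(1000, 50))
y = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Model trained on the true labels.
real_model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
real_model.fit(X_train, y_train)
real_acc = accuracy_score(y_test, real_model.predict(X_test))

# Same architecture trained on randomly permuted labels: any apparent skill
# here reflects memorization rather than real signal.
shuffled_labels = rng.permutation(y_train)
null_model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
null_model.fit(X_train, shuffled_labels)
null_acc = accuracy_score(y_test, null_model.predict(X_test))

print(f"accuracy with real labels:     {real_acc:.3f}")
print(f"accuracy with shuffled labels: {null_acc:.3f}")
# If the two test accuracies are comparable, performance on the real labels
# cannot be distinguished from memorization/overfitting.
```

The same comparison can of course be run with any framework; the essential point is that the two runs differ only in whether the training labels were permuted.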