
Processing before splitting can cause overfitting #265

Closed
Benjamin-Lee opened this issue Oct 12, 2020 · 5 comments · Fixed by #241

Comments

@Benjamin-Lee
Owner

Not sure it's worth mentioning that if a single dataset is split into two or three parts for training, tuning, and testing, and preprocessing (like quantile normalization) is applied before this split, then the resulting datasets may not be truly independent.

Originally posted by @SiminaB in #241 (comment)

@Benjamin-Lee Benjamin-Lee linked a pull request Oct 13, 2020 that will close this issue
@rasbt
Collaborator

rasbt commented Oct 13, 2020

Yeah, we are getting into territory where it's a bit tricky to balance covering the necessary details against keeping the article readable, concise, and simple enough for beginners to follow.

We could maybe summarize this issue, and related ones, in a sentence like

"While transformation and normalization need to be applied equally to all datasets, the parameters required for such transformation and normalization procedures (for example, the dataset minimum and maximum values for 0-1 normalization, a.k.a. min-max scaling) should only be derived from the training data, not the tuning or test data, to keep the latter two independent."

@Benjamin-Lee
Owner Author

The current sentence addressing the topic is:

When dataset-dependent preprocessing methods such as quantile normalization or standard scaling (in which each feature is set to have a mean of zero and a variance of one) are applied, they must be done after splitting the data, or the resulting datasets may not be truly independent.

I prefer your sentence in general, though I have a few suggestions. What about something like:

While transformation and normalization procedures need to be applied equally to all datasets, the parameters required for such procedures (for example, quantile normalization, a common standardization method when analyzing gene-expression data) should only be derived from training data, not tuning and test data, to keep the latter two independent.

Ideally, we could preserve the example of quantile normalization, which is a more biological example.
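To make the split-then-scale order concrete, here is a minimal sketch (not from the manuscript; function names are illustrative) using min-max scaling as the dataset-dependent preprocessing step. The scaling parameters are derived from the training split only and then applied unchanged to the held-out data, so no information leaks from the test set:

```python
def fit_min_max(train):
    """Learn per-feature min and max from the training split ONLY."""
    cols = list(zip(*train))
    return [min(c) for c in cols], [max(c) for c in cols]

def apply_min_max(data, mins, maxs):
    """Scale data using parameters learned on the training split."""
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, mins, maxs)]
        for row in data
    ]

# Illustrative toy data: split FIRST, then fit the transformation.
train = [[1.0, 10.0], [3.0, 30.0]]
test = [[2.0, 50.0]]  # second feature exceeds the training max

mins, maxs = fit_min_max(train)
print(apply_min_max(train, mins, maxs))  # [[0.0, 0.0], [1.0, 1.0]]
print(apply_min_max(test, mins, maxs))   # [[0.5, 2.0]]
```

Note that the held-out values can legitimately fall outside [0, 1] here; fitting the min and max on the pooled data instead would hide that and couple the splits.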

@rasbt
Collaborator

rasbt commented Oct 13, 2020

I like your edit including the more biologically relevant context.

@Benjamin-Lee
Owner Author

OK, I'll put that into the PR instead of what I've got now. Thanks!

@rasbt
Collaborator

rasbt commented Oct 13, 2020

Perfect, thanks! I am planning to go over the manuscript once the midterms are over and some of the PRs have been merged. Right now, I don't want to double-edit things.
