
Processing before splitting can cause overfitting #265

Closed
Benjamin-Lee opened this issue Oct 12, 2020 · 5 comments · Fixed by #241

Comments

@Benjamin-Lee
Owner

Not sure it's worth mentioning that if a single dataset is split into two or three parts for training, tuning, and testing, and preprocessing (like quantile normalization) is applied before this split, then the resulting datasets may not be truly independent.

Originally posted by @SiminaB in #241 (comment)

@Benjamin-Lee Benjamin-Lee linked a pull request Oct 13, 2020 that will close this issue
@rasbt
Collaborator

rasbt commented Oct 13, 2020

Yeah, we are getting into territory where it's a bit tricky to balance covering the necessary details against keeping the article readable, concise, and simple enough for beginners to follow.

We could maybe summarize this issue, and related ones, in a sentence like

"While transformation and normalization need to be applied equally to all datasets, the parameters required for such transformation and normalization procedures (for example, the dataset minimum and maximum values for 0-1 normalization, a.k.a. min-max scaling) should only be derived from the training data, not the tuning or test data, to keep the latter two independent."

@Benjamin-Lee
Owner Author

The current sentence addressing the topic is:

When dataset-dependent preprocessing methods such as quantile normalization or standard scaling (in which each feature is set to have a mean of zero and a variance of one) are applied, they must be done after splitting the data, or the resulting datasets may not be truly independent.

I prefer your sentence in general, though I have a few suggestions. What about something like:

While transformation and normalization procedures need to be applied equally to all datasets, the parameters required for such procedures (for example, quantile normalization, a common standardization method when analyzing gene-expression data) should only be derived from training data, not tuning and test data, to keep the latter two independent.

Ideally, we could preserve the example of quantile normalization, which is a more biological example.
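To make the split-then-scale order concrete, here is a minimal sketch (not from the manuscript; function names are illustrative) using min-max scaling as the dataset-dependent preprocessing step. The scaling parameters are derived from the training split only and then applied unchanged to the held-out data, so no information leaks from the test set:

```python
def fit_min_max(train):
    """Learn per-feature min and max from the training split ONLY."""
    cols = list(zip(*train))
    return [min(c) for c in cols], [max(c) for c in cols]

def apply_min_max(data, mins, maxs):
    """Scale data using parameters learned on the training split."""
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, mins, maxs)]
        for row in data
    ]

# Illustrative toy data: split FIRST, then fit the transformation.
train = [[1.0, 10.0], [3.0, 30.0]]
test = [[2.0, 50.0]]  # second feature exceeds the training max

mins, maxs = fit_min_max(train)
print(apply_min_max(train, mins, maxs))  # [[0.0, 0.0], [1.0, 1.0]]
print(apply_min_max(test, mins, maxs))   # [[0.5, 2.0]]
```

Note that the held-out values can legitimately fall outside [0, 1] here; fitting the min and max on the pooled data instead would hide that and couple the splits.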

@rasbt
Collaborator

rasbt commented Oct 13, 2020

I like your edit including the more biologically relevant context.

@Benjamin-Lee
Owner Author

OK, I'll put that into the PR instead of what I've got now. Thanks!

@rasbt
Collaborator

rasbt commented Oct 13, 2020

Perfect, thanks! I am planning to go over the manuscript once the midterms are over and some of the PRs have been merged. Right now, I don't want to double-edit things.
