Processing before splitting can cause overfitting #265
Comments
Yeah, we are getting into territory where it's tricky to balance covering the necessary details against keeping the article readable, concise, and simple enough for beginners to follow. We could maybe summarize this issue, and related ones, in a sentence like "While transformation and normalization need to be applied equally to all datasets, the parameters required for such procedures (for example, the dataset minimum and maximum values for 0-1 normalization, aka min-max scaling) should only be derived from the training data, not the tuning and test data, to keep the latter two independent."
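For anyone following along, here is a minimal sketch of what that sentence means in practice for min-max scaling. The helper names `fit_min_max` and `apply_min_max`, the random data, and the 60/20/20 split are made up for illustration and are not from the manuscript.

```python
import numpy as np

def fit_min_max(train):
    """Derive the per-feature minimum and range from the training data only."""
    lo = train.min(axis=0)
    span = train.max(axis=0) - lo
    span[span == 0] = 1.0  # guard against division by zero for constant features
    return lo, span

def apply_min_max(data, lo, span):
    """Apply the same transformation to any split, using the training parameters."""
    return (data - lo) / span

# Hypothetical data: 100 samples, 3 features, split 60/20/20.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
train, tune, test = X[:60], X[60:80], X[80:]

# Parameters come from the training split alone...
lo, span = fit_min_max(train)

# ...and are then reused unchanged for all three splits.
train_s = apply_min_max(train, lo, span)
tune_s = apply_min_max(tune, lo, span)
test_s = apply_min_max(test, lo, span)
```

Note that the scaled tuning and test values can fall outside the 0-1 range. That is expected when the parameters come from the training data only.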
The current sentence addressing the topic is:
I prefer your sentence in general, though I have a few suggestions. What about something like:
Ideally, we could preserve the example of quantile normalization, which is a more biological example.
I like your edit including the more biologically relevant context.
Ok I'll put that into the PR instead of what I've got now. Thanks!
Perfect, thanks! I am planning to go over the manuscript once the midterms are over and some of the PRs have been merged. Right now, I don't want to double-edit things.
Not sure whether it's worth mentioning that if a single dataset is split into 2 or 3 sets (for training, tuning, and testing) and preprocessing (like quantile normalization) is applied before this split, then the resulting datasets may not be truly independent.
Originally posted by @SiminaB in #241 (comment)
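To make the quantile normalization case from the original post concrete, here is a rough sketch of one way to keep the splits independent: build the reference distribution from the training samples only, then map every sample onto it. The functions `fit_reference` and `quantile_normalize` are hypothetical names, the data are random, samples are assumed to be rows, and ties are ignored for simplicity.

```python
import numpy as np

def fit_reference(train):
    """Reference distribution: mean of the sorted values across training samples."""
    return np.sort(train, axis=1).mean(axis=0)

def quantile_normalize(data, reference):
    """Replace each value with the reference value at its within-sample rank."""
    # argsort of argsort gives the rank (0-based) of each value within its row.
    ranks = np.argsort(np.argsort(data, axis=1), axis=1)
    return reference[ranks]

# Hypothetical expression-like matrix: 30 samples (rows) x 50 features (columns).
rng = np.random.default_rng(1)
X = rng.lognormal(size=(30, 50))
train, test = X[:20], X[20:]

# The reference is derived from the training samples only.
reference = fit_reference(train)
train_qn = quantile_normalize(train, reference)
test_qn = quantile_normalize(test, reference)
```

If the reference were instead computed from all 30 samples before splitting, the normalized training values would depend on the test samples, which is the leakage described above.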