Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Toby Dylan Hocking committed Oct 22, 2024
1 parent 9518d80, commit f06b107
Showing 4 changed files with 116 additions and 97 deletions.
@@ -1,7 +1,7 @@
 Package: mlr3resampling
 Type: Package
 Title: Resampling Algorithms for 'mlr3' Framework
-Version: 2024.9.6
+Version: 2024.10.22
 Authors@R: c(
 person("Toby", "Hocking",
 email="[email protected]",
@@ -35,26 +35,19 @@ Authors@R: c(
 )
 Description: A supervised learning algorithm inputs a train set,
 and outputs a prediction function, which can be used on a test set.
-If each data point belongs to a group
+If each data point belongs to a subset
 (such as geographic region, year, etc), then
-how do we know if it is possible to train on one group, and predict
-accurately on another group? Cross-validation can be used to determine
-the extent to which this is possible, by first assigning fold IDs from
-1 to K to all data (possibly using stratification, usually by group
-and label). Then we loop over test sets (group/fold combinations),
-train sets (same group, other groups, all groups), and compute
-test/prediction accuracy for each combination. Comparing
-test/prediction accuracy between same and other, we can determine the
-extent to which it is possible (perfect if same/other have similar
-test accuracy for each group; other is usually somewhat less accurate
-than same; other can be just as bad as featureless baseline when the
-groups have different patterns).
-For more information,
-<https://tdhock.github.io/blog/2023/R-gen-new-subsets/>
-describes the method in depth.
-How many train samples are required to get accurate predictions on a
-test set? Cross-validation can be used to answer this question, with
-variable size train sets.
+how do we know if subsets are similar enough so that
+we can get accurate predictions on one subset,
+after training on Other subsets?
+And how do we know if training on All subsets would improve
+prediction accuracy, relative to training on the Same subset?
+SOAK, Same/Other/All K-fold cross-validation, <arXiv:2410.08643>
+can be used to answer these question, by fixing a test subset,
+training models on Same/Other/All subsets, and then
+comparing test error rates (Same versus Other and Same versus All).
+Also provides code for estimating how many train samples
+are required to get accurate predictions on a test set.
 License: GPL-3
 URL: https://github.com/tdhock/mlr3resampling
 BugReports: https://github.com/tdhock/mlr3resampling/issues
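The Same/Other/All procedure described in the new Description text can be illustrated with a small sketch. This is not the mlr3resampling API (the package is written in R); it is a pure-Python illustration under stated assumptions: the names `soak`, `fit_line`, `mse`, and `demo_data` are hypothetical, the learner is a one-feature least-squares line, fold IDs are assigned within each subset, and all three train sets exclude the test fold. The synthetic data use two subsets with opposite slopes, so training on Other should predict poorly.

```python
import random
from statistics import mean


def fit_line(rows):
    """Least-squares fit of y = a + b*x; rows are (x, y) pairs."""
    x_bar = mean([x for x, _ in rows])
    y_bar = mean([y for _, y in rows])
    sxx = sum((x - x_bar) ** 2 for x, _ in rows)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    slope = sxy / sxx if sxx else 0.0
    return y_bar - slope * x_bar, slope


def mse(model, rows):
    """Mean squared test error of a fitted (intercept, slope) model."""
    intercept, slope = model
    return mean([(y - (intercept + slope * x)) ** 2 for x, y in rows])


def soak(data, n_folds=3, seed=0):
    """data: list of (subset, x, y). Returns one result dict per
    (test subset, test fold, train type) combination."""
    rng = random.Random(seed)
    by_subset = {}
    for i, (subset, _, _) in enumerate(data):
        by_subset.setdefault(subset, []).append(i)
    # Assumption: fold IDs are assigned within each subset (stratification).
    fold = [0] * len(data)
    for indices in by_subset.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            fold[i] = pos % n_folds
    results = []
    for test_subset in by_subset:
        for test_fold in range(n_folds):
            # Fix a test set: one subset, one fold.
            test = [(x, y) for i, (s, x, y) in enumerate(data)
                    if s == test_subset and fold[i] == test_fold]
            # Candidate train rows: every row outside the test fold.
            pool = [row for i, row in enumerate(data) if fold[i] != test_fold]
            train_sets = {
                "same": [(x, y) for s, x, y in pool if s == test_subset],
                "other": [(x, y) for s, x, y in pool if s != test_subset],
                "all": [(x, y) for _, x, y in pool],
            }
            for train_type, train in train_sets.items():
                results.append({
                    "test_subset": test_subset,
                    "test_fold": test_fold,
                    "train": train_type,
                    "mse": mse(fit_line(train), test),
                })
    return results


def demo_data(n_per_subset=30, seed=1):
    """Synthetic data: subset A has slope +2, subset B has slope -2."""
    rng = random.Random(seed)
    rows = []
    for subset, slope in (("A", 2.0), ("B", -2.0)):
        for _ in range(n_per_subset):
            x = rng.uniform(-1.0, 1.0)
            rows.append((subset, x, slope * x + rng.gauss(0.0, 0.1)))
    return rows


results = soak(demo_data())
# Average test error per train type, over all test subsets and folds.
avg = {t: mean([r["mse"] for r in results if r["train"] == t])
       for t in ("same", "other", "all")}
for train_type, err in avg.items():
    print(train_type, round(err, 3))
```

Comparing `avg["same"]` with `avg["other"]` and `avg["all"]` mirrors the Same versus Other and Same versus All comparisons named in the Description. With these deliberately dissimilar subsets, Same gives the lowest error; with similar subsets, All would be expected to match or beat Same.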