group -> subset in docs
Toby Dylan Hocking committed Oct 22, 2024
1 parent 9518d80 commit f06b107
Showing 4 changed files with 116 additions and 97 deletions.
33 changes: 13 additions & 20 deletions DESCRIPTION
@@ -1,7 +1,7 @@
Package: mlr3resampling
Type: Package
Title: Resampling Algorithms for 'mlr3' Framework
Version: 2024.9.6
Version: 2024.10.22
Authors@R: c(
person("Toby", "Hocking",
email="[email protected]",
@@ -35,26 +35,19 @@ Authors@R: c(
)
Description: A supervised learning algorithm inputs a train set,
and outputs a prediction function, which can be used on a test set.
If each data point belongs to a group
If each data point belongs to a subset
(such as geographic region, year, etc), then
how do we know if it is possible to train on one group, and predict
accurately on another group? Cross-validation can be used to determine
the extent to which this is possible, by first assigning fold IDs from
1 to K to all data (possibly using stratification, usually by group
and label). Then we loop over test sets (group/fold combinations),
train sets (same group, other groups, all groups), and compute
test/prediction accuracy for each combination. Comparing
test/prediction accuracy between same and other, we can determine the
extent to which it is possible (perfect if same/other have similar
test accuracy for each group; other is usually somewhat less accurate
than same; other can be just as bad as featureless baseline when the
groups have different patterns).
For more information,
<https://tdhock.github.io/blog/2023/R-gen-new-subsets/>
describes the method in depth.
How many train samples are required to get accurate predictions on a
test set? Cross-validation can be used to answer this question, with
variable size train sets.
how do we know if subsets are similar enough so that
we can get accurate predictions on one subset,
after training on Other subsets?
And how do we know if training on All subsets would improve
prediction accuracy, relative to training on the Same subset?
SOAK, Same/Other/All K-fold cross-validation, <arXiv:2410.08643>
can be used to answer these questions, by fixing a test subset,
training models on Same/Other/All subsets, and then
comparing test error rates (Same versus Other and Same versus All).
Also provides code for estimating how many train samples
are required to get accurate predictions on a test set.
License: GPL-3
URL: https://github.com/tdhock/mlr3resampling
BugReports: https://github.com/tdhock/mlr3resampling/issues
84 changes: 43 additions & 41 deletions README.org
@@ -15,29 +15,31 @@ framework in R

** Description

For an overview of functionality, [[https://tdhock.github.io/blog/2024/cv-all-same-new/][please read my recent blog post]].
For an overview of functionality, [[https://tdhock.github.io/blog/2024/cv-all-same-new/][please read my recent blog post]], the
[[https://arxiv.org/abs/2410.08643][SOAK arXiv paper]], and [[https://github.com/tdhock/mlr3resampling/wiki/Articles][other articles]].

*** Algorithm 1: cross-validation for comparing train on same and other
*** SOAK: Same/Other/All K-fold cross-validation for estimating similarity of patterns in data subsets

See examples in [[https://cloud.r-project.org/web/packages/mlr3resampling/vignettes/ResamplingSameOtherSizesCV.html][ResamplingSameOtherSizesCV vignette]] and data viz for
[[https://tdhock.github.io/2023-12-13-train-predict-subsets-regression/][regression]] and [[https://tdhock.github.io/2023-12-13-train-predict-subsets-classification/][classification]].

A supervised learning algorithm inputs a train set, and outputs a
prediction function, which can be used on a test set. If each data
point belongs to a group (such as geographic region, year, etc), then
how do we know if it is possible to train on one group, and predict
accurately on another group? Cross-validation can be used to determine
the extent to which this is possible, by first assigning fold IDs from
1 to K to all data (possibly using stratification, usually by group
and label). Then we loop over test sets (group/fold combinations),
train sets (same group, other groups, all groups), and compute
test/prediction accuracy for each combination. Comparing
test/prediction accuracy between same and other, we can determine the
extent to which it is possible:

- perfect if same/other have similar test accuracy for each group, and all is more accurate;
- other/all are usually somewhat less accurate than same in real data;
- other can be just as bad as featureless baseline when the groups have different patterns.
prediction function, which can be used on a test set. If each data
point belongs to a subset (such as geographic region, year, etc), then
how do we know if subsets are similar enough so that we can get
accurate predictions on one subset, after training on Other subsets?
And how do we know if training on All subsets would improve prediction
accuracy, relative to training on the Same subset? SOAK,
Same/Other/All K-fold cross-validation, can be used to answer these
questions, by fixing a test subset, training models on Same/Other/All
subsets, and then comparing test error rates (Same versus Other and
Same versus All).

- subsets are similar if All is more accurate than Same, and whichever
  of Same/Other has more train data is more accurate.
- subsets are different if All/Other is less accurate than Same.
- Other can be just as bad as featureless baseline (or worse) when the
subsets have different patterns.

This is implemented in =ResamplingSameOtherSizesCV= when you use it on
a task that defines the =subset= role, for example the Arizona trees
@@ -60,16 +62,16 @@ experiment. The rows 12,15,18 below represent splits that attempt to
answer that question (test.subset=S, train.subsets=other).

#+begin_src R
> same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
> task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
> task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
> task.obj$col_roles$strata <- "y" #keep data proportional when splitting.
> task.obj$col_roles$stratum <- "y" #keep data proportional when splitting.
> task.obj$col_roles$group <- "polygon" #keep data together when splitting.
> task.obj$col_roles$subset <- "region3" #fix one test region, train on same/other/all region(s).
> same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
> same_other_sizes_cv$instantiate(task.obj)
> same_other_sizes_cv$instance$iteration.dt[, .(test.subset, train.subsets, test.fold)]
test.subset train.subsets test.fold
<char> <char> <int>
<char> <char> <int>
1: NE all 1
2: NW all 1
3: S all 1
@@ -105,18 +107,20 @@ The rows in the output above represent different kinds of splits:
- train.subsets=same is used as a baseline.
- train.subsets=all is used to answer the question, "is it beneficial
to combine all subsets when training?"
- train.subsets=other is used to answer the question, "can we train on
one subset, and accurately predict on another?"

Code to re-run:

#+begin_src R
data(AZtrees,package="mlr3resampling")
table(AZtrees$region3)
same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
task.obj$col_roles$strata <- "y" #keep data proportional when splitting.
task.obj$col_roles$stratum <- "y" #keep data proportional when splitting.
task.obj$col_roles$group <- "polygon" #keep data together when splitting.
task.obj$col_roles$subset <- "region3" #fix one test region, train on same/other/all region(s).
same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
same_other_sizes_cv$instantiate(task.obj)
same_other_sizes_cv$instance$iteration.dt[, .(test.subset, train.subsets, test.fold)]
#+end_src
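
Once instantiated, these splits can be passed to =mlr3::benchmark()= in
order to compare test error between Same/Other/All train subsets, as
described above. The sketch below is one possible workflow, not the
only one: it assumes the =task.obj= defined in the previous block, two
illustrative learners (a featureless baseline and an rpart decision
tree), and that =mlr3resampling::score()= returns one row per split
with =test.subset=, =train.subsets=, and the default classification
error column (names assumed).

#+begin_src R
## Hedged sketch, not a definitive recipe.
learners <- list(
  mlr3::lrn("classif.featureless"), # baseline that ignores all features
  mlr3::lrn("classif.rpart"))       # decision tree
soak <- mlr3resampling::ResamplingSameOtherSizesCV$new()
bench.grid <- mlr3::benchmark_grid(task.obj, learners, soak)
bench.result <- mlr3::benchmark(bench.grid)
## score() joins split meta-data (test.subset, train.subsets, ...) onto
## per-split error rates; column names below are assumed.
score.dt <- mlr3resampling::score(bench.result)
score.dt[, .(mean.error = mean(classif.ce)),
         by = .(learner_id, test.subset, train.subsets)]
#+end_src
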
@@ -139,14 +143,18 @@ below,
#+end_src

The output above indicates we have 5956 rows and 189 polygons. We can
do cross-validation on either polygons (if task has =group= role) or
rows (if no =group= role set). The code below sets a down-sampling
do cross-validation on either polygons (if task has =subset= role) or
rows (if no =subset= role set). The code below sets a down-sampling
=ratio= of 0.8, and four =sizes= of down-sampled train sets.

#+begin_src R
> same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
> same_other_sizes_cv$param_set$values$ratio <- 0.8
> same_other_sizes_cv$param_set$values$sizes <- 4
> same_other_sizes_cv$param_set$values$ratio <- 0.8
> task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
> task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
> task.obj$col_roles$stratum <- "y" #keep data proportional when splitting.
> task.obj$col_roles$group <- "polygon" #keep data together when splitting.
> same_other_sizes_cv$instantiate(task.obj)
> same_other_sizes_cv$instance$iteration.dt[, .(n.train.groups, test.fold)]
n.train.groups test.fold
@@ -180,28 +188,22 @@ Code to re-run:
data(AZtrees,package="mlr3resampling")
dim(AZtrees)
length(unique(AZtrees$polygon))
task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
task.obj$col_roles$strata <- "y" #keep data proportional when splitting.
task.obj$col_roles$group <- "polygon" #keep data together when splitting.
same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
same_other_sizes_cv$param_set$values$sizes <- 4
same_other_sizes_cv$param_set$values$ratio <- 0.8
task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
task.obj$col_roles$stratum <- "y" #keep data proportional when splitting.
task.obj$col_roles$group <- "polygon" #keep data together when splitting.
same_other_sizes_cv$instantiate(task.obj)
same_other_sizes_cv$instance$iteration.dt[, .(n.train.groups, test.fold)]
#+end_src
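
Roughly speaking, =ratio= is the multiplicative step between
consecutive down-sampled train set sizes. The arithmetic sketch below
shows the sizes implied by =ratio= of 0.8 and =sizes= of 4, assuming
=sizes= counts the down-sampled sets in addition to the full train set,
and ignoring rounding and stratification constraints, so the sizes
actually chosen by the resampler may differ.

#+begin_src R
## Illustrative arithmetic only, not package code.
N <- 5000            # hypothetical full train set size
round(N * 0.8^(0:4)) # 5000 4000 3200 2560 2048
#+end_src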

*** Older Usage Examples and Discussion

Older examples in [[https://github.com/tdhock/mlr3resampling/blob/main/vignettes/Older_resamplers.Rmd][Older resamplers vignette]] (useful for visualization).

The blog posts linked below use larger data sizes than the examples in
the CRAN vignettes linked above.

- https://tdhock.github.io/blog/2023/R-gen-new-subsets/
- https://tdhock.github.io/blog/2023/variable-size-train/

** Related work

mlr3resampling code was copied/modified from Resampling and
ResamplingCV classes in the excellent [[https://github.com/mlr-org/mlr3][mlr3]] package.
- mlr3resampling code was copied/modified from Resampling and
ResamplingCV classes in the excellent [[https://github.com/mlr-org/mlr3][mlr3]] package.
- As of Oct 2024, scikit-learn in Python implements support for groups
via [[https://scikit-learn.org/stable/modules/cross_validation.html#group-k-fold][GroupKFold]] (keeping samples together when splitting) but not
subsets (test data come from one subset, train data come from
Same/Other/All subsets).
59 changes: 39 additions & 20 deletions man/ResamplingSameOtherCV.Rd
@@ -1,6 +1,6 @@
\name{ResamplingSameOtherCV}
\alias{ResamplingSameOtherCV}
\title{Resampling for comparing training on same or other groups}
\title{Resampling for comparing training on same or other subsets}
\description{
\code{\link{ResamplingSameOtherCV}}
defines how a task is partitioned for
Expand All @@ -10,28 +10,38 @@

Resampling objects can be instantiated on a
\code{\link[mlr3:Task]{Task}},
which should define at least one group variable.
which should define at least one subset variable.

After instantiation, sets can be accessed via
\verb{$train_set(i)} and
\verb{$test_set(i)}, respectively.
}
\details{
This provides an implementation of SOAK, Same/Other/All K-fold
cross-validation. After instantiation, this class provides information
in \verb{$instance} that can be used for visualizing the
splits, as shown in the vignette. Most typical machine learning users
should instead use
\code{\link{ResamplingSameOtherSizesCV}}, which does not support these
visualization features, but provides other relevant machine learning
features, such as the \code{group} role, which
\code{\link{ResamplingSameOtherCV}} does not support.

A supervised learning algorithm inputs a train set, and outputs a
prediction function, which can be used on a test set. If each data
point belongs to a group (such as geographic region, year, etc), then
how do we know if it is possible to train on one group, and predict
accurately on another group? Cross-validation can be used to determine
point belongs to a subset (such as geographic region, year, etc), then
how do we know if it is possible to train on one subset, and predict
accurately on another subset? Cross-validation can be used to determine
the extent to which this is possible, by first assigning fold IDs from
1 to K to all data (possibly using stratification, usually by group
and label). Then we loop over test sets (group/fold combinations),
train sets (same group, other groups, all groups), and compute
1 to K to all data (possibly using stratification, usually by subset
and label). Then we loop over test sets (subset/fold combinations),
train sets (same subset, other subsets, all subsets), and compute
test/prediction accuracy for each combination. Comparing
test/prediction accuracy between same and other, we can determine the
extent to which it is possible (perfect if same/other have similar
test accuracy for each group; other is usually somewhat less accurate
test accuracy for each subset; other is usually somewhat less accurate
than same; other can be just as bad as featureless baseline when the
groups have different patterns).
subsets have different patterns).
}
\section{Stratification}{

@@ -44,21 +54,28 @@ each combination of the values of the stratification variables forms a stratum.

\section{Grouping}{

\code{\link{ResamplingSameOtherCV}} supports grouping of observations.
The grouping variable is assumed to be discrete,
and must be stored in the \link[mlr3:Task]{Task} with column role \code{"group"}.
\code{\link{ResamplingSameOtherCV}} does not support grouping of
observations (keeping all observations from a group together when
splitting). See \code{\link{ResamplingSameOtherSizesCV}} for a
resampler that supports both \code{group} and \code{subset} roles.

}

\section{Subsets}{

The subset variable is assumed to be discrete,
and must be stored in the \link[mlr3:Task]{Task} with column role \code{"subset"}.
The number of cross-validation folds K should be defined via the
\code{folds} parameter.

In each group, there will be about an equal number of observations
In each subset, there will be about an equal number of observations
assigned to each of the K folds.
The assignments are stored in
\verb{$instance$id.dt}.
The train/test splits are defined by all possible combinations of
test group, test fold, and train groups (same/other/all).
test subset, test fold, and train subsets (Same/Other/All).
The splits are stored in
\verb{$instance$iteration.dt}.

}

\examples{
@@ -67,12 +84,14 @@ same_other$param_set$values$folds <- 5
}
\seealso{
\itemize{
\item Blog post
\url{https://tdhock.github.io/blog/2023/R-gen-new-subsets/}
\item arXiv paper \url{https://arxiv.org/abs/2410.08643} describing
SOAK algorithm.
\item Articles
\url{https://github.com/tdhock/mlr3resampling/wiki/Articles}
\item Package \CRANpkg{mlr3} for standard
\code{\link[mlr3:Resampling]{Resampling}}, which does not support comparing
train on same or other groups.
\item \code{\link{score}} and Simulations vignette for more detailed examples.
train on Same/Other/All subsets.
\item \code{vignette(package="mlr3resampling")} for more detailed examples.
}
}
\concept{Resampling}
37 changes: 21 additions & 16 deletions man/ResamplingSameOtherSizesCV.Rd
@@ -10,28 +10,29 @@

Resampling objects can be instantiated on a
\code{\link[mlr3:Task]{Task}},
which should define at least one group variable.
which can use the \code{subset} role.

After instantiation, sets can be accessed via
\verb{$train_set(i)} and
\verb{$test_set(i)}, respectively.
}
\details{
This is an implementation of SOAK, Same/Other/All K-fold cross-validation.
A supervised learning algorithm inputs a train set, and outputs a
prediction function, which can be used on a test set. If each data
point belongs to a group (such as geographic region, year, etc), then
how do we know if it is possible to train on one group, and predict
accurately on another group? Cross-validation can be used to determine
point belongs to a subset (such as geographic region, year, etc), then
how do we know if it is possible to train on one subset, and predict
accurately on another subset? Cross-validation can be used to determine
the extent to which this is possible, by first assigning fold IDs from
1 to K to all data (possibly using stratification, usually by group
and label). Then we loop over test sets (group/fold combinations),
train sets (same group, other groups, all groups), and compute
1 to K to all data (possibly using stratification, usually by subset
and label). Then we loop over test sets (subset/fold combinations),
train sets (same subset, other subsets, all subsets), and compute
test/prediction accuracy for each combination. Comparing
test/prediction accuracy between same and other, we can determine the
extent to which it is possible (perfect if same/other have similar
test accuracy for each group; other is usually somewhat less accurate
test accuracy for each subset; other is usually somewhat less accurate
than same; other can be just as bad as featureless baseline when the
groups have different patterns).
subsets have different patterns).

This class has more parameters/potential applications than
\code{\link{ResamplingSameOtherCV}} and
@@ -50,9 +51,11 @@ each combination of the values of the stratification variables forms a stratum.

\section{Grouping}{

\code{\link{ResamplingSameOtherSizesCV}} supports grouping of observations.
\code{\link{ResamplingSameOtherSizesCV}} supports grouping of
observations (all observations in a group are kept together, never
split between train and test sets).
The grouping variable is assumed to be discrete,
and must be stored in the \link[mlr3:Task]{Task} with column role \code{"group"}.
and must be stored in the \link[mlr3:Task]{Task} with column role
\code{"group"}.

}

@@ -87,7 +90,7 @@ The \code{ignore_subset} parameter should be either \code{TRUE} or
role. \code{TRUE} only creates splits for same subset (even if task
defines \code{subset} role), and is useful for subtrain/validation
splits (hyper-parameter learning). Note that this feature will work on a
task with \code{stratum} and \code{group} roles (unlike
task with both \code{stratum} and \code{group} roles (unlike
\code{ResamplingCV}).

In each subset, there will be about an equal number of observations
@@ -105,12 +108,14 @@ same_other_sizes$param_set$values$folds <- 5
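## Sketch of a fuller workflow (assumed here, adapted from the package
## README): instantiate on a task whose column roles include stratum,
## group, and subset, then inspect the splits in $instance$iteration.dt.
data(AZtrees, package="mlr3resampling")
task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
task.obj$col_roles$stratum <- "y"      # keep data proportional when splitting
task.obj$col_roles$group <- "polygon"  # keep data together when splitting
task.obj$col_roles$subset <- "region3" # test one region; train on Same/Other/All
same_other_sizes$instantiate(task.obj)
same_other_sizes$instance$iteration.dt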
}
\seealso{
\itemize{
\item Blog post
\url{https://tdhock.github.io/blog/2023/R-gen-new-subsets/}
\item arXiv paper \url{https://arxiv.org/abs/2410.08643} describing
SOAK algorithm.
\item Articles
\url{https://github.com/tdhock/mlr3resampling/wiki/Articles}
\item Package \CRANpkg{mlr3} for standard
\code{\link[mlr3:Resampling]{Resampling}}, which does not support comparing
train on same or other groups.
\item \code{\link{score}} and Simulations vignette for more detailed examples.
train on Same/Other/All subsets.
\item \code{vignette(package="mlr3resampling")} for more detailed examples.
}
}
\concept{Resampling}
