group -> subset in docs
Toby Dylan Hocking committed Oct 22, 2024
1 parent 9518d80 commit f06b107
Showing 4 changed files with 116 additions and 97 deletions.
33 changes: 13 additions & 20 deletions DESCRIPTION
@@ -1,7 +1,7 @@
Package: mlr3resampling
Type: Package
Title: Resampling Algorithms for 'mlr3' Framework
Version: 2024.9.6
Version: 2024.10.22
Authors@R: c(
person("Toby", "Hocking",
email="[email protected]",
@@ -35,26 +35,19 @@ Authors@R: c(
)
Description: A supervised learning algorithm inputs a train set,
and outputs a prediction function, which can be used on a test set.
If each data point belongs to a group
If each data point belongs to a subset
(such as geographic region, year, etc), then
how do we know if it is possible to train on one group, and predict
accurately on another group? Cross-validation can be used to determine
the extent to which this is possible, by first assigning fold IDs from
1 to K to all data (possibly using stratification, usually by group
and label). Then we loop over test sets (group/fold combinations),
train sets (same group, other groups, all groups), and compute
test/prediction accuracy for each combination. Comparing
test/prediction accuracy between same and other, we can determine the
extent to which it is possible (perfect if same/other have similar
test accuracy for each group; other is usually somewhat less accurate
than same; other can be just as bad as featureless baseline when the
groups have different patterns).
For more information,
<https://tdhock.github.io/blog/2023/R-gen-new-subsets/>
describes the method in depth.
How many train samples are required to get accurate predictions on a
test set? Cross-validation can be used to answer this question, with
variable size train sets.
how do we know if subsets are similar enough so that
we can get accurate predictions on one subset,
after training on Other subsets?
And how do we know if training on All subsets would improve
prediction accuracy, relative to training on the Same subset?
SOAK, Same/Other/All K-fold cross-validation, <arXiv:2410.08643>
can be used to answer these questions, by fixing a test subset,
training models on Same/Other/All subsets, and then
comparing test error rates (Same versus Other and Same versus All).
Also provides code for estimating how many train samples
are required to get accurate predictions on a test set.
License: GPL-3
URL: https://github.com/tdhock/mlr3resampling
BugReports: https://github.com/tdhock/mlr3resampling/issues
84 changes: 43 additions & 41 deletions README.org
@@ -15,29 +15,31 @@ framework in R

** Description

For an overview of functionality, [[https://tdhock.github.io/blog/2024/cv-all-same-new/][please read my recent blog post]].
For an overview of functionality, [[https://tdhock.github.io/blog/2024/cv-all-same-new/][please read my recent blog post]], the
[[https://arxiv.org/abs/2410.08643][SOAK arXiv paper]], and [[https://github.com/tdhock/mlr3resampling/wiki/Articles][other articles]].

*** Algorithm 1: cross-validation for comparing train on same and other
*** SOAK: Same/Other/All K-fold cross-validation for estimating similarity of patterns in data subsets

See examples in [[https://cloud.r-project.org/web/packages/mlr3resampling/vignettes/ResamplingSameOtherSizesCV.html][ResamplingSameOtherSizesCV vignette]] and data viz for
[[https://tdhock.github.io/2023-12-13-train-predict-subsets-regression/][regression]] and [[https://tdhock.github.io/2023-12-13-train-predict-subsets-classification/][classification]].

A supervised learning algorithm inputs a train set, and outputs a
prediction function, which can be used on a test set. If each data
point belongs to a group (such as geographic region, year, etc), then
how do we know if it is possible to train on one group, and predict
accurately on another group? Cross-validation can be used to determine
the extent to which this is possible, by first assigning fold IDs from
1 to K to all data (possibly using stratification, usually by group
and label). Then we loop over test sets (group/fold combinations),
train sets (same group, other groups, all groups), and compute
test/prediction accuracy for each combination. Comparing
test/prediction accuracy between same and other, we can determine the
extent to which it is possible:

- perfect if same/other have similar test accuracy for each group, and all is more accurate;
- other/all are usually somewhat less accurate than same in real data;
- other can be just as bad as featureless baseline when the groups have different patterns.
prediction function, which can be used on a test set. If each data
point belongs to a subset (such as geographic region, year, etc), then
how do we know if subsets are similar enough so that we can get
accurate predictions on one subset, after training on Other subsets?
And how do we know if training on All subsets would improve prediction
accuracy, relative to training on the Same subset? SOAK,
Same/Other/All K-fold cross-validation, can be used to answer these
questions, by fixing a test subset, training models on Same/Other/All
subsets, and then comparing test error rates (Same versus Other and
Same versus All).

- subsets are similar if All is more accurate than Same, and whichever
  of Same/Other has more train data is more accurate.
- subsets are different if All/Other is less accurate than Same.
- Other can be just as bad as featureless baseline (or worse) when the
subsets have different patterns.

This is implemented in =ResamplingSameOtherSizesCV= when you use it on
a task that defines the =subset= role, for example the Arizona trees
@@ -60,16 +62,16 @@ experiment. The rows 12,15,18 below represent splits that attempt to
answer that question (test.subset=S, train.subsets=other).

#+begin_src R
> same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
> task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
> task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
> task.obj$col_roles$strata <- "y" #keep data proportional when splitting.
> task.obj$col_roles$stratum <- "y" #keep data proportional when splitting.
> task.obj$col_roles$group <- "polygon" #keep data together when splitting.
> task.obj$col_roles$subset <- "region3" #fix one test region, train on same/other/all region(s).
> same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
> same_other_sizes_cv$instantiate(task.obj)
> same_other_sizes_cv$instance$iteration.dt[, .(test.subset, train.subsets, test.fold)]
test.subset train.subsets test.fold
<char> <char> <int>
<char> <char> <int>
1: NE all 1
2: NW all 1
3: S all 1
@@ -105,18 +107,20 @@ The rows in the output above represent different kinds of splits:
- train.subsets=same is used as a baseline.
- train.subsets=all is used to answer the question, "is it beneficial
to combine all subsets when training?"
- train.subsets=other is used to answer the question, "can we train on
one subset, and accurately predict on another?"

Code to re-run:

#+begin_src R
data(AZtrees,package="mlr3resampling")
table(AZtrees$region3)
same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
task.obj$col_roles$strata <- "y" #keep data proportional when splitting.
task.obj$col_roles$stratum <- "y" #keep data proportional when splitting.
task.obj$col_roles$group <- "polygon" #keep data together when splitting.
task.obj$col_roles$subset <- "region3" #fix one test region, train on same/other/all region(s).
same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
same_other_sizes_cv$instantiate(task.obj)
same_other_sizes_cv$instance$iteration.dt[, .(test.subset, train.subsets, test.fold)]
#+end_src
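
Once instantiated, these splits can be passed to =mlr3::benchmark()= in
order to compare test error between Same/Other/All train subsets, as
described above. The sketch below is one possible workflow, not the
only one: it assumes the =task.obj= defined in the previous block, two
illustrative learners (a featureless baseline and an rpart decision
tree), and that =mlr3resampling::score()= returns one row per split
with =test.subset=, =train.subsets=, and the default classification
error column (names assumed).

#+begin_src R
## Hedged sketch, not a definitive recipe.
learners <- list(
  mlr3::lrn("classif.featureless"), # baseline that ignores all features
  mlr3::lrn("classif.rpart"))       # decision tree
soak <- mlr3resampling::ResamplingSameOtherSizesCV$new()
bench.grid <- mlr3::benchmark_grid(task.obj, learners, soak)
bench.result <- mlr3::benchmark(bench.grid)
## score() joins split meta-data (test.subset, train.subsets, ...) onto
## per-split error rates; column names below are assumed.
score.dt <- mlr3resampling::score(bench.result)
score.dt[, .(mean.error = mean(classif.ce)),
         by = .(learner_id, test.subset, train.subsets)]
#+end_src
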
@@ -139,14 +143,18 @@ below,
#+end_src

The output above indicates we have 5956 rows and 189 polygons. We can
do cross-validation on either polygons (if task has =group= role) or
rows (if no =group= role set). The code below sets a down-sampling
do cross-validation on either polygons (if task has =subset= role) or
rows (if no =subset= role set). The code below sets a down-sampling
=ratio= of 0.8, and four =sizes= of down-sampled train sets.

#+begin_src R
> same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
> same_other_sizes_cv$param_set$values$ratio <- 0.8
> same_other_sizes_cv$param_set$values$sizes <- 4
> same_other_sizes_cv$param_set$values$ratio <- 0.8
> task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
> task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
> task.obj$col_roles$stratum <- "y" #keep data proportional when splitting.
> task.obj$col_roles$group <- "polygon" #keep data together when splitting.
> same_other_sizes_cv$instantiate(task.obj)
> same_other_sizes_cv$instance$iteration.dt[, .(n.train.groups, test.fold)]
n.train.groups test.fold
@@ -180,28 +188,22 @@ Code to re-run:
data(AZtrees,package="mlr3resampling")
dim(AZtrees)
length(unique(AZtrees$polygon))
task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
task.obj$col_roles$strata <- "y" #keep data proportional when splitting.
task.obj$col_roles$group <- "polygon" #keep data together when splitting.
same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
same_other_sizes_cv$param_set$values$sizes <- 4
same_other_sizes_cv$param_set$values$ratio <- 0.8
task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
task.obj$col_roles$stratum <- "y" #keep data proportional when splitting.
task.obj$col_roles$group <- "polygon" #keep data together when splitting.
same_other_sizes_cv$instantiate(task.obj)
same_other_sizes_cv$instance$iteration.dt[, .(n.train.groups, test.fold)]
#+end_src
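
Roughly speaking, =ratio= is the multiplicative step between
consecutive down-sampled train set sizes. The arithmetic sketch below
shows the sizes implied by =ratio= of 0.8 and =sizes= of 4, assuming
=sizes= counts the down-sampled sets in addition to the full train set,
and ignoring rounding and stratification constraints, so the sizes
actually chosen by the resampler may differ.

#+begin_src R
## Illustrative arithmetic only, not package code.
N <- 5000            # hypothetical full train set size
round(N * 0.8^(0:4)) # 5000 4000 3200 2560 2048
#+end_src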

*** Older Usage Examples and Discussion

Older examples in [[https://github.com/tdhock/mlr3resampling/blob/main/vignettes/Older_resamplers.Rmd][Older resamplers vignette]] (useful for visualization).

The blog posts linked below use larger data sizes than the examples in
the CRAN vignettes linked above.

- https://tdhock.github.io/blog/2023/R-gen-new-subsets/
- https://tdhock.github.io/blog/2023/variable-size-train/

** Related work

mlr3resampling code was copied/modified from Resampling and
ResamplingCV classes in the excellent [[https://github.com/mlr-org/mlr3][mlr3]] package.
- mlr3resampling code was copied/modified from Resampling and
ResamplingCV classes in the excellent [[https://github.com/mlr-org/mlr3][mlr3]] package.
- As of Oct 2024, scikit-learn in Python implements support for groups
via [[https://scikit-learn.org/stable/modules/cross_validation.html#group-k-fold][GroupKFold]] (keeping samples together when splitting) but not
subsets (test data come from one subset, train data come from
Same/Other/All subsets).
59 changes: 39 additions & 20 deletions man/ResamplingSameOtherCV.Rd
@@ -1,6 +1,6 @@
\name{ResamplingSameOtherCV}
\alias{ResamplingSameOtherCV}
\title{Resampling for comparing training on same or other groups}
\title{Resampling for comparing training on same or other subsets}
\description{
\code{\link{ResamplingSameOtherCV}}
defines how a task is partitioned for
Expand All @@ -10,28 +10,38 @@

Resampling objects can be instantiated on a
\code{\link[mlr3:Task]{Task}},
which should define at least one group variable.
which should define at least one subset variable.

After instantiation, sets can be accessed via
\verb{$train_set(i)} and
\verb{$test_set(i)}, respectively.
}
\details{
This provides an implementation of SOAK, Same/Other/All K-fold
cross-validation. After instantiation, this class provides information
in \verb{$instance} that can be used for visualizing the
splits, as shown in the vignette. Most typical machine learning users
should instead use
\code{\link{ResamplingSameOtherSizesCV}}, which does not support these
visualization features, but provides other relevant machine learning
features, such as the \code{group} role, which
\code{\link{ResamplingSameOtherCV}} does not support.

A supervised learning algorithm inputs a train set, and outputs a
prediction function, which can be used on a test set. If each data
point belongs to a group (such as geographic region, year, etc), then
how do we know if it is possible to train on one group, and predict
accurately on another group? Cross-validation can be used to determine
point belongs to a subset (such as geographic region, year, etc), then
how do we know if it is possible to train on one subset, and predict
accurately on another subset? Cross-validation can be used to determine
the extent to which this is possible, by first assigning fold IDs from
1 to K to all data (possibly using stratification, usually by group
and label). Then we loop over test sets (group/fold combinations),
train sets (same group, other groups, all groups), and compute
1 to K to all data (possibly using stratification, usually by subset
and label). Then we loop over test sets (subset/fold combinations),
train sets (same subset, other subsets, all subsets), and compute
test/prediction accuracy for each combination. Comparing
test/prediction accuracy between same and other, we can determine the
extent to which it is possible (perfect if same/other have similar
test accuracy for each group; other is usually somewhat less accurate
test accuracy for each subset; other is usually somewhat less accurate
than same; other can be just as bad as featureless baseline when the
groups have different patterns).
subsets have different patterns).
}
\section{Stratification}{

@@ -44,21 +54,28 @@ each combination of the values of the stratification variables forms a stratum.

\section{Grouping}{

\code{\link{ResamplingSameOtherCV}} supports grouping of observations.
The grouping variable is assumed to be discrete,
and must be stored in the \link[mlr3:Task]{Task} with column role \code{"group"}.
\code{\link{ResamplingSameOtherCV}} does not support grouping of
observations (keeping all observations from a group together when
splitting). See \code{\link{ResamplingSameOtherSizesCV}} for a
resampler that supports both \code{group} and \code{subset} roles.

}

\section{Subsets}{

The subset variable is assumed to be discrete,
and must be stored in the \link[mlr3:Task]{Task} with column role \code{"subset"}.
The number of cross-validation folds K should be defined via the
\code{folds} parameter.

In each group, there will be about an equal number of observations
In each subset, there will be about an equal number of observations
assigned to each of the K folds.
The assignments are stored in
\verb{$instance$id.dt}.
The train/test splits are defined by all possible combinations of
test group, test fold, and train groups (same/other/all).
test subset, test fold, and train subsets (Same/Other/All).
The splits are stored in
\verb{$instance$iteration.dt}.

}

\examples{
@@ -67,12 +84,14 @@ same_other$param_set$values$folds <- 5
}
\seealso{
\itemize{
\item Blog post
\url{https://tdhock.github.io/blog/2023/R-gen-new-subsets/}
\item arXiv paper \url{https://arxiv.org/abs/2410.08643} describing
SOAK algorithm.
\item Articles
\url{https://github.com/tdhock/mlr3resampling/wiki/Articles}
\item Package \CRANpkg{mlr3} for standard
\code{\link[mlr3:Resampling]{Resampling}}, which does not support comparing
train on same or other groups.
\item \code{\link{score}} and Simulations vignette for more detailed examples.
train on Same/Other/All subsets.
\item \code{vignette(package="mlr3resampling")} for more detailed examples.
}
}
\concept{Resampling}
37 changes: 21 additions & 16 deletions man/ResamplingSameOtherSizesCV.Rd
@@ -10,28 +10,29 @@

Resampling objects can be instantiated on a
\code{\link[mlr3:Task]{Task}},
which should define at least one group variable.
which can use the \code{subset} role.

After instantiation, sets can be accessed via
\verb{$train_set(i)} and
\verb{$test_set(i)}, respectively.
}
\details{
This is an implementation of SOAK, Same/Other/All K-fold cross-validation.
A supervised learning algorithm inputs a train set, and outputs a
prediction function, which can be used on a test set. If each data
point belongs to a group (such as geographic region, year, etc), then
how do we know if it is possible to train on one group, and predict
accurately on another group? Cross-validation can be used to determine
point belongs to a subset (such as geographic region, year, etc), then
how do we know if it is possible to train on one subset, and predict
accurately on another subset? Cross-validation can be used to determine
the extent to which this is possible, by first assigning fold IDs from
1 to K to all data (possibly using stratification, usually by group
and label). Then we loop over test sets (group/fold combinations),
train sets (same group, other groups, all groups), and compute
1 to K to all data (possibly using stratification, usually by subset
and label). Then we loop over test sets (subset/fold combinations),
train sets (same subset, other subsets, all subsets), and compute
test/prediction accuracy for each combination. Comparing
test/prediction accuracy between same and other, we can determine the
extent to which it is possible (perfect if same/other have similar
test accuracy for each group; other is usually somewhat less accurate
test accuracy for each subset; other is usually somewhat less accurate
than same; other can be just as bad as featureless baseline when the
groups have different patterns).
subsets have different patterns).

This class has more parameters/potential applications than
\code{\link{ResamplingSameOtherCV}} and
@@ -50,9 +51,11 @@ each combination of the values of the stratification variables forms a stratum.

\section{Grouping}{

\code{\link{ResamplingSameOtherSizesCV}} supports grouping of observations.
\code{\link{ResamplingSameOtherSizesCV}} supports grouping of
observations (all observations in a group are kept together, never
split between train and test sets).
The grouping variable is assumed to be discrete,
and must be stored in the \link[mlr3:Task]{Task} with column role \code{"group"}.
and must be stored in the \link[mlr3:Task]{Task} with column role
\code{"group"}.

}

@@ -87,7 +90,7 @@ The \code{ignore_subset} parameter should be either \code{TRUE} or
role. \code{TRUE} only creates splits for same subset (even if task
defines \code{subset} role), and is useful for subtrain/validation
splits (hyper-parameter learning). Note that this feature will work on a
task with \code{stratum} and \code{group} roles (unlike
task with both \code{stratum} and \code{group} roles (unlike
\code{ResamplingCV}).

In each subset, there will be about an equal number of observations
@@ -105,12 +108,14 @@ same_other_sizes$param_set$values$folds <- 5
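## Sketch of a fuller workflow (assumed here, adapted from the package
## README): instantiate on a task whose column roles include stratum,
## group, and subset, then inspect the splits in $instance$iteration.dt.
data(AZtrees, package="mlr3resampling")
task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
task.obj$col_roles$stratum <- "y"      # keep data proportional when splitting
task.obj$col_roles$group <- "polygon"  # keep data together when splitting
task.obj$col_roles$subset <- "region3" # test one region; train on Same/Other/All
same_other_sizes$instantiate(task.obj)
same_other_sizes$instance$iteration.dt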
}
\seealso{
\itemize{
\item Blog post
\url{https://tdhock.github.io/blog/2023/R-gen-new-subsets/}
\item arXiv paper \url{https://arxiv.org/abs/2410.08643} describing
SOAK algorithm.
\item Articles
\url{https://github.com/tdhock/mlr3resampling/wiki/Articles}
\item Package \CRANpkg{mlr3} for standard
\code{\link[mlr3:Resampling]{Resampling}}, which does not support comparing
train on same or other groups.
\item \code{\link{score}} and Simulations vignette for more detailed examples.
train on Same/Other/All subsets.
\item \code{vignette(package="mlr3resampling")} for more detailed examples.
}
}
\concept{Resampling}
