add sparse data design document #79

Open · wants to merge 6 commits into base: main
15 changes: 15 additions & 0 deletions _freeze/learn/develop/sparse-data/index/execute-results/html.json
82 changes: 82 additions & 0 deletions learn/develop/sparse-data/index.html.md
@@ -0,0 +1,82 @@
---
title: "How Sparse Data are Used in tidymodels"
categories:
- sparse data
type: learn-subsection
weight: 1
description: |
Design decisions around the use of sparse data in tidymodels.
toc: true
toc-depth: 2
include-after-body: ../../../resources.html
---









## What is sparse data?

We use the term **sparse data** to denote a data set that contains a lot of 0s. Such data commonly arises from categorical variables, text tokenization, or graph data sets. The word sparse describes how the information is packed: most of the values are zero. For some tasks, we can easily get above 99% 0s in the predictors.

The reason we use sparse data as a construct is that it is a lot more memory efficient to store the positions and values of the non-zero entries than to encode all the values. One could think of this as compression, but compression done in such a way that data tasks remain fast. The following vector requires 25 values to store normally (dense representation). This representation will be referred to as a **dense vector**.

```r
c(100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0)
```
The sparse representation of this vector requires only 5 values: one for the length (25), two for the locations of the non-zero entries (1 and 22), and two for the non-zero values themselves (100 and 1). This idea extends to matrices as well, as is done in the Matrix package.

The Matrix package implements sparse vectors and sparse matrices as S4 objects. These objects are the go-to objects in the R ecosystem for dealing with sparse data.
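
As a sketch of the idea, the Matrix package can store the vector above sparsely; the dense copy is built here only for comparison:

```r
library(Matrix)

# Dense representation: all 25 values are stored
dense_vec <- c(100, rep(0, 20), 1, rep(0, 3))

# Sparse representation: only the non-zero values, their positions,
# and the total length are stored
sparse_vec <- sparseVector(x = c(100, 1), i = c(1, 22), length = 25)

# Both encode the same numbers
all(as.vector(sparse_vec) == dense_vec)
```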

The [tibble](https://tibble.tidyverse.org/) is used as the main carrier of data inside tidymodels. While matrices or data frames are accepted as input in different places, they are converted to tibbles early to take advantage of the benefits and checks they provide. This is where the [sparsevctrs](https://github.com/r-lib/sparsevctrs) package comes in. The sparse vectors and matrices from the Matrix package don't work with data frames or tibbles and thus cannot be used directly within tidymodels. The sparsevctrs package allows for the creation of [ALTREP](https://svn.r-project.org/R/branches/ALTREP/ALTREP.html) vectors that act like normal numeric vectors but are encoded sparsely. If an operation on the vector can't be done sparsely, it will fall back and **materialize**. Materialization means that it will generate and cache a dense version of the vector, which is then used.

These sparse vectors can be put into tibbles and are what allow tidymodels to handle sparse data. We will henceforth refer to a tibble that contains sparse columns as a **sparse tibble**. Sparse matrices require all elements to be of the same type: all logical, all integer, or all double. This limitation does not apply to sparse tibbles, which can hold both sparse and dense columns and mix numeric, factor, datetime, and other types.
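
For illustration, a minimal sketch of mixing sparse and dense columns in one tibble. This assumes the `sparse_double()` constructor from sparsevctrs, which takes the non-zero values, their positions, and the total length:

```r
library(tibble)
library(sparsevctrs)

sparse_tbl <- tibble(
  x_sparse = sparse_double(c(100, 1), c(1, 22), 25),   # sparse numeric column
  x_dense  = rnorm(25),                                # ordinary dense column
  group    = factor(rep(c("a", "b"), length.out = 25)) # factor column
)
```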

Sparsity mostly matters for the predictors; the outcomes, case weights, and predictions are handled densely.

The sections below outline how sparse data (matrices and tibbles) work with the various tidymodels packages.

## rsample

The resampling functions from the rsample package work with sparse tibbles out of the box. However, they won't work with sparse matrices. Instead, use the `coerce_to_sparse_tibble()` function from the sparsevctrs package to turn the sparse matrix into a sparse tibble and proceed with that object.

```r
library(sparsevctrs)
data_tbl <- coerce_to_sparse_tibble(data_mat)
```
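
To make that concrete, a minimal sketch; the sparse matrix here is randomly generated for illustration, and coercion requires column names:

```r
library(Matrix)
library(sparsevctrs)
library(rsample)

# An example sparse matrix standing in for real data
data_mat <- rsparsematrix(100, 10, density = 0.1)
colnames(data_mat) <- paste0("x", 1:10)

data_tbl <- coerce_to_sparse_tibble(data_mat)

# Resampling then works on the sparse tibble like any other tibble
folds <- vfold_cv(data_tbl, v = 10)
```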

## recipes

The recipes package receives data in `recipe()`, `prep()`, and `bake()`. These functions all handle sparse data in the same manner.

Sparse tibbles should work as normal. Sparse matrices are accepted, converted to sparse tibbles right away, and flow through the internals in that form.

The `composition` argument of `bake()` understands sparse tibbles. When `composition = "dgCMatrix"` the resulting sparse matrix will be created from the sparse tibble with minimal overhead.
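
For example, a sketch using the ames data from modeldata, where the dummy step produces mostly-zero indicator columns:

```r
library(recipes)
library(modeldata)
data(ames)

rec <- recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area, data = ames) |>
  step_dummy(Neighborhood) |>
  prep()

# Request the baked result as a sparse matrix directly
sparse_out <- bake(rec, new_data = NULL, composition = "dgCMatrix")
```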

Recipe steps themselves don't know how to handle sparse vectors yet. In most cases, this means that sparse vectors will be materialized. This is a known issue and is planned to be fixed.

## parsnip

The parsnip package receives data in `fit()`, `fit_xy()`, and `predict()`.

Sparse data does not yet work with the formula interface for `fit()`. This is a known limitation; fixing it would require substantial changes to how the formula is processed, so it is at best a long-term goal.

For `fit_xy()`, both sparse tibbles and sparse matrices are supported. Sparse matrices are turned into sparse tibbles early on. When fitting a model in parsnip, the package checks whether the engine supports sparse matrices using the `allow_sparse_x` specification. A warning is thrown if sparse data is passed to an engine that doesn't support it, informing the user of that fact and that the data will be converted to a dense representation.
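
A sketch with the xgboost engine, which declares sparse support. This assumes the xgboost package is installed; the data is randomly generated for illustration:

```r
library(parsnip)
library(Matrix)

x_mat <- rsparsematrix(100, 5, density = 0.2)
colnames(x_mat) <- paste0("x", 1:5)
y <- rnorm(100)

spec <- boost_tree(mode = "regression") |>
  set_engine("xgboost")

# The sparse predictors stay sparse on their way to the engine
fit <- fit_xy(spec, x = x_mat, y = y)
```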

`predict()` works with both sparse tibbles and sparse matrices. Sparse matrices are turned into sparse tibbles right away, then converted into whatever format the model engine expects before prediction.

## workflows

The workflows package receives data in `fit()` and `predict()`.

Sparse data does not yet work with the formula interface for `fit()`. This is a known limitation; fixing it would require substantial changes to how the formula is processed, so it is at best a long-term goal.

Both `fit()` and `predict()` work with sparse matrices and sparse tibbles when using recipes or variables as preprocessors, turning sparse matrices into sparse tibbles at the earliest convenience. Most of the checking and functionality is delegated to recipes and parsnip.
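
A sketch of the recipe-preprocessor path, again assuming xgboost is installed and using randomly generated data:

```r
library(tidymodels)
library(sparsevctrs)
library(Matrix)

x_mat <- rsparsematrix(100, 5, density = 0.2)
colnames(x_mat) <- paste0("x", 1:5)
data_tbl <- coerce_to_sparse_tibble(x_mat)
data_tbl$y <- rnorm(100)

wf <- workflow() |>
  add_recipe(recipe(y ~ ., data = data_tbl)) |>
  add_model(boost_tree(mode = "regression") |> set_engine("xgboost"))

fitted_wf <- fit(wf, data = data_tbl)
```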

## All other packages

All other packages should work out of the box with sparse tibbles.
92 changes: 92 additions & 0 deletions learn/develop/sparse-data/index.qmd
@@ -0,0 +1,92 @@
---
title: "How Sparse Data are Used in tidymodels"
categories:
- sparse data
type: learn-subsection
weight: 1
description: |
Design decisions around the use of sparse data in tidymodels.
toc: true
toc-depth: 2
include-after-body: ../../../resources.html
---

```{r}
#| label: "setup"
#| include: false
#| message: false
#| warning: false
source(here::here("common.R"))
set.seed(1234)
```

```{r}
#| label: "ex_setup"
#| include: false
library(tidymodels)
library(modeldata)
pkgs <- c("tidymodels", "modeldata")
theme_set(theme_bw() + theme(legend.position = "top"))
```

## What is sparse data?

We use the term **sparse data** to denote a data set that contains a lot of 0s. Such data commonly arises from categorical variables, text tokenization, or graph data sets. The word sparse describes how the information is packed: most of the values are zero. For some tasks, we can easily get above 99% 0s in the predictors.

The reason we use sparse data as a construct is that it is a lot more memory efficient to store the positions and values of the non-zero entries than to encode all the values. One could think of this as compression, but compression done in such a way that data tasks remain fast. The following vector requires 25 values to store normally (dense representation). This representation will be referred to as a **dense vector**.

```r
c(100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0)
```
The sparse representation of this vector requires only 5 values: one for the length (25), two for the locations of the non-zero entries (1 and 22), and two for the non-zero values themselves (100 and 1). This idea extends to matrices as well, as is done in the Matrix package.
> **Collaborator:** We need an additional value for the default, right?
>
> **Member Author:** I left out that detail here. It is technically not needed, but something I have added specifically in {sparsevctrs} to allow for some cool interactions.


The Matrix package implements sparse vectors and sparse matrices as S4 objects. These objects are the go-to objects in the R ecosystem for dealing with sparse data.

The [tibble](https://tibble.tidyverse.org/) is used as the main carrier of data inside tidymodels. While matrices or data frames are accepted as input in different places, they are converted to tibbles early to take advantage of the benefits and checks they provide. This is where the [sparsevctrs](https://github.com/r-lib/sparsevctrs) package comes in. The sparse vectors and matrices from the Matrix package don't work with data frames or tibbles and thus cannot be used directly within tidymodels. The sparsevctrs package allows for the creation of [ALTREP](https://svn.r-project.org/R/branches/ALTREP/ALTREP.html) vectors that act like normal numeric vectors but are encoded sparsely. If an operation on the vector can't be done sparsely, it will fall back and **materialize**. Materialization means that it will generate and cache a dense version of the vector, which is then used.

These sparse vectors can be put into tibbles and are what allow tidymodels to handle sparse data. We will henceforth refer to a tibble that contains sparse columns as a **sparse tibble**. Sparse matrices require all elements to be of the same type: all logical, all integer, or all double. This limitation does not apply to sparse tibbles, which can hold both sparse and dense columns and mix numeric, factor, datetime, and other types.

Sparsity mostly matters for the predictors; the outcomes, case weights, and predictions are handled densely.

The sections below outline how sparse data (matrices and tibbles) work with the various tidymodels packages.

## rsample

The resampling functions from the rsample package work with sparse tibbles out of the box. However, they won't work with sparse matrices. Instead, use the `coerce_to_sparse_tibble()` function from the sparsevctrs package to turn the sparse matrix into a sparse tibble and proceed with that object.

```r
library(sparsevctrs)
data_tbl <- coerce_to_sparse_tibble(data_mat)
```

> **Member:** Maybe this should be a comment in sparsevctrs, but a better name would be `coerce_matrix_to_sparse_tibble()`, or make it an S3 method (the latter seems like a better option to me).
>
> **Member Author:** Good idea (about the S3 method). Added an issue here: r-lib/sparsevctrs#78
> **Collaborator:** Assuming there's a conversation somewhere in rsample about this, but can we happy path this? If it's really just a matter of coercing, can we do so internally?
>
> **Member Author:** We talked about it in a team meeting. The current plan is to ask people to manually convert sparse matrices to sparse tibbles for use in {rsample} functions. We could let {rsample} functions take sparse matrices or matrices and turn the data into tibbles, but IMO it feels like we are doing too much with {rsample} functions. I'm also not holding on to this belief too tightly.


## recipes

The recipes package receives data in `recipe()`, `prep()`, and `bake()`. These functions all handle sparse data in the same manner.

Sparse tibbles should work as normal. Sparse matrices are accepted, converted to sparse tibbles right away, and flow through the internals in that form.

The `composition` argument of `bake()` understands sparse tibbles. When `composition = "dgCMatrix"` the resulting sparse matrix will be created from the sparse tibble with minimal overhead.

Recipe steps themselves don't know how to handle sparse vectors yet. In most cases, this means that sparse vectors will be materialized. This is a known issue and is planned to be fixed.

## parsnip

The parsnip package receives data in `fit()`, `fit_xy()`, and `predict()`.

Sparse data does not yet work with the formula interface for `fit()`. This is a known limitation; fixing it would require substantial changes to how the formula is processed, so it is at best a long-term goal.
> **Collaborator:** This would require a rewrite of `model.matrix()`, no? (For Max and Hannah, more info.) This verbiage feels a bit optimistic to me.
>
> **Member Author** (Oct 14, 2024): Yes, that is correct. We can rewrite to be less optimistic 😄


For `fit_xy()`, both sparse tibbles and sparse matrices are supported. Sparse matrices are turned into sparse tibbles early on. When fitting a model in parsnip, the package checks whether the engine supports sparse matrices using the `allow_sparse_x` specification. A warning is thrown if sparse data is passed to an engine that doesn't support it, informing the user of that fact and that the data will be converted to a dense representation.

`predict()` works with both sparse tibbles and sparse matrices. Sparse matrices are turned into sparse tibbles right away, then converted into whatever format the model engine expects before prediction.

## workflows

The workflows package receives data in `fit()` and `predict()`.

Sparse data does not yet work with the formula interface for `fit()`. This is a known limitation; fixing it would require substantial changes to how the formula is processed, so it is at best a long-term goal.

Both `fit()` and `predict()` work with sparse matrices and sparse tibbles when using recipes or variables as preprocessors, turning sparse matrices into sparse tibbles at the earliest convenience. Most of the checking and functionality is delegated to recipes and parsnip.

## All other packages

All other packages should work out of the box with sparse tibbles.
1 change: 1 addition & 0 deletions learn/index-listing.json
@@ -18,6 +18,7 @@
"/learn/statistics/survival-metrics/index.html",
"/start/resampling/index.html",
"/learn/work/fairness-readmission/index.html",
"/learn/develop/sparse-data/index.html",
"/learn/statistics/survival-case-study/index.html",
"/learn/develop/models/index.html",
"/learn/develop/parameters/index.html",
1 change: 1 addition & 0 deletions learn/index.html.md
@@ -18,5 +18,6 @@ listing:




After you know [what you need to get started](/start/) with tidymodels, you can learn more and go further. Find articles here to help you solve specific problems using the tidymodels framework.

4 changes: 2 additions & 2 deletions site_libs/bootstrap/bootstrap.min.css

Large diffs are not rendered by default.