add sparse data design document #79

Open · wants to merge 6 commits into base: main
15 changes: 15 additions & 0 deletions _freeze/learn/develop/sparse-data/index/execute-results/html.json
82 changes: 82 additions & 0 deletions learn/develop/sparse-data/index.html.md
@@ -0,0 +1,82 @@
---
title: "How Sparse Data are Used in tidymodels"
categories:
- sparse data
type: learn-subsection
weight: 1
description: |
Design decisions around the use of sparse data in tidymodels.
toc: true
toc-depth: 2
include-after-body: ../../../resources.html
---









## What is sparse data?

We use the term **sparse data** to denote a data set that contains a lot of 0s. Such data commonly arises from categorical variables, text tokenization, or graph data sets. The word sparse describes how the information is packed: most of the values are zero. For some tasks, we can easily get above 99% 0s in the predictors.

The reason we use sparse data as a construct is that it is a lot more memory efficient to store the positions and values of the non-zero entries than to encode all the values. One could think of this as compression, but compression done in such a way that data tasks remain fast. The following vector requires 25 values to store normally (dense representation). This representation will be referred to as a **dense vector**.

```r
c(100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0)
```
The sparse representation of this vector requires only 5 values: one for the length (25), two for the locations of the non-zero entries (1 and 22), and two for the non-zero values themselves (100 and 1). This idea extends to matrices as well, as is done in the Matrix package.

The Matrix package implements sparse vectors and sparse matrices as S4 objects. These objects are the go-to objects in the R ecosystem for dealing with sparse data.
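
As a sketch of the idea, the Matrix package can store the vector above sparsely; the dense copy is built here only for comparison:

```r
library(Matrix)

# Dense representation: all 25 values are stored
dense_vec <- c(100, rep(0, 20), 1, rep(0, 3))

# Sparse representation: only the non-zero values, their positions,
# and the total length are stored
sparse_vec <- sparseVector(x = c(100, 1), i = c(1, 22), length = 25)

# Both encode the same numbers
all(as.vector(sparse_vec) == dense_vec)
```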

The [tibble](https://tibble.tidyverse.org/) is used as the main carrier of data inside tidymodels. While matrices or data frames are accepted as input in different places, they are converted to tibbles early to take advantage of the benefits and checks they provide. This is where the [sparsevctrs](https://github.com/r-lib/sparsevctrs) package comes in. The sparse vectors and matrices from the Matrix package don't work with data frames or tibbles and thus cannot be used directly within tidymodels. The sparsevctrs package allows for the creation of [ALTREP](https://svn.r-project.org/R/branches/ALTREP/ALTREP.html) vectors that act like normal numeric vectors but are encoded sparsely. If an operation on the vector can't be done sparsely, it will fall back and **materialize**. Materialization means that it will generate and cache a dense version of the vector, which is then used.

These sparse vectors can be put into tibbles and are what allow tidymodels to handle sparse data. We will henceforth refer to a tibble that contains sparse columns as a **sparse tibble**. Sparse matrices require all elements to be of the same type: all logical, all integer, or all double. This limitation does not apply to sparse tibbles, which can hold both sparse and dense columns and mix numeric, factor, datetime, and other types.
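
For illustration, a minimal sketch of mixing sparse and dense columns in one tibble. This assumes the `sparse_double()` constructor from sparsevctrs, which takes the non-zero values, their positions, and the total length:

```r
library(tibble)
library(sparsevctrs)

sparse_tbl <- tibble(
  x_sparse = sparse_double(c(100, 1), c(1, 22), 25),   # sparse numeric column
  x_dense  = rnorm(25),                                # ordinary dense column
  group    = factor(rep(c("a", "b"), length.out = 25)) # factor column
)
```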

Sparsity mostly matters for the predictors; the outcomes, case weights, and predictions are handled densely.

The sections below outline how sparse data (matrices and tibbles) work with the various tidymodels packages.

## rsample

The resampling functions from the rsample package work with sparse tibbles out of the box. However, they won't work with sparse matrices. Instead, use the `coerce_to_sparse_tibble()` function from the sparsevctrs package to turn the sparse matrix into a sparse tibble and proceed with that object.

```r
library(sparsevctrs)
data_tbl <- coerce_to_sparse_tibble(data_mat)
```
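
To make that concrete, a minimal sketch; the sparse matrix here is randomly generated for illustration, and coercion requires column names:

```r
library(Matrix)
library(sparsevctrs)
library(rsample)

# An example sparse matrix standing in for real data
data_mat <- rsparsematrix(100, 10, density = 0.1)
colnames(data_mat) <- paste0("x", 1:10)

data_tbl <- coerce_to_sparse_tibble(data_mat)

# Resampling then works on the sparse tibble like any other tibble
folds <- vfold_cv(data_tbl, v = 10)
```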

## recipes

The recipes package receives data in `recipe()`, `prep()`, and `bake()`. These functions all handle sparse data in the same manner.

Sparse tibbles should work as normal. Sparse matrices are accepted, converted to sparse tibbles right away, and flow through the internals in that form.

The `composition` argument of `bake()` understands sparse tibbles. When `composition = "dgCMatrix"` the resulting sparse matrix will be created from the sparse tibble with minimal overhead.
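
For example, a sketch using the ames data from modeldata, where the dummy step produces mostly-zero indicator columns:

```r
library(recipes)
library(modeldata)
data(ames)

rec <- recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area, data = ames) |>
  step_dummy(Neighborhood) |>
  prep()

# Request the baked result as a sparse matrix directly
sparse_out <- bake(rec, new_data = NULL, composition = "dgCMatrix")
```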

Recipe steps themselves don't know how to handle sparse vectors yet. In most cases, this means that sparse vectors will be materialized. This is a known issue and is planned to be fixed.

## parsnip

The parsnip package receives data in `fit()`, `fit_xy()`, and `predict()`.

Sparse data does not yet work with the formula interface for `fit()`. This is a known limitation; fixing it would require substantial changes to how the formula is processed, so it is at best a long-term goal.

For `fit_xy()`, both sparse tibbles and sparse matrices are supported. Sparse matrices are turned into sparse tibbles early on. When fitting a model in parsnip, the package checks whether the engine supports sparse matrices using the `allow_sparse_x` specification. A warning is thrown if sparse data is passed to an engine that doesn't support it, informing the user of that fact and that the data will be converted to a dense representation.
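
A sketch with the xgboost engine, which declares sparse support. This assumes the xgboost package is installed; the data is randomly generated for illustration:

```r
library(parsnip)
library(Matrix)

x_mat <- rsparsematrix(100, 5, density = 0.2)
colnames(x_mat) <- paste0("x", 1:5)
y <- rnorm(100)

spec <- boost_tree(mode = "regression") |>
  set_engine("xgboost")

# The sparse predictors stay sparse on their way to the engine
fit <- fit_xy(spec, x = x_mat, y = y)
```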

`predict()` works with both sparse tibbles and sparse matrices. Sparse matrices are turned into sparse tibbles right away, then converted into whatever format the model engine expects before prediction.

## workflows

The workflows package receives data in `fit()` and `predict()`.

Sparse data does not yet work with the formula interface for `fit()`. This is a known limitation; fixing it would require substantial changes to how the formula is processed, so it is at best a long-term goal.

Both `fit()` and `predict()` work with sparse matrices and sparse tibbles when using recipes or variables as preprocessors, turning sparse matrices into sparse tibbles at the earliest convenience. Most of the checking and functionality is delegated to recipes and parsnip.
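
A sketch of the recipe-preprocessor path, again assuming xgboost is installed and using randomly generated data:

```r
library(tidymodels)
library(sparsevctrs)
library(Matrix)

x_mat <- rsparsematrix(100, 5, density = 0.2)
colnames(x_mat) <- paste0("x", 1:5)
data_tbl <- coerce_to_sparse_tibble(x_mat)
data_tbl$y <- rnorm(100)

wf <- workflow() |>
  add_recipe(recipe(y ~ ., data = data_tbl)) |>
  add_model(boost_tree(mode = "regression") |> set_engine("xgboost"))

fitted_wf <- fit(wf, data = data_tbl)
```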

## All other packages

All other packages should work out of the box with sparse tibbles.
92 changes: 92 additions & 0 deletions learn/develop/sparse-data/index.qmd
@@ -0,0 +1,92 @@
---
title: "How Sparse Data are Used in tidymodels"
categories:
- sparse data
type: learn-subsection
weight: 1
description: |
Design decisions around the use of sparse data in tidymodels.
toc: true
toc-depth: 2
include-after-body: ../../../resources.html
---

```{r}
#| label: "setup"
#| include: false
#| message: false
#| warning: false
source(here::here("common.R"))
set.seed(1234)
```

```{r}
#| label: "ex_setup"
#| include: false
library(tidymodels)
library(modeldata)
pkgs <- c("tidymodels", "modeldata")
theme_set(theme_bw() + theme(legend.position = "top"))
```

## What is sparse data?

We use the term **sparse data** to denote a data set that contains a lot of 0s. Such data commonly arises from categorical variables, text tokenization, or graph data sets. The word sparse describes how the information is packed: most of the values are zero. For some tasks, we can easily get above 99% 0s in the predictors.

The reason we use sparse data as a construct is that it is a lot more memory efficient to store the positions and values of the non-zero entries than to encode all the values. One could think of this as compression, but compression done in such a way that data tasks remain fast. The following vector requires 25 values to store normally (dense representation). This representation will be referred to as a **dense vector**.

```r
c(100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0)
```
The sparse representation of this vector requires only 5 values: one for the length (25), two for the locations of the non-zero entries (1 and 22), and two for the non-zero values themselves (100 and 1). This idea extends to matrices as well, as is done in the Matrix package.
> **Collaborator:** We need an additional value for the default, right?
>
> **Member Author:** I left out that detail here. It is technically not needed, but something I have added specifically in {sparsevctrs} to allow for some cool interactions.


The Matrix package implements sparse vectors and sparse matrices as S4 objects. These objects are the go-to objects in the R ecosystem for dealing with sparse data.

The [tibble](https://tibble.tidyverse.org/) is used as the main carrier of data inside tidymodels. While matrices or data frames are accepted as input in different places, they are converted to tibbles early to take advantage of the benefits and checks they provide. This is where the [sparsevctrs](https://github.com/r-lib/sparsevctrs) package comes in. The sparse vectors and matrices from the Matrix package don't work with data frames or tibbles and thus cannot be used directly within tidymodels. The sparsevctrs package allows for the creation of [ALTREP](https://svn.r-project.org/R/branches/ALTREP/ALTREP.html) vectors that act like normal numeric vectors but are encoded sparsely. If an operation on the vector can't be done sparsely, it will fall back and **materialize**. Materialization means that it will generate and cache a dense version of the vector, which is then used.

These sparse vectors can be put into tibbles and are what allow tidymodels to handle sparse data. We will henceforth refer to a tibble that contains sparse columns as a **sparse tibble**. Sparse matrices require all elements to be of the same type: all logical, all integer, or all double. This limitation does not apply to sparse tibbles, which can hold both sparse and dense columns and mix numeric, factor, datetime, and other types.

Sparsity mostly matters for the predictors; the outcomes, case weights, and predictions are handled densely.

The sections below outline how sparse data (matrices and tibbles) work with the various tidymodels packages.

## rsample

The resampling functions from the rsample package work with sparse tibbles out of the box. However, they won't work with sparse matrices. Instead, use the `coerce_to_sparse_tibble()` function from the sparsevctrs package to turn the sparse matrix into a sparse tibble and proceed with that object.

```r
library(sparsevctrs)
data_tbl <- coerce_to_sparse_tibble(data_mat)
```

> **Member:** Maybe this should be a comment in sparsevctrs, but a better name would be `coerce_matrix_to_sparse_tibble()`, or make it an S3 method (the latter seems like a better option to me).
>
> **Member Author:** Good idea (about the S3 method). Added an issue here: r-lib/sparsevctrs#78
> **Collaborator:** Assuming there's a conversation somewhere in rsample about this, but can we happy path this? If it's really just a matter of coercing, can we do so internally?
>
> **Member Author:** We talked about it in a team meeting. The current plan is to ask people to manually convert sparse matrices to sparse tibbles for use in {rsample} functions. We could let {rsample} functions take sparse matrices or matrices and turn the data into tibbles, but IMO it feels like we are doing too much with {rsample} functions. I'm also not holding on to this belief too tightly.


## recipes

The recipes package receives data in `recipe()`, `prep()`, and `bake()`. These functions all handle sparse data in the same manner.

Sparse tibbles should work as normal. Sparse matrices are accepted, converted to sparse tibbles right away, and flow through the internals in that form.

The `composition` argument of `bake()` understands sparse tibbles. When `composition = "dgCMatrix"` the resulting sparse matrix will be created from the sparse tibble with minimal overhead.

Recipe steps themselves don't know how to handle sparse vectors yet. In most cases, this means that sparse vectors will be materialized. This is a known issue and is planned to be fixed.

## parsnip

The parsnip package receives data in `fit()`, `fit_xy()`, and `predict()`.

Sparse data does not yet work with the formula interface for `fit()`. This is a known limitation; fixing it would require substantial changes to how the formula is processed, so it is at best a long-term goal.
> **Collaborator:** This would require a rewrite of `model.matrix()`, no? (For Max and Hannah, more info.) This verbiage feels a bit optimistic to me.
>
> **Member Author** (Oct 14, 2024): Yes, that is correct. We can rewrite to be less optimistic 😄


For `fit_xy()`, both sparse tibbles and sparse matrices are supported. Sparse matrices are turned into sparse tibbles early on. When fitting a model in parsnip, the package checks whether the engine supports sparse matrices using the `allow_sparse_x` specification. A warning is thrown if sparse data is passed to an engine that doesn't support it, informing the user of that fact and that the data will be converted to a dense representation.

`predict()` works with both sparse tibbles and sparse matrices. Sparse matrices are turned into sparse tibbles right away, then converted into whatever format the model engine expects before prediction.

## workflows

The workflows package receives data in `fit()` and `predict()`.

Sparse data does not yet work with the formula interface for `fit()`. This is a known limitation; fixing it would require substantial changes to how the formula is processed, so it is at best a long-term goal.

Both `fit()` and `predict()` work with sparse matrices and sparse tibbles when using recipes or variables as preprocessors, turning sparse matrices into sparse tibbles at the earliest convenience. Most of the checking and functionality is delegated to recipes and parsnip.

## All other packages

All other packages should work out of the box with sparse tibbles.
1 change: 1 addition & 0 deletions learn/index-listing.json
@@ -18,6 +18,7 @@
"/learn/statistics/survival-metrics/index.html",
"/start/resampling/index.html",
"/learn/work/fairness-readmission/index.html",
"/learn/develop/sparse-data/index.html",
"/learn/statistics/survival-case-study/index.html",
"/learn/develop/models/index.html",
"/learn/develop/parameters/index.html",
1 change: 1 addition & 0 deletions learn/index.html.md
@@ -18,5 +18,6 @@ listing:




After you know [what you need to get started](/start/) with tidymodels, you can learn more and go further. Find articles here to help you solve specific problems using the tidymodels framework.

4 changes: 2 additions & 2 deletions site_libs/bootstrap/bootstrap.min.css

Large diffs are not rendered by default.