Sparse data modeling using a Matrix #80

Open
wants to merge 5 commits into
base: main
15 changes: 15 additions & 0 deletions _freeze/learn/work/sparse-matrix/index/execute-results/html.json
@@ -0,0 +1,15 @@
{
"hash": "78387eefad49a91d8623a5fb76817f3c",
"result": {
"engine": "knitr",
"markdown": "---\ntitle: \"Model tuning using a sparse matrix\"\ncategories:\n - tuning\n - classification\n - sparse data\ntype: learn-subsection\nweight: 1\ndescription: | \n Fitting a model using tidymodels with a sparse matrix as the data.\ntoc: true\ntoc-depth: 2\ninclude-after-body: ../../../resources.html\n---\n\n\n\n\n\n\n\n\n## Introduction\n\nTo use code in this article, you will need to install the following packages: sparsevctrs and tidymodels.\n\nThis article demonstrates how we can use a sparse matrix in tidymodels.\n\n## Example data\n\nThe data we will be using in this article is a larger sample of the [small_fine_foods](https://modeldata.tidymodels.org/reference/small_fine_foods.html) data set from the [modeldata](https://modeldata.tidymodels.org) package. The [raw data](https://snap.stanford.edu/data/web-FineFoods.html) was sliced down to 100,000 rows, tokenized, and saved as a sparse matrix. Data has been saved as [reviews.rds](reviews.rds) and the code to generate this data set is found at [generate-data.R](generate-data.R). This file takes up around 1MB compressed, and around 12MB once loaded into R. This data set is encoded as a sparse matrix from the Matrix package; if we were to turn it into a dense matrix, it would take up 3GB.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nreviews <- readr::read_rds(\"reviews.rds\")\nreviews |> head()\n#> 6 x 24818 sparse Matrix of class \"dgCMatrix\"\n#> [[ suppressing 34 column names 'SCORE', 'a', 'all' ... ]]\n#> \n#> 1 1 2 1 3 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 ......\n#> 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 . . . . . . ......\n#> 3 . 4 . 6 . . . . . . . . . . . 1 5 3 . . . . . . . 2 . . . . . . . . ......\n#> 4 . . . 1 . . . . . . . . 1 1 1 4 1 1 . . . . . . . . . . . . . . . . ......\n#> 5 1 4 . . . . . . . . . . . . . . 1 . . . . . . . . 1 . . . . . . . . ......\n#> 6 . 3 1 2 . . . . . . . . . . . 2 1 1 . . . . . . 4 1 . . . . . . . 
. ......\n#> \n#> .....suppressing 24784 columns in show(); maybe adjust options(max.print=, width=)\n#> ..............................\n```\n:::\n\n\n\n\n## Modeling\n\nWe start by loading tidymodels and the sparsevctrs package. The sparsevctrs package includes some helper functions that will allow us to more easily work with sparse matrices in tidymodels.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nlibrary(sparsevctrs)\n```\n:::\n\n\n\n\nWhile sparse matrices now work in parsnip, recipes, and workflows directly, we can use rsample's sampling functions as well if we turn it into a tibble. The usual `as_tibble()` would turn the object to a dense representation, greatly expanding the object size. However, sparsevctrs' `coerce_to_sparse_tibble()` will create a tibble with sparse columns, which we call a **sparse tibble**.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nreviews_tbl <- coerce_to_sparse_tibble(reviews)\nreviews_tbl\n#> # A tibble: 15,000 × 24,818\n#> SCORE a all and appreciates be better bought canned dog finicky\n#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>\n#> 1 1 2 1 3 1 1 2 1 1 1 1\n#> 2 0 0 0 0 0 0 0 0 0 0 0\n#> 3 0 4 0 6 0 0 0 0 0 0 0\n#> 4 0 0 0 1 0 0 0 0 0 0 0\n#> 5 1 4 0 0 0 0 0 0 0 0 0\n#> 6 0 3 1 2 0 0 0 0 0 0 0\n#> 7 1 1 0 3 0 0 0 0 0 0 0\n#> 8 1 0 0 1 0 0 0 0 0 0 0\n#> 9 1 0 0 1 0 0 0 0 0 0 0\n#> 10 1 1 0 0 0 0 0 0 0 2 0\n#> # ℹ 14,990 more rows\n#> # ℹ 24,807 more variables: food <dbl>, found <dbl>, good <dbl>, have <dbl>,\n#> # i <dbl>, is <dbl>, it <dbl>, labrador <dbl>, like <dbl>, looks <dbl>,\n#> # meat <dbl>, more <dbl>, most <dbl>, my <dbl>, of <dbl>, processed <dbl>,\n#> # product <dbl>, products <dbl>, quality <dbl>, several <dbl>, she <dbl>,\n#> # smells <dbl>, stew <dbl>, than <dbl>, the <dbl>, them <dbl>, this <dbl>,\n#> # to <dbl>, vitality <dbl>, actually <dbl>, an <dbl>, arrived <dbl>, …\n```\n:::\n\n\n\n\nDespite this tibble containing 
15,000 rows and a little under 25,000 columns, it takes up only modestly more space than the sparse matrix (about 18 MB versus 13 MB).\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlobstr::obj_size(reviews)\n#> 12.75 MB\nlobstr::obj_size(reviews_tbl)\n#> 18.27 MB\n```\n:::\n\n\n\n\nThe outcome `SCORE` is currently encoded as a double, but we want it to be a factor for it to work well with tidymodels.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nreviews_tbl <- reviews_tbl |>\n mutate(SCORE = factor(SCORE, levels = c(1, 0), labels = c(\"great\", \"other\")))\n```\n:::\n\n\n\n\nSince `reviews_tbl` is now a tibble, we can use `initial_split()` as we usually do.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset.seed(1234)\n\nreview_split <- initial_split(reviews_tbl)\nreview_train <- training(review_split)\nreview_test <- testing(review_split)\n\nreview_folds <- vfold_cv(review_train)\n```\n:::\n\n\n\n\nNext, we will specify our workflow. Since we are showcasing how sparse data works in tidymodels, we will stick to a simple lasso-penalized logistic regression model. These models tend to work well with sparse predictors. `penalty` has been set to be tuned.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nrec_spec <- recipe(SCORE ~ ., data = review_train)\n\nlm_spec <- logistic_reg(penalty = tune()) |>\n set_engine(\"glmnet\")\n\nwf_spec <- workflow(rec_spec, lm_spec)\n```\n:::\n\n\n\n\nWith everything in order, we can now evaluate several different values of `penalty` with `tune_grid()`.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntune_res <- tune_grid(wf_spec, review_folds)\n```\n:::\n\n\n\n\nDespite the size of the data, this code runs quite quickly because of the sparse encoding. 
Once the tuning process is done, we can look at the performance across the different values of `penalty`.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nautoplot(tune_res)\n```\n\n::: {.cell-output-display}\n![](figs/autoplot-1.svg){fig-align='center' width=672}\n:::\n:::\n\n\n\n\nWe can now finalize the workflow and fit the final model on the training data set.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nwf_final <- finalize_workflow(\n wf_spec, \n select_best(tune_res, metric = \"roc_auc\")\n )\n\nwf_fit <- fit(wf_final, review_train)\n```\n:::\n\n\n\n\nWith this fitted model, we can now predict with a sparse tibble.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\npredict(wf_fit, review_test)\n#> # A tibble: 3,750 × 1\n#> .pred_class\n#> <fct> \n#> 1 other \n#> 2 great \n#> 3 great \n#> 4 great \n#> 5 great \n#> 6 great \n#> 7 great \n#> 8 great \n#> 9 great \n#> 10 great \n#> # ℹ 3,740 more rows\n```\n:::\n\n\n\n\n## Session information {#session-info}\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```\n#> ─ Session info ─────────────────────────────────────────────────────\n#> setting value\n#> version R version 4.4.0 (2024-04-24)\n#> os macOS 15.0\n#> system aarch64, darwin20\n#> ui X11\n#> language (EN)\n#> collate en_US.UTF-8\n#> ctype en_US.UTF-8\n#> tz America/Los_Angeles\n#> date 2024-10-14\n#> pandoc 2.17.1.1 @ /opt/homebrew/bin/ (via rmarkdown)\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package * version date (UTC) lib source\n#> broom * 1.0.6 2024-05-17 [1] CRAN (R 4.4.0)\n#> dials * 1.3.0.9000 2024-09-23 [1] local\n#> dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)\n#> ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.4.0)\n#> infer * 1.0.7 2024-03-25 [1] CRAN (R 4.4.0)\n#> parsnip * 1.2.1.9002 2024-10-02 [1] local\n#> purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)\n#> recipes * 1.1.0.9000 2024-10-04 [1] local\n#> rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)\n#> 
rsample * 1.2.1.9000 2024-09-18 [1] Github (tidymodels/rsample@77fc1fe)\n#> sparsevctrs * 0.1.0.9002 2024-09-30 [1] Github (r-lib/sparsevctrs@b29b723)\n#> tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)\n#> tidymodels * 1.2.0 2024-03-25 [1] CRAN (R 4.4.0)\n#> tune * 1.2.1 2024-04-18 [1] CRAN (R 4.4.0)\n#> workflows * 1.1.4.9000 2024-09-24 [1] local\n#> yardstick * 1.3.1 2024-03-21 [1] CRAN (R 4.4.0)\n#> \n#> [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library\n#> \n#> ────────────────────────────────────────────────────────────────────\n```\n:::\n",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}
2 changes: 2 additions & 0 deletions installs.R
@@ -29,6 +29,7 @@ packages <- c(
"kernlab",
"klaR",
"leaflet",
"lobstr",
"mda",
"mlbench",
"modeldata",
@@ -56,6 +57,7 @@ packages <- c(
"sessioninfo",
"readmission",
"skimr",
"sparsevctrs",
"spatialsample",
"stacks",
"stopwords",
1 change: 1 addition & 0 deletions learn/index-listing.json
@@ -24,6 +24,7 @@
"/learn/statistics/infer/index.html",
"/learn/work/bayes-opt/index.html",
"/learn/statistics/k-means/index.html",
"/learn/work/sparse-matrix/index.html",
"/learn/work/tune-svm/index.html",
"/learn/models/time-series/index.html",
"/learn/models/pls/index.html",
1 change: 1 addition & 0 deletions learn/index.html.md
@@ -18,5 +18,6 @@ listing:




After you know [what you need to get started](/start/) with tidymodels, you can learn more and go further. Find articles here to help you solve specific problems using the tidymodels framework.
