-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add sparse data design document #79
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for tidymodels-org ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a first pass for grammar!
```r | ||
library(sparsevctrs) | ||
data_tbl <- coerce_to_sparse_tibble(data_mat) | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming there's a conversation somewhere in rsample about this, but—can we happy path this? If it's really just a matter of coercing, can we do so internally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We talked about it in a team meeting. The current plan was to ask people to manually convert sparse matrices to sparse tibbles, for use in {rsample} functions.
We could let {rsample} functions take sparse matrices or matrices and turn the data to tibbles, but IMO it feels like we are doing too much with {rsample} functions. But I'm also not holding on to this belief to tightly
Co-authored-by: Simon P. Couch <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huzzah!🕺
```r | ||
c(100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0) | ||
``` | ||
The sparse representation of this vector only requires 5 values. 1 value for the length (25), 2 values for the locations of the non-zero values (1, 22), and 2 values for the non-zero values (100, 1). This idea can also be extended to matrices as is done in the Matrix package. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need an additional value for the default, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left out that detail here. It is technically not needed, but something I have added specifically in {sparsevctrs} to allow for some cool interactions.
|
||
```r | ||
library(sparsevctrs) | ||
data_tbl <- coerce_to_sparse_tibble(data_mat) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should be a comment in sparsevctrs but a better name would be coerce_matrix_to_sparse_tibble()
or to make it an S3 method (the latter seems like a better option to me).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea (about s3 method). Added issue here: r-lib/sparsevctrs#78
Co-authored-by: Max Kuhn <[email protected]> Co-authored-by: Simon P. Couch <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read through the pair of articles and would like to suggest some restructuring motivated by the question "Who is the intended audience?".
I percieve articles on tidymodels.org to be primarily intented for tidymodels users, with the articles under the developer tools
tag intentionally shifting on the user-developer spectrum.
I think all of the content in those two articles is super valuable to have, I would just reshuffle it:
What is now "Model tuning using a sparse matrix" could become "Sparse data in tidymodels" and gain the intro paragraph from the current design doc on what sparse data is, before jumping into the example. The info on what preserves sparsity and what doesn't, currently in the design doc, could be threaded into the application article as we move through resampling, preprocessing, model-fitting, and highlighting the benefits when doing it all many times while tuning. I would also enjoy it if the application was leaning a little bit more towards case study and talked a bit more about the specific problem. Is there a story to be told with this dataset?
The design notes, with details of the parsnip encodings such as allow_sparse_x
, probably work better as a "for us, just public" document. Would the planning repo be a good place for that? Anything of the magnitude of "needs a wiggle around model.matrix()
" should definitely be documented for future us, ideally with a few more notes. If we make the doc explicitly directed at us, those notes can assume advanced knowledge of tidymodels internals and therefore focus on the yet-to-solve problems without having to work on tidymodels context.
And lastly a question about sparsitiy in recipes: Does that mean that a sparse vector in a sparsetibble remains sparse while going through recipes if, and only if, it is untouched by the steps? I.e., if any processing happens on the sparse predictor, it is currently materialized again because the steps themselves don't know yet how to deal with sparsitiy?
This is the first of two planned posts about sparse data in tidymodels.
this first post talks about how sparse data is being used across the different tidymodels packages
the next post will be a worked example