Hello,
I am currently trying to use textrecipes in a project as part of our NLP pipeline in connection with {tidymodels}. At one point I came across a problem for which I have not yet found a solution. My problem is that textrecipes::step_tfidf() apparently only generates a dense matrix (in the form of a tibble) and not a sparse matrix (dgCMatrix), and this leads to such a large object that I cannot process it in memory. The details in the documentation for this function also recommend running step_tokenfilter() beforehand for exactly this reason. I would be very reluctant to do this, however, as I assume that in a sparse format - which I already request via a blueprint in the modelling workflow anyway - the resulting object would be sufficiently small. Meanwhile, tidymodels also seems to be able to cope with sparse matrices as input.
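For context, the pattern the documentation points to (tokenize, filter the vocabulary, then tf-idf) looks roughly like this minimal sketch; `label`, `text`, `train_data` and the `max_tokens` value are placeholders of mine, not from the docs:

```r
library(recipes)
library(textrecipes)

rec <- recipe(label ~ text, data = train_data) %>%
  step_tokenize(text) %>%                        # tokenlist representation
  step_tokenfilter(text, max_tokens = 1000) %>%  # the vocabulary cap the docs recommend, which I would like to avoid
  step_tfidf(text)                               # widens into one dense tibble column per remaining token
```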
So my question is: is there a way to convert from a tokenlist representation to a sparse tf-idf (or other document-term matrix) representation in a recipe step, or to use another low-memory format as an intermediate step (such as the format from {tidytext})?
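A workaround I have been considering outside of recipes is to build the sparse document-term matrix directly with {tidytext} and hand it to the model on its own; a rough sketch, assuming a data frame `corpus_df` with `doc_id` and `text` columns (the names are mine, purely for illustration):

```r
library(dplyr)
library(tidytext)

dtm_sparse <- corpus_df %>%
  unnest_tokens(word, text) %>%      # one row per token occurrence
  count(doc_id, word) %>%            # term counts per document
  bind_tf_idf(word, doc_id, n) %>%   # adds tf, idf and tf_idf columns
  cast_sparse(doc_id, word, tf_idf)  # sparse dgCMatrix, documents x terms
```

This gives me the dgCMatrix I am after, but it lives outside the recipe, so it loses the preprocessing-inside-resampling integration that recipes provide - hence the question above.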
It would also be interesting to know whether this is currently only a technical restriction, or whether the idea behind it is that there is no legitimate modelling scenario that cannot be handled (better) with a token filter or another word embedding.
Many thanks in advance and best regards!
Long answer: Right now the recipe forces each step to return a tibble. This works fine for the tokenized state, as it uses a custom class to store it, but once we turn it into numbers, such as with step_tfidf(), we are forced to make it a tibble, hence the dense format. So it is technically correct that tidymodels supports sparse input, but recipes carry the data in a dense format before it is turned sparse, which we know is a blocker for some people.
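For reference, the place where things can become sparse today is the hardhat blueprint attached to the workflow, not the recipe itself; a minimal sketch (the glmnet spec is only an example, and `rec` stands for whatever recipe you end up with):

```r
library(workflows)
library(hardhat)
library(parsnip)

# Ask hardhat to hand the processed predictors to the model as a dgCMatrix.
# The recipe steps themselves still materialise a dense tibble first.
bp <- default_recipe_blueprint(composition = "dgCMatrix")

wf <- workflow() %>%
  add_recipe(rec, blueprint = bp) %>%
  add_model(logistic_reg(penalty = 0.01, mixture = 1) %>% set_engine("glmnet"))
```

So the sparse hand-off helps the model fit, but it does not reduce the memory needed while the recipe is being baked.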
Hey Emil,
thanks for the quick reply. Interesting to hear that a solution for this is already being worked on, and how. However, it seems to me that this is a project that will take some time before it is finalised and can be used productively. Do I understand that correctly? Until then I will probably not be able to work with {tidymodels}, at least not for this specific project.
Thanks again and best regards!