Support for fastDummies to overcome memory problems of model.matrix? #35

Open
jarbet opened this issue Jul 25, 2024 · 0 comments
jarbet commented Jul 25, 2024

When the number of predictors is large, building the design matrix with model.matrix through the formula interface uses a very large amount of memory. For example, I get out-of-memory errors when trying to fit a model with 30k predictors on a machine with 100 GB of RAM.

A simple solution is to use the fastDummies R package to convert factor/character features to numeric dummy variables. This is much more memory efficient: I am now running the same model on a computer with 15 GB of RAM, and gc() reports that at most 2 GB of RAM was used.
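For anyone who wants to reproduce the memory comparison, here is a minimal sketch of how the peak usage can be read from gc(). The helper name peak.extra.mb is hypothetical (it is not part of any package), and the result is only as precise as the "max used" statistics that gc() itself reports.

```r
# Hypothetical helper: approximate extra memory (in Mb) used while evaluating `expr`,
# taken from the "used" and "max used" columns of the matrix returned by gc().
peak.extra.mb <- function(expr) {
    before <- gc(reset = TRUE); # collect garbage and reset the "max used" statistics
    used.mb.col <- which(colnames(before) == 'used') + 1L; # the "(Mb)" column after "used"
    max.mb.col <- which(colnames(before) == 'max used') + 1L; # the "(Mb)" column after "max used"
    baseline <- sum(before[, used.mb.col]); # Ncells + Vcells in use before evaluation
    force(expr); # the argument is lazy, so it is evaluated here, after the reset
    after <- gc();
    return(sum(after[, max.mb.col]) - baseline);
    }

# Example usage on a small data set; swap in the real design-matrix call to compare approaches
peak.extra.mb(model.matrix(~ ., data = iris));
```

The same call can wrap the fastDummies-based construction, which gives a like-for-like number on a given data set.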

Here's an example of how to use fastDummies to set up the x matrix:

suppressPackageStartupMessages(library(fastDummies));
data(iris);

x <- iris;
head(x);
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

x.matrix <- as.matrix(fastDummies::dummy_columns(
    .data = x,
    remove_first_dummy = TRUE, # use K-1 dummy variables for a factor with K levels
    remove_selected_columns = TRUE # drop the original factor columns; they are kept by default
    ));
rownames(x.matrix) <- rownames(x); # if patient IDs are stored as rownames, re-add them here
head(x.matrix)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species_versicolor
#> 1          5.1         3.5          1.4         0.2                  0
#> 2          4.9         3.0          1.4         0.2                  0
#> 3          4.7         3.2          1.3         0.2                  0
#> 4          4.6         3.1          1.5         0.2                  0
#> 5          5.0         3.6          1.4         0.2                  0
#> 6          5.4         3.9          1.7         0.4                  0
#>   Species_virginica
#> 1                 0
#> 2                 0
#> 3                 0
#> 4                 0
#> 5                 0
#> 6                 0

Created on 2024-07-25 with reprex v2.0.2
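As a sanity check, the sketch below compares the fastDummies result with what model.matrix produces for the same data. It is only checked on iris, and it assumes the default treatment contrasts and a main-effects formula (~ .); the name x.formula is just for illustration. Column names differ between the two (e.g. Speciesversicolor vs Species_versicolor), so only the values are compared.

```r
library(fastDummies);

# design matrix via fastDummies, built the same way as in the example above
x.matrix <- as.matrix(fastDummies::dummy_columns(
    .data = iris,
    remove_first_dummy = TRUE,
    remove_selected_columns = TRUE
    ));

# design matrix via the formula interface, with the intercept column dropped
mm <- model.matrix(~ ., data = iris);
x.formula <- mm[, colnames(mm) != '(Intercept)', drop = FALSE];

# compare values only, ignoring row and column names
all.equal(unname(x.matrix), unname(x.formula));
```

If this holds on a given data set, swapping one construction for the other should not change the fitted model, only the column names.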
