Allow handling of categorical predictor variables #44

Closed
tiemvanderdeure opened this issue Feb 6, 2024 · 4 comments

Labels
enhancement New feature or request

Comments

@tiemvanderdeure
Contributor

When fitting a model with fit!, any categorical predictor variables are converted to floating-point values before the data is passed to GLM.lm, so any information about levels is lost and the predictor is treated as if it were continuous.

fit_data_scitype does say categorical values aren't allowed, which leads me to think there might be some particular reason that categorical predictors are handled this way?

If there isn't, I'll go ahead and write a PR later this week. GLM.lm supports categorical predictor values, so I can't immediately see why this should be a problem.

using MLJBase, MLJGLMInterface, GLM, Distributions, CategoricalArrays

n = 20
X = (a = rand(n), b = categorical(rand(Binomial(3, 0.5), n)))  # b is categorical
response = X.a .> rand(n)

# fit through the MLJ interface: b is silently coerced to Float64
mach = machine(LinearBinaryClassifier(), X, categorical(response))
fit!(mach)
f = mach.report[:fit]
print(f)

# fit the same data directly with GLM: b is coded into separate levels
m = GLM.lm(@formula(y ~ a + b), merge(X, (; y = response)))
print(m)

gives

┌ Warning: The number and/or types of data arguments do not match what the specified model
│ supports. Suppress this type check by specifying `scitype_check_level=0`.
│ 
│ Run `@doc GLM.LinearBinaryClassifier` to learn more about your model's requirements.
│ 
│ Commonly, but non exclusively, supervised models are constructed using the syntax
│ `machine(model, X, y)` or `machine(model, X, y, w)` while most other models are
│ constructed with `machine(model, X)`.  Here `X` are features, `y` a target, and `w`
│ sample or class weights.
│ 
│ In general, data in `machine(model, data...)` is expected to satisfy
│ 
│     scitype(data) <: MLJ.fit_data_scitype(model)
│ 
│ In the present case:
│ 
│ scitype(data) = Tuple{Table{Union{AbstractVector{ScientificTypesBase.Continuous}, AbstractVector{Multiclass{4}}}}, AbstractVector{Multiclass{2}}}
│ 
│ fit_data_scitype(model) = Union{Tuple{Table{<:AbstractVector{<:ScientificTypesBase.Continuous}}, AbstractVector{<:Binary}}, Tuple{Table{<:AbstractVector{<:ScientificTypesBase.Continuous}}, AbstractVector{<:Binary}, AbstractVector{<:Union{ScientificTypesBase.Continuous, Count}}}}
└ @ MLJBase ~/.julia/packages/MLJBase/mIaqI/src/machines.jl:231
[ Info: Training machine(LinearBinaryClassifier(fit_intercept = true, …), …).
(stderror = [5.737179687127579, 1.5035639756193406, 5.3873328870352815], dof_residual = 18.0, vcov = [32.91523076238931 5.311462726994682 -28.65288472943233; 5.311462726994682 2.2607046287802373 -6.80973765191617; -28.65288472943233 -6.80973765191617 29.023355635731903], deviance = 10.020776256140987, coef_table = ──────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      z  Pr(>|z|)   Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
a             13.0482      5.73718   2.27    0.0229    1.80349    24.2928
b              2.27576     1.50356   1.51    0.1301   -0.671174    5.22269
(Intercept)  -12.1069      5.38733  -2.25    0.0246  -22.6659     -1.54793
──────────────────────────────────────────────────────────────────────────)StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

y ~ 1 + a + b

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)  -0.726863     0.388329  -1.87    0.0809  -1.55457    0.100841
a             1.46431      0.342933   4.27    0.0007   0.733371   2.19526
b: 1          0.0888761    0.389225   0.23    0.8225  -0.740737   0.918489
b: 2          0.397777     0.365725   1.09    0.2939  -0.381747   1.1773
b: 3          0.690686     0.411623   1.68    0.1141  -0.186668   1.56804
──────────────────────────────────────────────────────────────────────────
@ablaom
Member

ablaom commented Feb 6, 2024

Thanks for chiming in here @tiemvanderdeure, and for the offer of help.

Yes, this is not a bug but a feature limitation. I expect (but haven't checked) that GLM is just one-hot encoding here, so as a workaround you could use MLJ's ContinuousEncoder() in a pipeline ContinuousEncoder() |> LinearBinaryClassifier().
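
For concreteness, a rough sketch of that workaround, assuming the X and response from the example above and that MLJModels (for ContinuousEncoder) is available:

using MLJBase, MLJModels, MLJGLMInterface, CategoricalArrays

# hypothetical pipeline: one-hot encode the categorical columns before the GLM
# wrapper sees them, so everything it receives is Continuous
pipe = ContinuousEncoder() |> LinearBinaryClassifier()
mach = machine(pipe, X, categorical(response))
fit!(mach)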

The ordinary way of extending functionality in this case starts by expanding the input_scitype declaration for the model. In the code I see this is set in a metadata_model block using the alias input. The new declaration would be

input = Table(Continuous, Finite)

This is a contract that the user can supply any table for the input X, so long as each column is either AbstractFloat or CategoricalValue (i.e. the column is a CategoricalVector), and that CategoricalValue features will be treated as unordered factors by the core algorithm. If GLM can only handle ordered factors, then replace Finite with OrderedFactor and CategoricalValue with ordered CategoricalValue.
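
For illustration only, the widened declaration might look something like this (the input line is the one under discussion; the target value shown is just a placeholder, not copied from the package source):

import MLJModelInterface as MMI

MMI.metadata_model(
    LinearBinaryClassifier,
    input = MMI.Table(MMI.Continuous, MMI.Finite),   # now admits categorical columns
    target = AbstractVector{<:MMI.Finite{2}},        # placeholder; keep the existing value
)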

Next, it will be up to the implementation to ensure it passes the categorical columns on in the form that GLM expects. I don't know what that is - Integer? - or do you have to explicitly pass some metadata listing the categorical feature indices?

It might also be a good idea to store the class pools in the fitresult and have predict check that categorical features in the new input Xnew have consistent class pools. A naive user might think they can just convert integer columns using categorical separately for test and train. If a column in test is missing a class, this could result in wrong behaviour.
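
Purely as an illustration of that check (the helper name and the idea of a feature-name => levels mapping stored in the fitresult are assumptions, not existing code):

using Tables, CategoricalArrays

# `training_levels` is assumed to map each categorical feature name to the levels
# seen during fit; predict could call this on Xnew before building the model matrix
function check_class_pools(training_levels, Xnew)
    cols = Tables.columntable(Xnew)
    for (name, levs) in training_levels
        col = getproperty(cols, name)
        col isa CategoricalVector ||
            error("feature $name was categorical during training but is not in Xnew")
        levels(col) == levs ||
            error("feature $name has a different class pool than seen during training")
    end
end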

At present fit_data_scitype is inferred from input_scitype and other traits.

@ablaom ablaom changed the title correctly handle categorical predictor variables Allow handling of categorical predictor variables Feb 6, 2024
@ablaom ablaom added the enhancement New feature or request label Feb 6, 2024
@tiemvanderdeure
Contributor Author

GLM is doing one-hot encoding, yes. (Or, more specifically, I think StatsModels is.)

I think it would be easiest to just pass them on as CategoricalValues and let GLM do the encoding. Right now fit calls Tables.matrix in _matrix_and_features, which is where the categorical values are converted to floats.

I think we could get away with just passing something like merge(Tables.columntable(X), (; y = y_plain)) to GLM.lm and let GLM do all the rest. It would let us do more, not less, data handling, and it would also solve the issue with Float32 values I made an issue about the other day: #42. But maybe I'm missing something?
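
A rough sketch of what that could look like inside fit, assuming the formula is built programmatically with StatsModels' term (names here are illustrative, not the package's actual internals):

using GLM, StatsModels, Tables

function fit_via_table(X, y_plain)
    data = merge(Tables.columntable(X), (; y = y_plain))
    features = collect(Tables.columnnames(Tables.columns(X)))
    rhs = sum(term.(features))           # builds y ~ x1 + x2 + ... programmatically
    GLM.lm(term(:y) ~ rhs, data)         # GLM/StatsModels handle the categorical coding
end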

And yes, I'm all for some basic checks on the data types provided. If we just reconstruct the model matrix using StatsModels.modelcols, I think some of that will already be taken care of, though.
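
For instance, if the applied formula from training is kept in the fitresult, something like this (a sketch, not existing code) would rebuild the model matrix for new data and complain about missing columns:

using StatsModels, Tables

# reapplying the training formula to new data re-uses the contrasts and levels
# captured during fit, and errors if a required column is absent
predict_matrix(fitted_formula, Xnew) =
    modelcols(fitted_formula.rhs, Tables.columntable(Xnew))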

@ablaom
Member

ablaom commented Feb 7, 2024

I think we could get away with just passing something like merge(Tables.columntable(X), (; y = y_plain)) to GLM.lm and let GLM do all the rest. It would let us do more, not less, data handling, and it would also solve the issue with Float32 values I made an issue about the other day: #42. But maybe I'm missing something?

I did not know that GLM handles tabular input, and that CategoricalValues get treated as (unordered) categoricals. If that is the case, then your proposal sounds like a definite improvement.

I expect we should mirror the new handling at the predict stage as well.

@tiemvanderdeure
Contributor Author

Categorical variables are supported as of #45
