Allow handling of categorical predictor variables #44

Closed
tiemvanderdeure opened this issue Feb 6, 2024 · 4 comments

Labels
enhancement New feature or request

Comments

@tiemvanderdeure
Contributor

When fitting a model with fit!, any categorical predictor variables are converted to floating-point values before the data is passed to GLM.lm, so any information about levels is lost and the predictor is treated as if it were continuous.

fit_data_scitype does say categorical values aren't allowed, which leads me to think there might be some particular reason that categorical predictors are handled this way?

If there isn't, I'll go ahead and write a PR later this week. GLM.lm supports categorical predictor values, so I can't immediately see why this should be a problem.

using MLJBase, MLJGLMInterface, GLM, Distributions, CategoricalArrays

n = 20
X = (a = rand(n), b = categorical(rand(Binomial(3, 0.5), n)))  # b is categorical
response = X.a .> rand(n)

# fit through the MLJ interface: b is silently coerced to Float64
mach = machine(LinearBinaryClassifier(), X, categorical(response))
fit!(mach)
f = mach.report[:fit]
print(f)

# fit the same data directly with GLM: b is coded into separate levels
m = GLM.lm(@formula(y ~ a + b), merge(X, (; y = response)))
print(m)

gives

┌ Warning: The number and/or types of data arguments do not match what the specified model
│ supports. Suppress this type check by specifying `scitype_check_level=0`.
│ 
│ Run `@doc GLM.LinearBinaryClassifier` to learn more about your model's requirements.
│ 
│ Commonly, but non exclusively, supervised models are constructed using the syntax
│ `machine(model, X, y)` or `machine(model, X, y, w)` while most other models are
│ constructed with `machine(model, X)`.  Here `X` are features, `y` a target, and `w`
│ sample or class weights.
│ 
│ In general, data in `machine(model, data...)` is expected to satisfy
│ 
│     scitype(data) <: MLJ.fit_data_scitype(model)
│ 
│ In the present case:
│ 
│ scitype(data) = Tuple{Table{Union{AbstractVector{ScientificTypesBase.Continuous}, AbstractVector{Multiclass{4}}}}, AbstractVector{Multiclass{2}}}
│ 
│ fit_data_scitype(model) = Union{Tuple{Table{<:AbstractVector{<:ScientificTypesBase.Continuous}}, AbstractVector{<:Binary}}, Tuple{Table{<:AbstractVector{<:ScientificTypesBase.Continuous}}, AbstractVector{<:Binary}, AbstractVector{<:Union{ScientificTypesBase.Continuous, Count}}}}
└ @ MLJBase ~/.julia/packages/MLJBase/mIaqI/src/machines.jl:231
[ Info: Training machine(LinearBinaryClassifier(fit_intercept = true, …), …).
(stderror = [5.737179687127579, 1.5035639756193406, 5.3873328870352815], dof_residual = 18.0, vcov = [32.91523076238931 5.311462726994682 -28.65288472943233; 5.311462726994682 2.2607046287802373 -6.80973765191617; -28.65288472943233 -6.80973765191617 29.023355635731903], deviance = 10.020776256140987, coef_table = ──────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      z  Pr(>|z|)   Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
a             13.0482      5.73718   2.27    0.0229    1.80349    24.2928
b              2.27576     1.50356   1.51    0.1301   -0.671174    5.22269
(Intercept)  -12.1069      5.38733  -2.25    0.0246  -22.6659     -1.54793
──────────────────────────────────────────────────────────────────────────)StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

y ~ 1 + a + b

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)  -0.726863     0.388329  -1.87    0.0809  -1.55457    0.100841
a             1.46431      0.342933   4.27    0.0007   0.733371   2.19526
b: 1          0.0888761    0.389225   0.23    0.8225  -0.740737   0.918489
b: 2          0.397777     0.365725   1.09    0.2939  -0.381747   1.1773
b: 3          0.690686     0.411623   1.68    0.1141  -0.186668   1.56804
──────────────────────────────────────────────────────────────────────────
@ablaom
Member

ablaom commented Feb 6, 2024

Thanks for chiming in here @tiemvanderdeure, and for the offer of help.

Yes, this is not a bug but a feature limitation. I expect (but haven't checked) that GLM is just one-hot encoding here, so as a workaround you could use MLJ's ContinuousEncoder() in a pipeline ContinuousEncoder() |> LinearBinaryClassifier().
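
For concreteness, a rough sketch of that workaround, assuming the X and response from the example above and that MLJModels (for ContinuousEncoder) is available:

using MLJBase, MLJModels, MLJGLMInterface, CategoricalArrays

# hypothetical pipeline: one-hot encode the categorical columns before the GLM
# wrapper sees them, so everything it receives is Continuous
pipe = ContinuousEncoder() |> LinearBinaryClassifier()
mach = machine(pipe, X, categorical(response))
fit!(mach)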

The ordinary way of extending functionality in this case starts by expanding the input_scitype declaration for the model. In the code I see this is set in a metadata_model block using the alias input. The new declaration would be

input = Table(Continuous, Finite)

This is a contract that the user can supply any table for the input X, so long as each column is either AbstractFloat or CategoricalValue (i.e. the column is a CategoricalVector), and that CategoricalValue features will be treated as unordered factors by the core algorithm. If GLM can only handle ordered factors, then replace Finite with OrderedFactor and CategoricalValue with ordered CategoricalValue.
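
For illustration only, the widened declaration might look something like this (the input line is the one under discussion; the target value shown is just a placeholder, not copied from the package source):

import MLJModelInterface as MMI

MMI.metadata_model(
    LinearBinaryClassifier,
    input = MMI.Table(MMI.Continuous, MMI.Finite),   # now admits categorical columns
    target = AbstractVector{<:MMI.Finite{2}},        # placeholder; keep the existing value
)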

Next, it will be up to the implementation to ensure it passes the categorical columns on in the form that GLM expects. I don't know what that is - Integer? - or do you have to explicitly pass some metadata listing the categorical feature indices?

It might also be a good idea to store the class pools in the fitresult and have predict check that categorical features in the new input Xnew have consistent class pools. A naive user might think they can just convert integer columns using categorical separately for test and train. If a column in test is missing a class, this could result in wrong behaviour.
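
Purely as an illustration of that check (the helper name and the idea of a feature-name => levels mapping stored in the fitresult are assumptions, not existing code):

using Tables, CategoricalArrays

# `training_levels` is assumed to map each categorical feature name to the levels
# seen during fit; predict could call this on Xnew before building the model matrix
function check_class_pools(training_levels, Xnew)
    cols = Tables.columntable(Xnew)
    for (name, levs) in training_levels
        col = getproperty(cols, name)
        col isa CategoricalVector ||
            error("feature $name was categorical during training but is not in Xnew")
        levels(col) == levs ||
            error("feature $name has a different class pool than seen during training")
    end
end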

At present fit_data_scitype is inferred from input_scitype and other traits.

@ablaom ablaom changed the title correctly handle categorical predictor variables Allow handling of categorical predictor variables Feb 6, 2024
@ablaom ablaom added the enhancement New feature or request label Feb 6, 2024
@tiemvanderdeure
Contributor Author

GLM is doing one-hot encoding, yes. (Or, more specifically, I think StatsModels is.)

I think it would be easiest to just pass them on as CategoricalValues and let GLM do the encoding. Right now fit calls Tables.matrix in _matrix_and_features, which is where the categorical values are converted to floats.

I think we could get away with just passing something like merge(Tables.columntable(X), (; y = y_plain)) to GLM.lm and let GLM do all the rest. It would let us do more, not less, data handling, and it would also solve the issue with Float32 values I made an issue about the other day: #42. But maybe I'm missing something?
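
A rough sketch of what that could look like inside fit, assuming the formula is built programmatically with StatsModels' term (names here are illustrative, not the package's actual internals):

using GLM, StatsModels, Tables

function fit_via_table(X, y_plain)
    data = merge(Tables.columntable(X), (; y = y_plain))
    features = collect(Tables.columnnames(Tables.columns(X)))
    rhs = sum(term.(features))           # builds y ~ x1 + x2 + ... programmatically
    GLM.lm(term(:y) ~ rhs, data)         # GLM/StatsModels handle the categorical coding
end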

And yes, I'm all for some basic checks on the data types provided. If we just reconstruct the model matrix using StatsModels.modelcols, I think some of that will already be taken care of, though.
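
For instance, if the applied formula from training is kept in the fitresult, something like this (a sketch, not existing code) would rebuild the model matrix for new data and complain about missing columns:

using StatsModels, Tables

# reapplying the training formula to new data re-uses the contrasts and levels
# captured during fit, and errors if a required column is absent
predict_matrix(fitted_formula, Xnew) =
    modelcols(fitted_formula.rhs, Tables.columntable(Xnew))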

@ablaom
Member

ablaom commented Feb 7, 2024

I think we could get away with just passing something like merge(Tables.columntable(X), (; y = y_plain)) to GLM.lm and let GLM do all the rest. It would let us do more, not less, data handling, and it would also solve the issue with Float32 values I made an issue about the other day: #42. But maybe I'm missing something?

I did not know that GLM handles tabular input, and that CategoricalValues get treated as (unordered) categoricals. If that is the case, then your proposal sounds like a definite improvement.

I expect we should mirror the new handling at the predict stage as well.

@tiemvanderdeure
Contributor Author

Categorical variables are supported as of #45
