
Retain data in tabular form #45

Merged · 7 commits · Feb 26, 2024
Conversation

tiemvanderdeure (Contributor)

I rewrote parts of this package to interface with GLM.jl more directly. Hopefully this ensures that all the functionality GLM.jl has is also available through MLJ.

This pull request solves both #44 and #42.

I removed _matrix_and_features, as this is where variable type information (CategoricalVector) was lost.

In order to predict properly with categorical values, we need the entire object returned by GLM.glm, as it contains information about the classes. For now, I simply removed FitResult and return the TableRegressionModel returned by GLM.glm instead (if this is problematic, we can revert it and find some other solution). I also changed predict so that we call GLM.predict rather than constructing the model matrix in this package. GLM.predict calls StatsModels.modelcols, which does the dummy-encoding for categorical variables.
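A minimal sketch of the dummy-encoding StatsModels performs when it builds a model matrix from a table with categorical columns (the column names y and x and the data are made up for illustration):

```julia
using DataFrames, CategoricalArrays, StatsModels

df = DataFrame(y = [1.0, 2.0, 3.0, 4.0],
               x = categorical(["a", "b", "a", "c"]))

# "a" becomes the reference level; "b" and "c" each get a dummy column,
# so the matrix has columns: intercept, x: b, x: c.
X = modelmatrix(@formula(y ~ 1 + x), df)
size(X)  # (4, 3)
```

This level information is what lives inside the fitted object: at predict time the same levels must be reused, which is why the fitted TableRegressionModel (or its formula plus schema) is needed.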

@tiemvanderdeure (Contributor, Author)

After reading some old PRs I realised we need FitResult, so I added it back.

I just added the FormulaTerm as a field in FitResult, so we can reconstruct the model matrix without saving the entire TableRegressionModel.
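The idea can be sketched roughly as below; the struct and field names here are hypothetical, not the package's exact code:

```julia
using StatsModels

# Storing the FormulaTerm alongside the coefficients is enough to
# reconstruct the model matrix for new data, without keeping the whole
# TableRegressionModel around.
struct FitResult{F<:FormulaTerm, V<:AbstractVector}
    formula::F   # re-applied at predict time to dummy-encode new data
    coefs::V     # fitted coefficients
end

fr = FitResult(@formula(y ~ 1 + x), [0.1, 2.0, 3.0])
fr.formula isa FormulaTerm  # true
```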

Review thread on test/runtests.jl (outdated, resolved)
codecov bot commented Feb 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.46%. Comparing base (7f48db6) to head (51d7d52).
Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #45      +/-   ##
==========================================
+ Coverage   96.79%   97.46%   +0.67%     
==========================================
  Files           1        1              
  Lines         187      158      -29     
==========================================
- Hits          181      154      -27     
+ Misses          6        4       -2     


@ablaom (Member) commented Feb 15, 2024

Thanks @tiemvanderdeure for this substantial PR 🙏🏾

In testing locally, I was struggling to cook up data sets for which Cholesky factorization would not fail, so I tried a simple one-dimensional example (here adapted from the GLM docs). But it's not working either:

using CategoricalArrays, StableRNGs, MLJModelInterface
using MLJGLMInterface # dev'ed from the source branch of this PR
rng = StableRNG(1); # Ensure example can be reproduced

y = rand(rng, 100)
X = (;x = categorical(repeat([1, 2, 3, 4], 25)))

model = LinearRegressor()
MLJModelInterface.fit(model, 0, X, y)

# ERROR: PosDefException: matrix is not positive definite; Cholesky factorization failed.
# Stacktrace:
#   [1] checkpositivedefinite
#     @ /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/LinearAlgebra/src/factorization.jl:67 [inlined]
#   [2] #cholesky!#140
#     @ /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/LinearAlgebra/src/cholesky.jl:269 [inlined]
#   [3] cholesky! (repeats 2 times)
#     @ /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/LinearAlgebra/src/cholesky.jl:267 [inlined]

Am I missing something?

@tiemvanderdeure (Contributor, Author)

Good catch! Clearly we need some more tests.

I just looked at this and there are a few things going on here.

Firstly, dropcollinear defaults to true in GLM.lm, while it defaults to false in LinearRegressor. Setting it to true fixes the error. In my opinion, since this package is an interface, we should refrain from overriding defaults set by GLM. (I also just noticed that neither LinearCountRegressor nor LinearBinaryClassifier has a dropcollinear field; I think this might be an oversight.)
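The difference between the two defaults can be seen directly in GLM. The data below are made up; the third column of X is exactly twice the second, so X is rank-deficient:

```julia
using GLM

X = [ones(4) [1.0, 2.0, 3.0, 4.0] [2.0, 4.0, 6.0, 8.0]]
y = [1.0, 2.0, 3.0, 4.0]

# GLM's own default, dropcollinear = true, handles the rank deficiency:
m = lm(X, y)
# lm(X, y; dropcollinear = false)  # would throw PosDefException,
                                   # as in the report above
```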

Secondly, it appears that the position of the intercept in a formula matters. StatsModels.@formula always puts the intercept as the first term, whereas in this package glm_formula puts it at the end. Reversing this so the intercept is the first term also fixes the error (I have no clue why, but it does).
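A sketch of building the formula programmatically with the intercept first, mirroring what @formula produces (the names y, x1, x2 are placeholders; only public StatsModels functions are used here, not the package's glm_formula):

```julia
using StatsModels

# term.((1, :x1, :x2)) makes (ConstantTerm, Term, Term);
# summing them joins the terms, with the intercept first.
rhs = sum(term.((1, :x1, :x2)))
f = FormulaTerm(term(:y), rhs)   # equivalent to @formula(y ~ 1 + x1 + x2)
```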

I'll implement the second fix straight away. Changing the default settings might be a bit trickier; what do you think?

@ablaom (Member) commented Feb 21, 2024

> Changing default settings might be a bit trickier, what do you think?

Generally we mirror the defaults of the wrapped model. In this commit (some time back) I see that allowrankdeficient=false was replaced with dropcollinear=false, which is effectively a reversal.

@OkonSamuel Do you remember a reason for making the new default false?

@tiemvanderdeure (Contributor, Author)

Maybe we should open a separate PR for the defaults, just so things don't get too mixed up.

Checks pass, and as far as I am concerned this PR is good to go.

@ablaom (Member) left a review comment

Agreed. Good to go.

Thanks @tiemvanderdeure for your valuable contribution and patience. 🙏🏾

@rikhuijzer (Member)

Thanks @tiemvanderdeure! I don't know the details of this PR, but I see 160 lines removed and only 100 added, so that looks very good to me! Thanks to you and Anthony for improving this package 😃
