
Retain data in tabular form #45

Merged · 7 commits · Feb 26, 2024
Conversation

tiemvanderdeure (Contributor)

I rewrote parts of this package to interface with GLM.jl more directly. Hopefully this ensures that all the functionality GLM.jl has is also available through MLJ.

This pull request solves both #44 and #42.

I removed _matrix_and_features, as this is where variable type information (CategoricalVector) was lost.

In order to predict properly with categorical values, we need the entire object returned by GLM.glm, as it contains information about the classes. For now, I simply removed FitResult and return the TableRegressionModel returned by GLM.glm instead (if this is problematic, we can revert it and find some other solution). I also changed predict so that we call GLM.predict rather than constructing the model matrix in this package. GLM.predict calls StatsModels.modelcols, which does the dummy-encoding for categorical variables.
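A minimal sketch of the dummy-encoding StatsModels performs when it builds a model matrix from a table with categorical columns (the column names y and x and the data are made up for illustration):

```julia
using DataFrames, CategoricalArrays, StatsModels

df = DataFrame(y = [1.0, 2.0, 3.0, 4.0],
               x = categorical(["a", "b", "a", "c"]))

# "a" becomes the reference level; "b" and "c" each get a dummy column,
# so the matrix has columns: intercept, x: b, x: c.
X = modelmatrix(@formula(y ~ 1 + x), df)
size(X)  # (4, 3)
```

This level information is what lives inside the fitted object: at predict time the same levels must be reused, which is why the fitted TableRegressionModel (or its formula plus schema) is needed.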

@tiemvanderdeure (Contributor, Author)

After reading some old PRs I realised we need FitResult, so I added it back.

I just added the FormulaTerm as a field in FitResult, so we can reconstruct the model matrix without saving the entire TableRegressionModel.
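The idea can be sketched roughly as below; the struct and field names here are hypothetical, not the package's exact code:

```julia
using StatsModels

# Storing the FormulaTerm alongside the coefficients is enough to
# reconstruct the model matrix for new data, without keeping the whole
# TableRegressionModel around.
struct FitResult{F<:FormulaTerm, V<:AbstractVector}
    formula::F   # re-applied at predict time to dummy-encode new data
    coefs::V     # fitted coefficients
end

fr = FitResult(@formula(y ~ 1 + x), [0.1, 2.0, 3.0])
fr.formula isa FormulaTerm  # true
```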

Review thread on test/runtests.jl (outdated, resolved)
codecov bot commented Feb 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.46%. Comparing base (7f48db6) to head (51d7d52).
Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #45      +/-   ##
==========================================
+ Coverage   96.79%   97.46%   +0.67%     
==========================================
  Files           1        1              
  Lines         187      158      -29     
==========================================
- Hits          181      154      -27     
+ Misses          6        4       -2     


@ablaom (Member) commented Feb 15, 2024

Thanks @tiemvanderdeure for this substantial PR 🙏🏾

In testing locally, I was struggling to cook up data sets for which Cholesky factorization would not fail, so I tried a simple one-dimensional example (here adapted from the GLM docs). But it's not working either:

using CategoricalArrays, StableRNGs, MLJModelInterface
using MLJGLMInterface # dev'ed from the source branch of this PR
rng = StableRNG(1); # Ensure example can be reproduced

y = rand(rng, 100)
X = (;x = categorical(repeat([1, 2, 3, 4], 25)))

model = LinearRegressor()
MLJModelInterface.fit(model, 0, X, y)

# ERROR: PosDefException: matrix is not positive definite; Cholesky factorization failed.
# Stacktrace:
#   [1] checkpositivedefinite
#     @ /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/LinearAlgebra/src/factorization.jl:67 [inlined]
#   [2] #cholesky!#140
#     @ /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/LinearAlgebra/src/cholesky.jl:269 [inlined]
#   [3] cholesky! (repeats 2 times)
#     @ /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/LinearAlgebra/src/cholesky.jl:267 [inlined]

Am I missing something?

@tiemvanderdeure (Contributor, Author)

Good catch! Clearly we need some more tests.

I just looked at this and there are a few things going on here.

Firstly, dropcollinear defaults to true in GLM.lm, while it defaults to false in LinearRegressor. Setting it to true fixes the error. In my opinion, since this package is an interface, we should refrain from overriding defaults set by GLM. (I also just noticed that neither LinearCountRegressor nor LinearBinaryClassifier has a dropcollinear field; I think this might be an oversight.)
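The difference between the two defaults can be seen directly in GLM. The data below are made up; the third column of X is exactly twice the second, so X is rank-deficient:

```julia
using GLM

X = [ones(4) [1.0, 2.0, 3.0, 4.0] [2.0, 4.0, 6.0, 8.0]]
y = [1.0, 2.0, 3.0, 4.0]

# GLM's own default, dropcollinear = true, handles the rank deficiency:
m = lm(X, y)
# lm(X, y; dropcollinear = false)  # would throw PosDefException,
                                   # as in the report above
```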

Secondly, it appears that the position of the intercept in a formula matters. StatsModels.@formula always puts the intercept as the first term, whereas in this package glm_formula puts it at the end. Reversing this so the intercept is the first term also fixes the error (I have no clue why, but it does).
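A sketch of building the formula programmatically with the intercept first, mirroring what @formula produces (the names y, x1, x2 are placeholders; only public StatsModels functions are used here, not the package's glm_formula):

```julia
using StatsModels

# term.((1, :x1, :x2)) makes (ConstantTerm, Term, Term);
# summing them joins the terms, with the intercept first.
rhs = sum(term.((1, :x1, :x2)))
f = FormulaTerm(term(:y), rhs)   # equivalent to @formula(y ~ 1 + x1 + x2)
```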

I'll implement the second fix straight away. Changing the default settings might be a bit trickier; what do you think?

@ablaom (Member) commented Feb 21, 2024

> Changing default settings might be a bit trickier, what do you think?

Generally we mirror the defaults of the wrapped model. In this commit (some time back) I see that allowrankdeficient=false was replaced with dropcollinear=false, which is effectively a reversal.

@OkonSamuel Do you remember a reason for making the new default false?

@tiemvanderdeure (Contributor, Author)

Maybe we should open a separate PR for the defaults, just so things don't get too mixed up.

Checks pass, and as far as I am concerned this PR is good to go.

@ablaom (Member) left a review comment

Agreed. Good to go.

Thanks @tiemvanderdeure for your valuable contribution and patience. 🙏🏾

@rikhuijzer (Member)

Thanks @tiemvanderdeure! I don't know the details of this PR, but I see 160 lines removed and only 100 added, so that looks very good to me! Thanks to you and Anthony for improving this package 😃
