
RFC: Return fitted model as fitresult #40

Closed
rikhuijzer opened this issue Jan 9, 2024 · 3 comments

rikhuijzer (Member) commented Jan 9, 2024

I tried to use a GLM/StatsBase function to extract information from the fitted model. However, this is impossible because this package does not return the original GLM model (fitted_lm is dropped):

function MMI.fit(model::LinearRegressor, verbosity::Int, X, y, w=nothing)
    # apply the model
    Xmatrix, offset, features = prepare_inputs(model, X)
    y_ = isempty(offset) ? y : y .- offset
    wts = check_weights(w, y_)
    data = glm_data(model, Xmatrix, y_, features)
    form = glm_formula(model, features)
    fitted_lm = GLM.lm(form, data; model.dropcollinear, wts).model
    fitresult = FitResult(
        GLM.coef(fitted_lm), GLM.dispersion(fitted_lm), (features = features,)
    )
    # form the report
    report = glm_report(fitted_lm, features, model.report_keys)
    cache = nothing
    # return
    return fitresult, cache, report
end

This is fine for most use cases, but it has one problem: Julia relies heavily on the fact that people can "attach" arbitrary functions to certain objects. In this case, for example, GLM returns a

StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

object when calling GLM.lm(@formula(y ~ x), data). Subsequently, people can call functions like

GLM.confint(fitted_model)

to get confidence intervals for the coefficients. Currently, GLM.confint cannot be called because the fitted model is dropped by the MLJ interface.
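For concreteness, here is a minimal, self-contained sketch of that kind of usage with GLM directly (the toy data is made up for illustration):

using DataFrames, GLM

# Toy data for illustration only.
data = DataFrame(x = randn(100))
data.y = 2 .* data.x .+ randn(100)

fitted_model = GLM.lm(@formula(y ~ x), data)
GLM.coef(fitted_model)     # coefficient estimates
GLM.confint(fitted_model)  # 95% confidence intervals for the coefficients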

Can we switch MLJGLMInterface over to just returning the fitted model as the fitresult (specifically, fitresult = fitted_lm)? I suggest fully dropping the FitResult struct that is defined in this package. Clients can still obtain information such as the coefficients via the appropriate GLM functions, such as GLM.coef. We can add this and a few other functions to the docstring.
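As a rough sketch, the fit method quoted above would then look something like this (the only change is that the raw fitted model is returned instead of the FitResult wrapper; this is an illustration of the idea, not a worked-out implementation):

function MMI.fit(model::LinearRegressor, verbosity::Int, X, y, w=nothing)
    # apply the model (unchanged from the current implementation)
    Xmatrix, offset, features = prepare_inputs(model, X)
    y_ = isempty(offset) ? y : y .- offset
    wts = check_weights(w, y_)
    data = glm_data(model, Xmatrix, y_, features)
    form = glm_formula(model, features)
    fitted_lm = GLM.lm(form, data; model.dropcollinear, wts).model
    # form the report
    report = glm_report(fitted_lm, features, model.report_keys)
    cache = nothing
    # return the raw fitted model as the fitresult
    return fitted_lm, cache, report
end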

Because this would be a breaking release, I also suggest moving to version 1.0.0, so that future releases can specify whether it's a major, minor, or patch release (https://semver.org).

@ablaom Any thoughts?

rikhuijzer changed the title from "RFC: Just return fitted model as fitresult" to "RFC: Return fitted model as fitresult" on Jan 9, 2024
ablaom (Member) commented Jan 10, 2024

Mmm. I sense your frustration and the simplicity of your suggestion is naturally appealing. But I, for one, have concerns:

1. If I remember correctly, the reason we dropped the GLM object in the first place was that it is not free of training data. This is generally considered bad practice - for example, it is problematic when serializing models trained on big datasets. Do I remember correctly?

2. Under the proposed revision, the casual user who just wants confidence intervals, for example, needs to be familiar with the GLM API to get them. This seems a pity to someone like me, who doesn't use these models often.

3. A lot of development effort has gone into striking the right balance in what is returned by fitted_params. I don't remember all of the discussion, but I should like to see a review of that.

What if we return the raw StatsModels.TableRegressionModel as part of the report? (It may be that the report is also retained when preparing for serialization, but there are other good arguments for not doing so.)

@OkonSamuel may want to comment.

OkonSamuel (Member) commented Jan 10, 2024

@rikhuijzer all of those should have been moved to the model's report. Keeping the glm_model object caused some concerns (see #16), as the model wraps the dataset itself. The model has hyperparameters that you could use to specify what you want to extract from the glm_model.

If it's an absolute must to have the whole glm_object returned, we could add it to the report rather than the fitresult, add a key that controls whether or not it is returned, and also add a warning about memory concerns.
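To illustrate, a hypothetical sketch of the user-facing side (the key name :glm_model is made up here, and the exact mechanism would need to be designed; X is any Tables.jl table and y a vector, as in the usual MLJ workflow):

using MLJ, MLJGLMInterface, GLM

# Hypothetical: ask the interface to also place the raw fitted GLM object in the report.
model = LinearRegressor(report_keys = [:coef_table, :glm_model])
mach = machine(model, X, y)
fit!(mach)

raw = report(mach).glm_model  # raw fitted model, only present when requested
GLM.confint(raw)              # StatsAPI/GLM functions now work directly on it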

rikhuijzer (Member, Author) commented Jan 10, 2024

I sense your frustration and the simplicity of your suggestion is naturally appealing. But I, for one, have concerns:

Yes, I was editing a LaTeX document, while running PDF generation, while thinking about the analysis, while having Pluto open, while reading supervisor comments, while going through StatsBase and GLM. I should probably not have filed an issue immediately and instead let it sit for a day! My apologies.

What if we return the raw StatsModels.TableRegressionModel as part of the report?

Yes, I do think that could work. That would already make some extra functions available, I think.

If it's an absolute must to have the whole glm_object returned, we could add it to the report rather than the fitresult, add a key that controls whether or not it is returned, and also add a warning about memory concerns.

Well, I don't know if it's an absolute must. What I needed, for example, was a different p-value cutoff than the default < 0.05. To get this, I had to search through the StatsModels, GLM, etc. codebases. Luckily, Survival.jl had:

pvalues = 2 .* Distributions.cdf.(Distributions.Normal(), -abs.(z_scores))

and GLM returns the z-scores, which you can extract by parsing the printed output 😅.
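For reference, a self-contained version of that workaround (the z_scores values here are placeholders; in practice they come from the fitted model):

using Distributions

z_scores = [1.96, 2.58]  # placeholder values, normally extracted from the fitted model
pvalues = 2 .* cdf.(Normal(), -abs.(z_scores))  # two-sided p-values, ≈ [0.05, 0.0099]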

So I understand that most of our problems here are caused by GLM having a weird API from the perspective of machine learning (storing data, not having many keyword arguments, having very elaborate types).

Maybe we should just let it be? My problem is solved for now by manually calculating p-values and parsing the text. Again, apologies for the noise.
