
RFC: Return fitted model as fitresult #40

Closed
rikhuijzer opened this issue Jan 9, 2024 · 3 comments

rikhuijzer (Member) commented Jan 9, 2024

I tried to use a GLM/StatsBase function to extract information from the fitted model. However, this is impossible because this package does not return the original GLM model (fitted_lm is dropped):

function MMI.fit(model::LinearRegressor, verbosity::Int, X, y, w=nothing)
    # apply the model
    Xmatrix, offset, features = prepare_inputs(model, X)
    y_ = isempty(offset) ? y : y .- offset
    wts = check_weights(w, y_)
    data = glm_data(model, Xmatrix, y_, features)
    form = glm_formula(model, features)
    fitted_lm = GLM.lm(form, data; model.dropcollinear, wts).model
    fitresult = FitResult(
        GLM.coef(fitted_lm), GLM.dispersion(fitted_lm), (features = features,)
    )
    # form the report
    report = glm_report(fitted_lm, features, model.report_keys)
    cache = nothing
    # return
    return fitresult, cache, report
end

This is fine for most use cases, but it has one problem: Julia relies heavily on the fact that people can "attach" arbitrary functions to certain objects. In this case, for example, GLM returns a

StatsModels.TableRegressionModel{GLM.LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

object when calling GLM.lm(@formula(y ~ x), data). Subsequently, people can call functions like

GLM.confint(fitted_model)

to get confidence intervals for the coefficients. Currently, GLM.confint cannot be called because the fitted model is dropped by the MLJ interface.
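For concreteness, here is a minimal, self-contained sketch of that kind of usage with GLM directly (the toy data is made up for illustration):

using DataFrames, GLM

# Toy data for illustration only.
data = DataFrame(x = randn(100))
data.y = 2 .* data.x .+ randn(100)

fitted_model = GLM.lm(@formula(y ~ x), data)
GLM.coef(fitted_model)     # coefficient estimates
GLM.confint(fitted_model)  # 95% confidence intervals for the coefficients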

Can we switch MLJGLMInterface over to just returning the fitted model as the fitresult (specifically, fitresult = fitted_lm)? I suggest fully dropping the FitResult struct that is defined in this package. Clients can still obtain information such as the coefficients via the appropriate GLM functions, such as GLM.coef. We can add this and a few other functions to the docstring.
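As a rough sketch, the fit method quoted above would then look something like this (the only change is that the raw fitted model is returned instead of the FitResult wrapper; this is an illustration of the idea, not a worked-out implementation):

function MMI.fit(model::LinearRegressor, verbosity::Int, X, y, w=nothing)
    # apply the model (unchanged from the current implementation)
    Xmatrix, offset, features = prepare_inputs(model, X)
    y_ = isempty(offset) ? y : y .- offset
    wts = check_weights(w, y_)
    data = glm_data(model, Xmatrix, y_, features)
    form = glm_formula(model, features)
    fitted_lm = GLM.lm(form, data; model.dropcollinear, wts).model
    # form the report
    report = glm_report(fitted_lm, features, model.report_keys)
    cache = nothing
    # return the raw fitted model as the fitresult
    return fitted_lm, cache, report
end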

Because this would be a breaking release, I also suggest moving to version 1.0.0, so that future releases can specify whether it's a major, minor, or patch release (https://semver.org).

@ablaom Any thoughts?

rikhuijzer changed the title from "RFC: Just return fitted model as fitresult" to "RFC: Return fitted model as fitresult" on Jan 9, 2024
ablaom (Member) commented Jan 10, 2024

Mmm. I sense your frustration and the simplicity of your suggestion is naturally appealing. But I, for one, have concerns:

1. If I remember correctly, the reason we dropped the GLM object in the first place was that it is not free of training data. This is generally considered bad practice - for example, it is problematic when serializing models trained on big datasets. Do I remember correctly?

2. Under the proposed revision, the casual user who just wants confidence intervals, for example, needs to be familiar with the GLM API to get them. This seems a pity to someone like me, who doesn't use these models often.

3. A lot of development effort has gone into striking the right balance in what is returned by fitted_params. I don't remember all of the discussion, but I should like to see a review of that.

What if we return the raw StatsModels.TableRegressionModel as part of the report? (It may be that the report is also retained when preparing for serialization, but there are other good arguments for not doing so.)

@OkonSamuel may want to comment.

OkonSamuel (Member) commented Jan 10, 2024

@rikhuijzer all of those should have been moved to the model's report. Keeping the glm_model object caused some concerns (see #16), as the model wraps the dataset itself. The model has hyperparameters that you could use to specify what you want to extract from the glm_model.

If it's an absolute must to have the whole glm_object returned, we could add it to the report rather than the fitresult, add a key that controls whether or not it is returned, and also add a warning about memory concerns.
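To illustrate, a hypothetical sketch of the user-facing side (the key name :glm_model is made up here, and the exact mechanism would need to be designed; X is any Tables.jl table and y a vector, as in the usual MLJ workflow):

using MLJ, MLJGLMInterface, GLM

# Hypothetical: ask the interface to also place the raw fitted GLM object in the report.
model = LinearRegressor(report_keys = [:coef_table, :glm_model])
mach = machine(model, X, y)
fit!(mach)

raw = report(mach).glm_model  # raw fitted model, only present when requested
GLM.confint(raw)              # StatsAPI/GLM functions now work directly on it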

rikhuijzer (Member, Author) commented Jan 10, 2024

I sense your frustration and the simplicity of your suggestion is naturally appealing. But I, for one, have concerns:

Yes, I was editing a LaTeX document, while running PDF generation, while thinking about the analysis, while having Pluto open, while reading supervisor comments, while going through StatsBase and GLM. I should probably not have filed an issue immediately and instead let it sit for a day! My apologies.

What if we return the raw StatsModels.TableRegressionModel as part of the report?

Yes, I do think that could work. That would already make some extra functions available, I think.

If it's an absolute must to have the whole glm_object returned, we could add it to the report rather than the fitresult, add a key that controls whether or not it is returned, and also add a warning about memory concerns.

Well, I don't know if it's an absolute must. What I needed, for example, was a different p-value cutoff than the default < 0.05. To get this, I had to search through the StatsModels, GLM, etc. codebases. Luckily, Survival.jl had:

pvalues = 2 .* Distributions.cdf.(Distributions.Normal(), -abs.(z_scores))

and GLM returns the z-scores, which you can extract by parsing the printed output 😅.
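For reference, a self-contained version of that workaround (the z_scores values here are placeholders; in practice they come from the fitted model):

using Distributions

z_scores = [1.96, 2.58]  # placeholder values, normally extracted from the fitted model
pvalues = 2 .* cdf.(Normal(), -abs.(z_scores))  # two-sided p-values, ≈ [0.05, 0.0099]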

So I understand that most of our problems here are caused by GLM having a weird API from the perspective of machine learning (storing data, not having many keyword arguments, having very elaborate types).

Maybe we should just let it be? My problem is solved for now by manually calculating p-values and parsing the text. Again, apologies for the noise.
