
[BUG] ResidualDouble regressor does not handle transforms correctly. #492

Open
meh2135 opened this issue Nov 13, 2024 · 8 comments · May be fixed by #503
Labels
bug module:regression probabilistic regression module

Comments

meh2135 commented Nov 13, 2024

Describe the bug
In skpro.regression.residual.ResidualDouble, for the argument residual_trafo="squared", we should take the square root of the scale estimate before feeding it into the scale parameter, possibly followed by a further, distribution-specific correction.

For instance, if the error distribution is X ~ t(mu=0, dof=v) with scale sigma, then E[|X|] = ComplicatedFunc(v) * sigma, while E[X^2] = Var(X) = sigma^2 * v / (v - 2).

derivation from Stack Exchange
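
To make the "squared" case concrete, here is a minimal sketch of the intended back-transform (my own illustration under the t assumption above, not existing skpro code; the helper name is made up):

import numpy as np

def t_sigma_from_mean_sq(mean_sq_resid, v):
    # For X ~ t(df=v, scale=sigma) with v > 2, E[X^2] = Var(X) = sigma^2 * v / (v - 2),
    # so recovering sigma needs a sqrt plus a df-specific factor.
    return np.sqrt(mean_sq_resid * (v - 2) / v)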

For arbitrary transformations, this quickly gets tricky, even for those limited distributions. Further, arbitrary transformations can't easily be checked for monotonicity or positivity. I would advise removing that capability.

To Reproduce

Running the following should yield uniform quantiles, but that totally breaks down for the squared transform.

from sklearn.model_selection import KFold
from matplotlib import pyplot as plt
import lightgbm as lgb
import numpy as np
import seaborn as sns
import pandas as pd
from scipy import stats

from skpro.regression.residual import ResidualDouble
nn = 100_000
x_df = pd.DataFrame({"a":np.random.randn(nn), "b":np.random.randn(nn), "c":np.random.randn(nn)}).clip(-2,2)
degrees_of_freedom = 3.5
var_adj = (degrees_of_freedom / (degrees_of_freedom - 2))
# DGP
loc_param_vec = pd.Series({"a":-1, "b":1, "c":0})
log_scale_param_vec = pd.Series({"a":0, "b":0.01, "c":0.5})
loc_vec = x_df.dot(loc_param_vec)
log_scale_vec = x_df.dot(log_scale_param_vec)
dist = stats.t(df=degrees_of_freedom, loc=loc_vec, scale=np.exp(log_scale_vec))
y = pd.DataFrame(dist.rvs((2, nn)).T, index=x_df.index, columns=["y0", "y1"])
abs_reg = ResidualDouble(
    estimator=lgb.LGBMRegressor(verbosity=-1),
    estimator_resid=lgb.LGBMRegressor(verbosity=-1),
    distr_params={"df": degrees_of_freedom},
    distr_type="t",
    residual_trafo="absolute",
    cv=KFold(n_splits=3),
)
sq_reg = ResidualDouble(
    estimator=lgb.LGBMRegressor(verbosity=-1),
    estimator_resid=lgb.LGBMRegressor(verbosity=-1),
    distr_params={"df": degrees_of_freedom},
    distr_type="t",
    residual_trafo="squared",
    cv=KFold(n_splits=3),
)

abs_reg.fit(x_df, y["y0"])
sq_reg.fit(x_df, y["y0"])
abs_pred = abs_reg.predict_proba(x_df)
sq_pred = sq_reg.predict_proba(x_df)
abs_cdf = abs_pred.cdf(y[["y1"]])["y0"]
sq_cdf = sq_pred.cdf(y[["y1"]])["y0"]
cdf_df = pd.DataFrame({"abs": abs_cdf, "sq": sq_cdf})
sns.histplot(cdf_df, multiple="dodge", bins=10)
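
As a quantitative companion to the histogram (my addition, not part of the original snippet), the PIT values could also be tested for uniformity, e.g. with a Kolmogorov-Smirnov test:

from scipy import stats as sps

# If the predictive distributions were calibrated, the PIT values would be
# approximately Uniform(0, 1); a large KS statistic flags the miscalibration.
print(sps.kstest(cdf_df["abs"], "uniform"))
print(sps.kstest(cdf_df["sq"], "uniform"))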

Confusion

I don't understand why it does yield uniform quantiles for the absolute transform for the t distribution. I would expect them to deviate from uniform because there is no degrees of freedom correction. Can anyone explain why that works?

Tacking the following code onto the end of the block above, we see that "raw" is the most uniform. That makes no sense to me.

from scipy.special import gamma
def abs_t_correction(dof):
    # ratio E[|X|] / SD(X) for a Student's t with `dof` degrees of freedom
    return 2 * np.sqrt(dof - 2) * gamma((dof + 1) / 2) / (np.sqrt(np.pi) * (dof - 1) * gamma(dof / 2))

dist = stats.t(df=degrees_of_freedom, loc=loc_vec, scale=np.exp(log_scale_vec))
dist_div = stats.t(df=degrees_of_freedom, loc=loc_vec, scale=np.exp(log_scale_vec) / abs_t_correction(degrees_of_freedom))
dist_mult = stats.t(df=degrees_of_freedom, loc=loc_vec, scale=np.exp(log_scale_vec) * abs_t_correction(degrees_of_freedom))
all_3_cdf = abs_pred.cdf(pd.DataFrame({"div":dist_div.rvs(nn), "mult":dist_mult.rvs(nn), "raw":dist.rvs(nn)}, columns=["div", "mult", "raw"]))
all_3_cdf.columns = ["div", "mult", "raw"]
sns.histplot(all_3_cdf, multiple="dodge", bins=10)
@meh2135 meh2135 added the bug label Nov 13, 2024

meh2135 commented Nov 14, 2024

Working on a fix here: https://github.com/meh2135/skpro

meh2135 commented Nov 14, 2024

After thinking about it a bit, I'm even more convinced that there is no good way to reasonably estimate a scale parameter by letting the user specify an arbitrary transform of the residual and modeling that.

I'm going to remove that option in my PR.

Re the adjustment for the absolute transform for Student's t, you can see the issue here:

from skpro.distributions import TDistribution
from scipy import stats
from matplotlib import pyplot as plt
import numpy as np
n_samples = 40_000
dof_vec = np.linspace(2.1, 7.0, 20)
ratio_vec = []
for dof in dof_vec:
    # Generate data from a t-distribution
    gen_dist = stats.t(df=dof, loc=0.0, scale=2.0)
    gen_samp = gen_dist.rvs(n_samples)
    # Estimate sigma from the data using the naive approach used in ResidualDouble, i.e. sigma = mean(abs(y))
    naive_sigma_est = np.abs(gen_samp).mean()
    # Plug that into a skpro t-distribution
    naive_skpro_dist = TDistribution(df=dof, mu=0.0, sigma=naive_sigma_est)
    # Sample from that skpro distribution
    naive_skpro_samples = naive_skpro_dist.sample(n_samples)
    # Calculate the variance of the data and the variance of the skpro samples
    gen_var = gen_samp.var()
    naive_skpro_var = naive_skpro_samples[0].var()
    # print(f"Naive sigma estimate: {naive_sigma_est:.4f}")
    # print(f"Naive skpro var: {naive_skpro_var:.4f}")
    # print(f"Ratio of variances: {naive_skpro_var / gen_var:.4f}")
    ratio_vec.append(naive_skpro_var / gen_var)

plt.figure()
plt.plot(dof_vec, ratio_vec)

plt.figure()
plt.plot(dof_vec, np.sqrt(ratio_vec))
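
For reference, a hedged sketch (my own addition, using the E[|X|] identity for the t distribution; mean_abs_to_sigma is a hypothetical helper, not existing ResidualDouble code) of the correction that brings the variance ratio back to roughly 1:

from scipy.special import gamma

def mean_abs_to_sigma(mean_abs, v):
    # invert E[|X|] = sigma * 2 * sqrt(v) * Gamma((v+1)/2) / (sqrt(pi) * (v-1) * Gamma(v/2))
    c = 2 * np.sqrt(v) * gamma((v + 1) / 2) / (np.sqrt(np.pi) * (v - 1) * gamma(v / 2))
    return mean_abs / c

# e.g. for the last dof from the loop above
corrected_sigma = mean_abs_to_sigma(naive_sigma_est, dof)
corrected_var = corrected_sigma ** 2 * dof / (dof - 2)
print(corrected_var / gen_var)  # should be close to 1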

I don't see the correction being made anywhere in the ResidualDouble code. That said, the quantiles that come out of it look decent, so maybe I am missing something.

FYI: @fkiraly

fkiraly commented Nov 16, 2024

@meh2135, thanks a lot for your report and work on this!

I agree that the inverse transform is missing in _predict - I think we should add this for all cases. This is very astutely observed, really great!

However, I would split the work of adding back the missing inverse transform from modifications to the algorithm.

There is a simple software engineering reason for that: we should not change anything in the package that may break user code, at least not without giving warnings for a few months before making the change.
Here is the change/deprecation management guide: https://www.sktime.net/en/latest/developer_guide/deprecation.html

Removing an option would be such a change, as it could break the code of users who use the option.

On the content side, that is, regarding the algorithm itself: I get what you are saying about violated statistical assumptions, but we should keep in mind that the algorithm is a simple heuristic and a common "rule of thumb" estimator. Of course there are better ways to estimate the residual distribution under statistical assumptions.

For this, I would recommend discussing the target state more systematically first, perhaps by writing down a mock docstring of what you think the "best end state" should be. If you implement first, we might end up writing unnecessary code should we agree on a different docstring in the end.

fkiraly commented Nov 16, 2024

I don't understand why it does yield uniform quantiles for the absolute transform for the t distribution. I would expect them to deviate from uniform because there is no degrees of freedom correction. Can anyone explain why that works?

Regarding this, I think this is simply a model mismatch issue. Your data were not t-distributed residuals to start with, so it may or may not be that a deviation from an incorrect assumption (t-distribution with certain df) leads to a better fit!

That is also related to what I was saying about this algorithm being a heuristic. In real life nothing is ever exactly normally or t-distributed for single samples (large-sample statistics do approximately follow limit distributions, etc.). So, if you have model mismatch anyway, one or the other version of a heuristic under model mismatch may be better.

The "machine learner" approach would be to take algorithms as they are and not be too concerned with "wrong statistical assumptions", but do stringent statistics for empirical performance evaluation.

@fkiraly fkiraly added the module:regression probabilistic regression module label Nov 16, 2024

fkiraly commented Nov 30, 2024

@meh2135, are you still working on this? Would be really nice to have!

Feel free to make a PR with an incomplete version, and someone else can complete it in case you do not have time to finish.

meh2135 commented Dec 2, 2024

@fkiraly hey sorry I've been so slow! I'm onsite at my job these past few weeks, but I'll be home next week and should have bandwidth to finish it then.

fkiraly commented Dec 2, 2024

No problem, just checking that it is not just a forgotten press of the pull request button or similar. No rush, and thanks for contributing!

@meh2135 meh2135 linked a pull request Dec 6, 2024 that will close this issue

meh2135 commented Dec 6, 2024

Draft PR ready for review
