
[BUG] ResidualDouble regressor does not handle transforms correctly. #492

Open
meh2135 opened this issue Nov 13, 2024 · 8 comments · May be fixed by #503
Labels
bug module:regression probabilistic regression module

Comments

meh2135 commented Nov 13, 2024

Describe the bug
In skpro.regression.residual.ResidualDouble, for the argument residual_trafo="squared", we should take the square root of the scale estimate before feeding it into the scale parameter, possibly followed by a further, distribution-specific correction.

For instance, if the error distribution is X ~ t(mu=0, dof=v) with scale sigma, then E[|X|] = ComplicatedFunc(v) * sigma, while E[X^2] = Var(X) = sigma^2 * v / (v - 2).

derivation from Stack Exchange
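
To make the "squared" case concrete, here is a minimal sketch of the intended back-transform (my own illustration under the t assumption above, not existing skpro code; the helper name is made up):

import numpy as np

def t_sigma_from_mean_sq(mean_sq_resid, v):
    # For X ~ t(df=v, scale=sigma) with v > 2, E[X^2] = Var(X) = sigma^2 * v / (v - 2),
    # so recovering sigma needs a sqrt plus a df-specific factor.
    return np.sqrt(mean_sq_resid * (v - 2) / v)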

For arbitrary transformations, this quickly gets tricky, even for those limited distributions. Further, arbitrary transformations can't easily be checked for monotonicity or positivity. I would advise removing that capability.

To Reproduce

Running the following should yield uniform quantiles, but that totally breaks down for the squared transform.

from sklearn.model_selection import KFold
from matplotlib import pyplot as plt
import lightgbm as lgb
import numpy as np
import seaborn as sns
import pandas as pd
from scipy import stats

from skpro.regression.residual import ResidualDouble
nn = 100_000
x_df = pd.DataFrame({"a":np.random.randn(nn), "b":np.random.randn(nn), "c":np.random.randn(nn)}).clip(-2,2)
degrees_of_freedom = 3.5
var_adj = (degrees_of_freedom / (degrees_of_freedom - 2))
# DGP
loc_param_vec = pd.Series({"a":-1, "b":1, "c":0})
log_scale_param_vec = pd.Series({"a":0, "b":0.01, "c":0.5})
loc_vec = x_df.dot(loc_param_vec)
log_scale_vec = x_df.dot(log_scale_param_vec)
dist = stats.t(df=degrees_of_freedom, loc=loc_vec, scale=np.exp(log_scale_vec))
y = pd.DataFrame(dist.rvs((2, nn)).T, index=x_df.index, columns=["y0", "y1"])
abs_reg = ResidualDouble(
    estimator=lgb.LGBMRegressor(verbosity=-1),
    estimator_resid=lgb.LGBMRegressor(verbosity=-1),
    distr_params={"df": degrees_of_freedom},
    distr_type="t",
    residual_trafo="absolute",
    cv=KFold(n_splits=3),
)
sq_reg = ResidualDouble(
    estimator=lgb.LGBMRegressor(verbosity=-1),
    estimator_resid=lgb.LGBMRegressor(verbosity=-1),
    distr_params={"df": degrees_of_freedom},
    distr_type="t",
    residual_trafo="squared",
    cv=KFold(n_splits=3),
)

abs_reg.fit(x_df, y["y0"])
sq_reg.fit(x_df, y["y0"])
abs_pred = abs_reg.predict_proba(x_df)
sq_pred = sq_reg.predict_proba(x_df)
abs_cdf = abs_pred.cdf(y[["y1"]])["y0"]
sq_cdf = sq_pred.cdf(y[["y1"]])["y0"]
cdf_df = pd.DataFrame({"abs": abs_cdf, "sq": sq_cdf})
sns.histplot(cdf_df, multiple="dodge", bins=10)
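
As a quantitative companion to the histogram (my addition, not part of the original snippet), the PIT values could also be tested for uniformity, e.g. with a Kolmogorov-Smirnov test:

from scipy import stats as sps

# If the predictive distributions were calibrated, the PIT values would be
# approximately Uniform(0, 1); a large KS statistic flags the miscalibration.
print(sps.kstest(cdf_df["abs"], "uniform"))
print(sps.kstest(cdf_df["sq"], "uniform"))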

Confusion

I don't understand why it does yield uniform quantiles for the absolute transform for the t distribution. I would expect them to deviate from uniform because there is no degrees of freedom correction. Can anyone explain why that works?

Tacking the following code onto the end of the block above, we see that "raw" is the most uniform. That makes no sense to me.

from scipy.special import gamma
def abs_t_correction(dof):
    # ratio E[|X|] / SD(X) for a Student's t with `dof` degrees of freedom
    return 2 * np.sqrt(dof - 2) * gamma((dof + 1) / 2) / (np.sqrt(np.pi) * (dof - 1) * gamma(dof / 2))

dist = stats.t(df=degrees_of_freedom, loc=loc_vec, scale=np.exp(log_scale_vec))
dist_div = stats.t(df=degrees_of_freedom, loc=loc_vec, scale=np.exp(log_scale_vec) / abs_t_correction(degrees_of_freedom))
dist_mult = stats.t(df=degrees_of_freedom, loc=loc_vec, scale=np.exp(log_scale_vec) * abs_t_correction(degrees_of_freedom))
all_3_cdf = abs_pred.cdf(pd.DataFrame({"div":dist_div.rvs(nn), "mult":dist_mult.rvs(nn), "raw":dist.rvs(nn)}, columns=["div", "mult", "raw"]))
all_3_cdf.columns = ["div", "mult", "raw"]
sns.histplot(all_3_cdf, multiple="dodge", bins=10)
@meh2135 meh2135 added the bug label Nov 13, 2024

meh2135 commented Nov 14, 2024

Working on a fix here: https://github.com/meh2135/skpro

meh2135 commented Nov 14, 2024

After thinking about it a bit, I'm even more convinced that there is no good way to reasonably estimate a scale parameter by letting the user specify an arbitrary transform of the residual and modeling that.

I'm going to remove that option in my PR.

Re the adjustment for the absolute transform for Student's t, you can see the issue here:

from skpro.distributions import TDistribution
from scipy import stats
from matplotlib import pyplot as plt
import numpy as np
n_samples = 40_000
dof_vec = np.linspace(2.1, 7.0, 20)
ratio_vec = []
for dof in dof_vec:
    # Generate data from a t-distribution
    gen_dist = stats.t(df=dof, loc=0.0, scale=2.0)
    gen_samp = gen_dist.rvs(n_samples)
    # Estimate sigma from the data using the naive approach used in ResidualDouble, i.e. sigma = mean(abs(y))
    naive_sigma_est = np.abs(gen_samp).mean()
    # Plug that into a skpro t-distribution
    naive_skpro_dist = TDistribution(df=dof, mu=0.0, sigma=naive_sigma_est)
    # Sample from that skpro distribution
    naive_skpro_samples = naive_skpro_dist.sample(n_samples)
    # Calculate the variance of the data and the variance of the skpro samples
    gen_var = gen_samp.var()
    naive_skpro_var = naive_skpro_samples[0].var()
    # print(f"Naive sigma estimate: {naive_sigma_est:.4f}")
    # print(f"Naive skpro var: {naive_skpro_var:.4f}")
    # print(f"Ratio of variances: {naive_skpro_var / gen_var:.4f}")
    ratio_vec.append(naive_skpro_var / gen_var)

plt.figure()
plt.plot(dof_vec, ratio_vec)

plt.figure()
plt.plot(dof_vec, np.sqrt(ratio_vec))
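
For reference, a hedged sketch (my own addition, using the E[|X|] identity for the t distribution; mean_abs_to_sigma is a hypothetical helper, not existing ResidualDouble code) of the correction that brings the variance ratio back to roughly 1:

from scipy.special import gamma

def mean_abs_to_sigma(mean_abs, v):
    # invert E[|X|] = sigma * 2 * sqrt(v) * Gamma((v+1)/2) / (sqrt(pi) * (v-1) * Gamma(v/2))
    c = 2 * np.sqrt(v) * gamma((v + 1) / 2) / (np.sqrt(np.pi) * (v - 1) * gamma(v / 2))
    return mean_abs / c

# e.g. for the last dof from the loop above
corrected_sigma = mean_abs_to_sigma(naive_sigma_est, dof)
corrected_var = corrected_sigma ** 2 * dof / (dof - 2)
print(corrected_var / gen_var)  # should be close to 1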

I don't see the correction being made anywhere in the ResidualDouble code. That said, the quantiles that come out of it look decent, so maybe I am missing something.

FYI: @fkiraly

fkiraly commented Nov 16, 2024

@meh2135, thanks a lot for your report and work on this!

I agree that the inverse transform is missing in _predict - I think we should add this for all cases. This is very astutely observed, really great!

However, I would split the work of adding back the missing inverse transform from modifications to the algorithm.

There is a simple software engineering reason for that: we should not change anything in the package that may break user code, at least not without giving warnings for a few months before making the change.
Here is the change/deprecation management guide: https://www.sktime.net/en/latest/developer_guide/deprecation.html

Removing an option would be such a change, as it could break the code of users who use the option.

On the content side, that is, regarding the algorithm itself: I get what you are saying about violated statistical assumptions, but we should keep in mind that the algorithm is a simple heuristic and a common "rule of thumb" estimator. Of course there are better ways to estimate the residual distribution under statistical assumptions.

For this, I would recommend discussing the target state more systematically first, perhaps by writing down a mock docstring of what you think the "best end state" should be. If you implement first, we might end up writing unnecessary code should we agree on a different docstring in the end.

fkiraly commented Nov 16, 2024

I don't understand why it does yield uniform quantiles for the absolute transform for the t distribution. I would expect them to deviate from uniform because there is no degrees of freedom correction. Can anyone explain why that works?

Regarding this, I think this is simply a model mismatch issue. Your data were not t-distributed residuals to start with, so it may or may not be that a deviation from an incorrect assumption (t-distribution with certain df) leads to a better fit!

That is also related to what I was saying about this algorithm being a heuristic. In real life nothing is ever exactly normally or t-distributed for single samples (large-sample statistics do approximately follow limit distributions, etc.). So, if you have model mismatch anyway, one or the other version of a heuristic under model mismatch may be better.

The "machine learner" approach would be to take algorithms as they are and not be too concerned with "wrong statistical assumptions", but do stringent statistics for empirical performance evaluation.

@fkiraly fkiraly added the module:regression probabilistic regression module label Nov 16, 2024

fkiraly commented Nov 30, 2024

@meh2135, are you still working on this? Would be really nice to have!

Feel free to make a PR with an incomplete version, and someone else can complete it in case you do not have time to finish.

meh2135 commented Dec 2, 2024

@fkiraly hey sorry I've been so slow! I'm onsite at my job these past few weeks, but I'll be home next week and should have bandwidth to finish it then.

fkiraly commented Dec 2, 2024

No problem, just checking that it is not just a forgotten press of the pull request button or similar. No rush, and thanks for contributing!

@meh2135 meh2135 linked a pull request Dec 6, 2024 that will close this issue

meh2135 commented Dec 6, 2024

Draft PR ready for review
