[ENH] native implementation of Johnson QPD family, explicit pdf #327
Conversation
License-wise no concerns from my side.
Looking forward to the review! This PR should be considered together with #330, which implements an improved integral heuristic for the mean and variance. I think the heuristic in #330 is better than what you had in the distribution (first take the derivative of the cdf to obtain the pdf, then integrate against x) and better than the current base class heuristic (Monte Carlo, based on sampling). Besides the fact that the QPDs rely on heuristics for their mean and variance, there is no dependence between the two PRs, so they can be reviewed and worked on independently.
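For illustration, a minimal sketch of the two kinds of heuristic being discussed, shown on a plain `scipy` distribution rather than the actual skpro or #330 code:

```python
# Illustrative sketch only: an integral heuristic (differentiate the cdf to get
# a pdf, then integrate x against it) versus a Monte Carlo heuristic, on a
# scipy distribution standing in for a QPD.
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import lognorm

dist = lognorm(s=0.5)

# integral heuristic: pdf from numerical derivative of cdf, then E[X] = int x f(x) dx
x = np.linspace(dist.ppf(1e-6), dist.ppf(1 - 1e-6), 10_000)
pdf_from_cdf = np.gradient(dist.cdf(x), x)
mean_integral = trapezoid(x * pdf_from_cdf, x)

# Monte Carlo heuristic: average of random samples
mean_mc = dist.rvs(size=100_000, random_state=42).mean()

print(mean_integral, mean_mc, dist.mean())  # all three should be close
```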
Merging for the release, to avoid the `findiff` dependency.
Deprecation messages for `CyclicBoosting`, for deprecations and changes in 2.4.0. Alternative to #320.

* restores the sequence of parameters to the sequence in version 2.2.2, to avoid breakage in the 2.3.0 release
* adds a deprecation message for the `bound` parameter
* renames the new `version` parameter to `dist_type` - the name is a bit misleading, as it will probably imply "version" in the semantic versioning sense to an ordinary user. This does not require deprecation, because it was added since 2.2.2.

Also makes improvements to the docstring of the `CyclicBoosting` estimator.

Difference to #320: does not carry out deprecations for distributions, under the working assumption that #327 will be merged.
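A hypothetical sketch of what the deprecation message for the `bound` parameter could look like (the actual wording and placement in `CyclicBoosting` may differ):

```python
# Hypothetical sketch; not the actual CyclicBoosting code.
from warnings import warn

def _warn_if_bound_set(bound=None):
    """Emit the deprecation message for the deprecated `bound` parameter."""
    if bound is not None:
        warn(
            "the 'bound' parameter of CyclicBoosting is deprecated and is "
            "scheduled for removal or change in skpro 2.4.0",
            DeprecationWarning,
            stacklevel=2,
        )
```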
```python
version=version,
)
super().__init__(index=index, columns=columns)
in_sinh = np.arcsinh(phi.ppf(p) * delta) + np.arcsinh(n * c * delta)
```
In most cases, you probably want to do an `np.outer` rather than an element-wise multiplication here. (In the Cyclic Boosting implementation I made that optional.)
Not in most cases imo, but I do see how this would be useful in many cases. This is, however, a framework-level concern, so I moved it into a new issue: #341. Btw, if an `np.ndarray` of dim 3 could be passed, the numpy-native upcasting would already provide the kind of behaviour that is described in #341.
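To make the distinction concrete, a small illustrative sketch (not code from this PR) of element-wise multiplication versus the outer-product behaviour discussed here and in #341:

```python
# Element-wise multiplication pairs p[i] with delta[i]; an outer product
# evaluates every p against every delta, as would be needed to evaluate a
# vector of probability points across all distribution rows at once.
import numpy as np
from scipy.stats import norm

p = np.array([0.1, 0.5, 0.9])      # probability points, shape (3,)
delta = np.array([0.5, 1.0, 2.0])  # one delta per distribution row, shape (3,)

elementwise = norm.ppf(p) * delta     # shape (3,)
outer = np.outer(norm.ppf(p), delta)  # shape (3, 3)

# the same outer-product behaviour via broadcasting with a new axis,
# i.e. the ndarray upcasting mentioned above
outer_bc = norm.ppf(p)[:, np.newaxis] * delta[np.newaxis, :]
assert np.allclose(outer, outer_bc)
```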
```python
phi = self.phi
in_arcsinh = np.log((x - lower) / theta) / kappa
```
In most cases, you probably want to do an `np.outer` rather than an element-wise multiplication here. (In the Cyclic Boosting implementation I made that optional.)
```python
cdf = cdf[:, np.newaxis]
var = var_func(x, mean, cdf.T, index.shape[0])
return var
in_cdf = xi + kappa * np.sinh(delta * (phi.ppf(p) + n * c))
```
In most cases, you probably want to do an outer product (e.g., by means of `p[:, np.newaxis]`) rather than an element-wise multiplication here. (In the Cyclic Boosting implementation I made that optional.)
```python
phi = self.phi
phi_ppf = phi.ppf((x - lower) / rnge)
```
In most cases, you probably want to do an outer product (e.g., by means of `p[:, np.newaxis]`) rather than an element-wise multiplication here. (In the Cyclic Boosting implementation I made that optional.)
Drop the sinhlogistic version from the docstring; it was only used in the extended QPD versions (which we dropped).
There should be no upper bound for QPD_S.
I would suggest dropping QPD_U, because this is one of the extended QPD versions. I'm not sure it works properly in the given implementation.
```python
ppf_part = ppf_all[t][c]
ppf[r][c] = ppf_part
return ppf
# we work through the chain rule for the entire nested expression in cdf
```
I didn't check the math, but I assume you have compared the result with the numeric version?
I didn't check all details of the reimplementation, but I assume you compared results to make sure it is consistent. Looks good in general. A few comments inline.
The upper bound was already there, in the distribution as initially implemented. We can of course remove it, as I am not sure where it is used. I left it since I was apprehensive of breakages coming from public access of the interface; we will have to remove it via the deprecation process.
Yes, spot checks only though. There is a test that checks against the case where we expect a uniform distribution.
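For reference, this kind of spot check can be sketched as follows, using a generic nested cdf of a similar shape rather than the exact QPD formulas:

```python
# Illustrative spot check (not the exact QPD formulas): compare an analytic
# chain-rule pdf against a numerical derivative of the same nested cdf
# F(x) = Phi(arcsinh(log(x - lower) / kappa)).
import numpy as np
from scipy.stats import norm

lower, kappa = 1.0, 0.8

def cdf(x):
    v = np.log(x - lower) / kappa
    return norm.cdf(np.arcsinh(v))

def pdf_chain_rule(x):
    # chain rule: d/dx Phi(u(v(x))) = phi(u) * du/dv * dv/dx
    v = np.log(x - lower) / kappa
    u = np.arcsinh(v)
    du_dv = 1.0 / np.sqrt(1.0 + v**2)
    dv_dx = 1.0 / (kappa * (x - lower))
    return norm.pdf(u) * du_dv * dv_dx

x = np.linspace(1.1, 10.0, 2001)
pdf_numeric = np.gradient(cdf(x), x)  # numerical derivative, no findiff needed
assert np.allclose(pdf_chain_rule(x)[1:-1], pdf_numeric[1:-1], atol=1e-3)
```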
This was introduced for the use of `findiff`; it can be removed if `findiff` is not required.
This PR reworks the family of QPD distributions for efficiency and to allow removal of the newly introduced dependency `findiff` in #232.

The dependency `findiff` was introduced for approximation of `pdf`, but in fact it is unnecessary, as the `pdf` can be analytically derived by applying the chain rule. True, it has to be applied three or four times, but it's still the chain rule... The efficiency and accuracy gains are significant, and it helps us avoid computing numerical derivatives for all entries in a large matrix, together with the now unnecessary `findiff` dependency.

Makes the following changes:
* rework to use `skpro` machinery:
  * `skpro` native parameter broadcasting system instead of ad-hoc broadcasting
  * `skpro` native approximation for `mean`, `var`, instead of three copies of similar (and partially duplicative) approximation inside the distributions' `_prep_qpd_vars`
* `cdf`, `ppf` of the three distributions based on the `cyclic_boosting` reference
* `pdf`, as derivative of `cdf`
As side effects of the rework:

* vectorization over `alpha`, `lower`, `upper`, which previously was not possible
* `version` (the base distribution) can now be an arbitrary continuous `scipy` distribution
* the `pdf` is numerically exact

Regarding the relation to `cyclic_boosting`:

* the starting point was removal of the `findiff` dependency for computing the `pdf` from the `cdf` that was introduced in [ENH] update Johnson QPDistributions with bugfixes and vectorization (cyclic-boosting ver.1.4.0) #232, as well as cleanup before release. I ended up simplifying a lot, ending up here. In this sense, the work of @setoguchi-naoki was crucial in arriving at this point.
* the distribution logic is now native to skpro and no longer depends on `cyclic_boosting`. We can even restore the dependency and maintain the distribution logic in `cyclic_boosting` if that were your preference, e.g., for ownership reasons.
findiff
dependency before the next release.
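To illustrate the "applied three or four times" point, a small symbolic sketch on a generically nested cdf (not the exact QPD cdf), where the chain rule is carried out analytically:

```python
# Illustrative only: sympy differentiates a nested cdf of the generic form
# Phi(sinh(arcsinh(log(x)) - s) / d), applying the chain rule through every
# layer, analogous to working through the nested QPD cdf expressions by hand.
import sympy as sp

x, s, d = sp.symbols("x s d", positive=True)
Phi = lambda z: (1 + sp.erf(z / sp.sqrt(2))) / 2  # standard normal cdf

cdf = Phi(sp.sinh(sp.asinh(sp.log(x)) - s) / d)
pdf = sp.simplify(sp.diff(cdf, x))  # closed-form pdf, no numerical derivatives
print(pdf)
```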