
[ENH] native implementation of Johnson QPD family, explicit pdf #327

Merged
merged 20 commits into from
May 16, 2024

Conversation

@fkiraly (Collaborator) commented May 15, 2024

This PR reworks the QPD family of distributions for efficiency, and to allow removal of the newly introduced findiff dependency from #232.

The findiff dependency was introduced to approximate the pdf, but it is in fact unnecessary, as the pdf can be derived analytically by applying the chain rule. True, it has to be applied three or four times, but it's still the chain rule... The efficiency and accuracy gains are significant, and it spares us from computing numerical derivatives for every entry of a large matrix, along with the now unnecessary findiff dependency.
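To illustrate the idea (not the actual skpro QPD formulas, which nest more layers), here is a minimal sketch: a cdf built from nested elementary functions in the Johnson/sinh-arcsinh style, with its pdf obtained by differentiating each layer via the chain rule, checked against a central finite difference. The specific cdf `Phi(delta * arcsinh(log(x)))` is a hypothetical stand-in chosen for brevity.

```python
import numpy as np
from scipy.stats import norm

# hypothetical nested cdf in the style of the Johnson QPD family:
# F(x) = Phi(delta * arcsinh(log(x))), for x > 0
def cdf(x, delta=1.5):
    return norm.cdf(delta * np.arcsinh(np.log(x)))

# pdf by applying the chain rule to each nested layer:
# F'(x) = phi(inner) * delta * d/dx[arcsinh(log x)]
#       = phi(inner) * delta / (x * sqrt(1 + log(x)^2))
def pdf(x, delta=1.5):
    u = np.log(x)
    return norm.pdf(delta * np.arcsinh(u)) * delta / (x * np.sqrt(1 + u**2))

# sanity check: analytic pdf agrees with a central finite difference,
# the kind of numerical derivative findiff was previously used for
x = np.linspace(0.5, 3.0, 7)
eps = 1e-6
numeric = (cdf(x + eps) - cdf(x - eps)) / (2 * eps)
assert np.allclose(pdf(x), numeric, atol=1e-6)
```

The analytic version is both exact and vectorized, so no per-entry finite differencing is needed.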

Makes the following changes:

  • refactoring of the three QPD distributions to use skpro machinery:
    • use of the skpro native parameter broadcasting system instead of ad-hoc broadcasting
    • use of the skpro native approximation for mean, var, instead of three copies of similar (and partially duplicative) approximation inside the distributions
  • refactoring between the three QPD distributions, with the aim of simplification
    • refactoring QPD parameter computation into a single, fully vectorized function, _prep_qpd_vars
  • clean room reimplementation of cdf, ppf of the three distributions based on the cyclic_boosting reference
  • new implementation of pdf, as derivative of cdf

As side effects of the rework:

  • all parameters now broadcast in numpy-like fashion, including alpha, lower, upper, which previously was not possible
  • the distributions can be 2D with more than 1 column, which previously was not possible
  • version (the base distribution) can now be an arbitrary continuous scipy distribution
  • pdf is numerically exact
  • the distributions do not have soft dependencies anymore

Regarding the relation to cyclic_boosting:

  • this is clean room reimplementation and credit is given, so I hope this is fine license-wise - @FelixWick?
  • this is the result of trying to remove the findiff dependency for computing the pdf from the cdf, introduced in [ENH] update Johnson QPDistributions with bugfixes and vectorization (cyclic-boosting ver.1.4.0) #232, as well as of cleanup before release. I ended up simplifying a lot and arrived here. In this sense, the work of @setoguchi-naoki was crucial in arriving at this point.
  • I would have no issue at all with you moving the improved code into cyclic_boosting. We can even restore the dependency and maintain the distribution logic in cyclic_boosting if that were your preference, e.g., for ownership reasons.

FYI @FelixWick, @setoguchi-naoki, review appreciated - I would like to remove the findiff dependency before the next release.

@fkiraly fkiraly added enhancement module:probability&simulation probability distributions and simulators labels May 15, 2024
fkiraly added 2 commits May 15, 2024 02:22
This reverts commit c4b4d6e.
This reverts commit 40b9c2b.
fkiraly added a commit that referenced this pull request May 15, 2024
PR #327 shows that the `findiff`
soft dependency is not necessary - with this, I hence remove it.

We should either merge #327, or leave the QPDs with a soft dependency
that users then have to install independently.
@FelixWick

License-wise no concerns from my side.
I'm also fine with keeping the code in this repository.
I will have a closer look at the code tomorrow.

@fkiraly (Collaborator, Author) commented May 15, 2024

Looking forward to the review!

This PR should be considered together with #330, which implements an improved integral heuristic for mean, var.

I think the heuristic in #330 is better than what you had in the distribution (first take derivative of cdf to obtain pdf, then integrate against x) and the current base class heuristic (Monte Carlo based on sample).

Besides the fact that the QPDs rely on heuristics for their mean and variance, there is no dependence between the two PRs, so they can be reviewed and worked on independently.
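For context, a minimal sketch of the two kinds of heuristic discussed here, using a plain scipy normal as a stand-in distribution (not skpro's actual implementation): the mean can be approximated via the standard identity E[X] = ∫₀¹ ppf(p) dp evaluated on a quantile grid, or via the Monte Carlo base-class style, sampling and averaging.

```python
import numpy as np
from scipy.stats import norm

# quadrature-style heuristic: E[X] = integral of ppf(p) over p in (0, 1);
# on a uniform, symmetric grid of quantile levels this is a simple average
p = np.linspace(0.0005, 0.9995, 1999)   # endpoints avoided (ppf diverges there)
quad_mean = norm.ppf(p, loc=2.0).mean()

# Monte Carlo heuristic: draw a sample and average it
rng = np.random.default_rng(0)
mc_mean = rng.normal(loc=2.0, size=100_000).mean()
```

The quadrature estimate is deterministic and typically needs far fewer ppf evaluations than the Monte Carlo version needs samples for comparable accuracy.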

@fkiraly (Collaborator, Author) commented May 16, 2024

Merging for the release, to avoid the findiff dependency. Please let me know if we should change anything; we can do so in the next release or, if necessary, in a hotfix.

@fkiraly fkiraly merged commit dad8793 into main May 16, 2024
34 checks passed
fkiraly added a commit that referenced this pull request May 16, 2024
Deprecation messages for `CyclicBoosting`, covering deprecations and changes
in 2.4.0.
Alternative to #320.

* restores sequence of parameters to sequence in version 2.2.2, to avoid
breakage in the 2.3.0 release
* adds deprecation message for `bound` parameter
* renames the new `version` parameter to `dist_type` - the name is a bit
misleading, as it will probably imply "version" in the semantic
versioning sense to an ordinary user. Does not require deprecation,
because it was added after 2.2.2.

Also makes improvements to the docstring of the `CyclicBoosting`
estimator.

Difference to #320: does not carry out deprecations for distributions,
under a working assumption that #327 will be merged.
in_sinh = np.arcsinh(phi.ppf(p) * delta) + np.arcsinh(n * c * delta)

In most of the cases, you probably want to do a np.outer rather than an element-wise multiplication here. (In the Cyclic Boosting implementation I made that optional.)
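To make the reviewer's point concrete, here is a small sketch of the two behaviours with hypothetical shapes: element-wise multiplication pairs each quantile level with the parameter at the same index, while np.outer evaluates every quantile level against every parameter.

```python
import numpy as np

p = np.array([0.1, 0.5, 0.9])       # quantile levels being queried
delta = np.array([0.5, 1.0, 2.0])   # one parameter value per distribution

elementwise = p * delta             # shape (3,): p[i] paired with delta[i]
pairwise = np.outer(p, delta)       # shape (3, 3): every p against every delta
```

The element-wise variant assumes the quantile levels are already broadcast to the parameter shape; the outer variant produces the full grid in one call.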

@fkiraly (Collaborator, Author) replied:

not in most cases imo, but I do see how this would be useful in many cases.

This is, however, a framework level concern, so I moved it into a new issue: #341

Btw, if an np.ndarray of dim 3 could be passed, numpy's native broadcasting would already provide the kind of behaviour that is described in #341.
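A minimal sketch of that broadcasting behaviour (illustrative shapes, not skpro's actual internals): inserting a leading axis on the quantile levels lets numpy evaluate every level against every entry of a 2D parameter array, producing a 3D result without any explicit outer product.

```python
import numpy as np

p = np.array([0.1, 0.5, 0.9])               # quantile levels
delta = np.arange(1.0, 9.0).reshape(4, 2)   # 2D parameter array (rows, cols)

# a leading axis on p and a leading axis on delta let numpy broadcast
# every quantile level against every (row, col) parameter entry
out = p[:, None, None] * delta[None, :, :]  # shape (3, 4, 2)
```

This is exactly the "outer" behaviour, obtained from numpy's broadcasting rules alone.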


phi = self.phi

in_arcsinh = np.log((x - lower) / theta) / kappa


In most of the cases, you probably want to do a np.outer rather than an element-wise multiplication here. (In the Cyclic Boosting implementation I made that optional.)

cdf = cdf[:, np.newaxis]
var = var_func(x, mean, cdf.T, index.shape[0])
return var
in_cdf = xi + kappa * np.sinh(delta * (phi.ppf(p) + n * c))


In most of the cases, you probably want to do an outer product (e.g., by means of p[:, np.newaxis]) rather than an element-wise multiplication here. (In the Cyclic Boosting implementation I made that optional.)


phi = self.phi

phi_ppf = phi.ppf((x - lower) / rnge)


In most of the cases, you probably want to do an outer product (e.g., by means of p[:, np.newaxis]) rather than an element-wise multiplication here. (In the Cyclic Boosting implementation I made that optional.)

@FelixWick

Drop the sinhlogistic version from the docstring; it was only used in the extended QPD versions (which we dropped).

@FelixWick

There should be no upper bound for QPD_S.

@FelixWick

I would suggest dropping QPD_U, because it is one of the extended QPD versions. I'm not sure it works properly in the given implementation.

ppf_part = ppf_all[t][c]
ppf[r][c] = ppf_part
return ppf
# we work through the chain rule for the entire nested expression in cdf


I didn't check the math, but I assume you have compared the result with the numeric version?

@FelixWick left a comment

I didn't check all details of the reimplementation, but I assume you compared results to make sure it is consistent. Looks good in general. A few comments inline.

@fkiraly (Collaborator, Author) commented May 17, 2024

There should be no upper bound for QPD_S.

The upper bound was already there in the distribution as initially implemented. We can of course remove it, though I am not sure where it is used. I left it because I was apprehensive of breakage from public access to the interface; we will have to remove it via the deprecation process.

@fkiraly (Collaborator, Author) commented May 17, 2024

I didn't check all details of the reimplementation, but I assume you compared results to make sure it is consistent.

Yes, spot checks only though.

There is a test that checks against the case where we expect a uniform distribution.

@setoguchi-naoki (Contributor)

There should be no upper bound for QPD_S.

The upper bound was already there in the distribution as initially implemented. We can of course remove it, though I am not sure where it is used. I left it because I was apprehensive of breakage from public access to the interface; we will have to remove it via the deprecation process.

This was introduced for the use of findiff; it can be removed if findiff is not required.
