
[ENH] native implementation of Johnson QPD family, explicit pdf #327

Merged
merged 20 commits into from
May 16, 2024

Conversation

@fkiraly (Collaborator) commented May 15, 2024

This PR reworks the QPD family of distributions for efficiency, and to allow removal of the newly introduced findiff dependency from #232.

The findiff dependency was introduced to approximate the pdf, but it is in fact unnecessary, as the pdf can be derived analytically by applying the chain rule. True, it has to be applied three or four times, but it's still the chain rule... The efficiency and accuracy gains are significant, and it spares us from computing numerical derivatives for every entry of a large matrix, along with the now unnecessary findiff dependency.
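To illustrate the idea (not the actual skpro QPD formulas, which nest more layers), here is a minimal sketch: a cdf built from nested elementary functions in the Johnson/sinh-arcsinh style, with its pdf obtained by differentiating each layer via the chain rule, checked against a central finite difference. The specific cdf `Phi(delta * arcsinh(log(x)))` is a hypothetical stand-in chosen for brevity.

```python
import numpy as np
from scipy.stats import norm

# hypothetical nested cdf in the style of the Johnson QPD family:
# F(x) = Phi(delta * arcsinh(log(x))), for x > 0
def cdf(x, delta=1.5):
    return norm.cdf(delta * np.arcsinh(np.log(x)))

# pdf by applying the chain rule to each nested layer:
# F'(x) = phi(inner) * delta * d/dx[arcsinh(log x)]
#       = phi(inner) * delta / (x * sqrt(1 + log(x)^2))
def pdf(x, delta=1.5):
    u = np.log(x)
    return norm.pdf(delta * np.arcsinh(u)) * delta / (x * np.sqrt(1 + u**2))

# sanity check: analytic pdf agrees with a central finite difference,
# the kind of numerical derivative findiff was previously used for
x = np.linspace(0.5, 3.0, 7)
eps = 1e-6
numeric = (cdf(x + eps) - cdf(x - eps)) / (2 * eps)
assert np.allclose(pdf(x), numeric, atol=1e-6)
```

The analytic version is both exact and vectorized, so no per-entry finite differencing is needed.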

Makes the following changes:

  • refactoring of the three QPD distributions to use skpro machinery:
    • use of the skpro native parameter broadcasting system instead of ad-hoc broadcasting
    • use of the skpro native approximation for mean, var, instead of three copies of similar (and partially duplicative) approximation inside the distributions
  • refactoring between the three QPD distributions, with the aim of simplification
    • refactoring QPD parameter computation into a single, fully vectorized function, _prep_qpd_vars
  • clean room reimplementation of cdf, ppf of the three distributions based on the cyclic_boosting reference
  • new implementation of pdf, as derivative of cdf

As side effects of the rework:

  • all parameters now broadcast in numpy-like fashion, including alpha, lower, upper, which previously was not possible
  • the distributions can be 2D with more than 1 column, which previously was not possible
  • version (the base distribution) can now be an arbitrary continuous scipy distribution
  • pdf is numerically exact
  • the distributions do not have soft dependencies anymore

Regarding the relation to cyclic_boosting:

  • this is clean room reimplementation and credit is given, so I hope this is fine license-wise - @FelixWick?
  • this is the result of trying to remove the findiff dependency for computing the pdf from the cdf, introduced in [ENH] update Johnson QPDistributions with bugfixes and vectorization (cyclic-boosting ver.1.4.0) #232, as well as of cleanup before release. I ended up simplifying a lot and arrived here. In this sense, the work of @setoguchi-naoki was crucial in arriving at this point.
  • I would have no issue at all with you moving the improved code into cyclic_boosting. We can even restore the dependency and maintain the distribution logic in cyclic_boosting if that were your preference, e.g., for ownership reasons.

FYI @FelixWick, @setoguchi-naoki, review appreciated - I would like to remove the findiff dependency before the next release.

@fkiraly fkiraly added enhancement module:probability&simulation probability distributions and simulators labels May 15, 2024
fkiraly added 2 commits May 15, 2024 02:22
This reverts commit c4b4d6e.
This reverts commit 40b9c2b.
fkiraly added a commit that referenced this pull request May 15, 2024
PR #327 shows that the `findiff`
soft dependency is not necessary - with this, I hence remove it.

We should either merge #327, or leave the QPDs with a soft dependency
that users then have to install independently.
@FelixWick

License-wise no concerns from my side.
I'm also fine with keeping the code in this repository.
I will have a closer look at the code tomorrow.

@fkiraly (Collaborator, Author) commented May 15, 2024

Looking forward to the review!

This PR should be considered together with #330, which implements an improved integral heuristic for mean, var.

I think the heuristic in #330 is better than what you had in the distribution (first take derivative of cdf to obtain pdf, then integrate against x) and the current base class heuristic (Monte Carlo based on sample).

Besides the fact that the QPDs rely on heuristics for their mean and variance, there is no dependence between the two PRs, so they can be reviewed and worked on independently.
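For context, a minimal sketch of the two kinds of heuristic discussed here, using a plain scipy normal as a stand-in distribution (not skpro's actual implementation): the mean can be approximated via the standard identity E[X] = ∫₀¹ ppf(p) dp evaluated on a quantile grid, or via the Monte Carlo base-class style, sampling and averaging.

```python
import numpy as np
from scipy.stats import norm

# quadrature-style heuristic: E[X] = integral of ppf(p) over p in (0, 1);
# on a uniform, symmetric grid of quantile levels this is a simple average
p = np.linspace(0.0005, 0.9995, 1999)   # endpoints avoided (ppf diverges there)
quad_mean = norm.ppf(p, loc=2.0).mean()

# Monte Carlo heuristic: draw a sample and average it
rng = np.random.default_rng(0)
mc_mean = rng.normal(loc=2.0, size=100_000).mean()
```

The quadrature estimate is deterministic and typically needs far fewer ppf evaluations than the Monte Carlo version needs samples for comparable accuracy.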

@fkiraly (Collaborator, Author) commented May 16, 2024

Merging for the release, to avoid the findiff dependency. Please let me know if we should change anything; we can do so in the next release or, if necessary, in a hotfix.

@fkiraly fkiraly merged commit dad8793 into main May 16, 2024
34 checks passed
fkiraly added a commit that referenced this pull request May 16, 2024
Deprecation messages for `CyclicBoosting`, covering deprecations and changes
in 2.4.0.
Alternative to #320.

* restores sequence of parameters to sequence in version 2.2.2, to avoid
breakage in the 2.3.0 release
* adds deprecation message for `bound` parameter
* renames the new `version` parameter to `dist_type` - the name is a bit
misleading, as it will probably imply "version" in the semantic
versioning sense to an ordinary user. Does not require deprecation,
because it was added after 2.2.2.

Also makes improvements to the docstring of the `CyclicBoosting`
estimator.

Difference to #320: does not carry out deprecations for distributions,
under a working assumption that #327 will be merged.
in_sinh = np.arcsinh(phi.ppf(p) * delta) + np.arcsinh(n * c * delta)

In most of the cases, you probably want to do a np.outer rather than an element-wise multiplication here. (In the Cyclic Boosting implementation I made that optional.)
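To make the reviewer's point concrete, here is a small sketch of the two behaviours with hypothetical shapes: element-wise multiplication pairs each quantile level with the parameter at the same index, while np.outer evaluates every quantile level against every parameter.

```python
import numpy as np

p = np.array([0.1, 0.5, 0.9])       # quantile levels being queried
delta = np.array([0.5, 1.0, 2.0])   # one parameter value per distribution

elementwise = p * delta             # shape (3,): p[i] paired with delta[i]
pairwise = np.outer(p, delta)       # shape (3, 3): every p against every delta
```

The element-wise variant assumes the quantile levels are already broadcast to the parameter shape; the outer variant produces the full grid in one call.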

@fkiraly (Collaborator, Author) replied:

not in most cases imo, but I do see how this would be useful in many cases.

This is, however, a framework level concern, so I moved it into a new issue: #341

Btw, if an np.ndarray of dim 3 could be passed, numpy's native broadcasting would already provide the kind of behaviour that is described in #341.
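A minimal sketch of that broadcasting behaviour (illustrative shapes, not skpro's actual internals): inserting a leading axis on the quantile levels lets numpy evaluate every level against every entry of a 2D parameter array, producing a 3D result without any explicit outer product.

```python
import numpy as np

p = np.array([0.1, 0.5, 0.9])               # quantile levels
delta = np.arange(1.0, 9.0).reshape(4, 2)   # 2D parameter array (rows, cols)

# a leading axis on p and a leading axis on delta let numpy broadcast
# every quantile level against every (row, col) parameter entry
out = p[:, None, None] * delta[None, :, :]  # shape (3, 4, 2)
```

This is exactly the "outer" behaviour, obtained from numpy's broadcasting rules alone.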


phi = self.phi

in_arcsinh = np.log((x - lower) / theta) / kappa


In most of the cases, you probably want to do a np.outer rather than an element-wise multiplication here. (In the Cyclic Boosting implementation I made that optional.)

cdf = cdf[:, np.newaxis]
var = var_func(x, mean, cdf.T, index.shape[0])
return var
in_cdf = xi + kappa * np.sinh(delta * (phi.ppf(p) + n * c))


In most of the cases, you probably want to do an outer product (e.g., by means of p[:, np.newaxis]) rather than an element-wise multiplication here. (In the Cyclic Boosting implementation I made that optional.)


phi = self.phi

phi_ppf = phi.ppf((x - lower) / rnge)


In most of the cases, you probably want to do an outer product (e.g., by means of p[:, np.newaxis]) rather than an element-wise multiplication here. (In the Cyclic Boosting implementation I made that optional.)

@FelixWick

Drop the sinhlogistic version from the docstring; it was only used in the extended QPD versions (which we dropped).

@FelixWick

There should be no upper bound for QPD_S.

@FelixWick

I would suggest dropping QPD_U, because it is one of the extended QPD versions. I'm not sure it works properly in the given implementation.

ppf_part = ppf_all[t][c]
ppf[r][c] = ppf_part
return ppf
# we work through the chain rule for the entire nested expression in cdf


I didn't check the math, but I assume you have compared the result with the numeric version?

@FelixWick left a comment

I didn't check all details of the reimplementation, but I assume you compared results to make sure it is consistent. Looks good in general. A few comments inline.

@fkiraly (Collaborator, Author) commented May 17, 2024

There should be no upper bound for QPD_S.

The upper bound was already there in the distribution as initially implemented. We can of course remove it, though I am not sure where it is used. I left it because I was apprehensive of breakage from public access to the interface; we will have to remove it via the deprecation process.

@fkiraly (Collaborator, Author) commented May 17, 2024

I didn't check all details of the reimplementation, but I assume you compared results to make sure it is consistent.

Yes, spot checks only though.

There is a test that checks against the case where we expect a uniform distribution.

@setoguchi-naoki (Contributor)

There should be no upper bound for QPD_S.

The upper bound was already there in the distribution as initially implemented. We can of course remove it, though I am not sure where it is used. I left it because I was apprehensive of breakage from public access to the interface; we will have to remove it via the deprecation process.

This was introduced for the use of findiff; it can be removed if findiff is not required.
