
[ENH] Histogram distribution #382

Merged
merged 14 commits into from
Jun 22, 2024
Conversation

ShreeshaM07
Contributor

@ShreeshaM07 ShreeshaM07 commented Jun 11, 2024

Reference Issues/PRs

fixes #323
This PR supersedes #335: it is a fresh PR with the updated main merged in, since #335 had merge conflicts in __init__.py in distributions and in distributions.rst in api_reference.

What does this implement/fix? Explain your changes.

Implements the histogram distribution using bins and bin_mass as the parameters.
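For readers unfamiliar with this parametrization, here is a minimal sketch of how a pdf and cdf follow from bins (the n+1 bin edges) and bin_mass (the n probability masses), assuming mass is spread uniformly within each bin. Function names are hypothetical, not the skpro API:

```python
import numpy as np

def hist_pdf(x, bins, bin_mass):
    """Density of a histogram distribution: mass spread uniformly within each bin."""
    bins = np.asarray(bins, dtype=float)
    bin_mass = np.asarray(bin_mass, dtype=float)
    widths = np.diff(bins)
    # locate the bin containing x; points outside all bins have density 0
    idx = np.searchsorted(bins, x, side="right") - 1
    if idx < 0 or idx >= len(bin_mass):
        return 0.0
    return bin_mass[idx] / widths[idx]

def hist_cdf(x, bins, bin_mass):
    """CDF: full mass of bins below x, plus a linear fraction of the bin containing x."""
    bins = np.asarray(bins, dtype=float)
    bin_mass = np.asarray(bin_mass, dtype=float)
    if x <= bins[0]:
        return 0.0
    if x >= bins[-1]:
        return 1.0
    idx = np.searchsorted(bins, x, side="right") - 1
    frac = (x - bins[idx]) / (bins[idx + 1] - bins[idx])
    return bin_mass[:idx].sum() + bin_mass[idx] * frac
```

For example, with bins [0, 1, 2] and bin_mass [0.5, 0.5], the density is 0.5 everywhere on [0, 2] and the cdf at 1.5 is 0.75.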

Does your contribution introduce a new dependency? If yes, which one?

No

Did you add any tests for the change?

Yes

Any comments

All discussion regarding the histogram distribution took place in PR #335.

PR checklist

For all contributions
  • I've added myself to the list of contributors with any new badges I've earned :-)
    How to: add yourself to the all-contributors file in the skpro root directory (not CONTRIBUTORS.md). Common badges: code (fixing a bug, or adding code logic), doc (writing or improving documentation or docstrings), bug (reporting or diagnosing a bug; get this plus code if you also fixed the bug in the PR), maintenance (CI, test framework, release).
    See here for full badge reference
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
For new estimators
  • I've added the estimator to the API reference - in docs/source/api_reference/taskname.rst, follow the pattern.
  • I've added one or more illustrative usage examples to the docstring, in a pydocstyle compliant Examples section.
  • If the estimator relies on a soft dependency, I've set the python_dependencies tag and ensured
    dependency isolation, see the estimator dependencies guide.

@ShreeshaM07 ShreeshaM07 mentioned this pull request Jun 11, 2024
5 tasks
@fkiraly
Collaborator

fkiraly commented Jun 12, 2024

merged #381 into main, and then from main into this.

Collaborator

@fkiraly fkiraly left a comment


Great! Really nice work, I like the design thinking in the new base class.

Before we merge: the current design does not allow overriding the public methods such as pdf without diverging from the template pattern, see https://github.com/kamranahmedse/design-patterns-for-humans for the pattern.

Could you think of the best way to avoid the override? Or, let me know if you think this is not possible with the current design.

We do not need to merge the two base classes together btw, just avoid the override.
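For context, the template pattern under discussion can be sketched as follows (hypothetical class names, not the actual skpro base classes): the public method is fixed on the base class and delegates to a private extension point, so concrete distributions override only _pdf, never pdf itself.

```python
class BaseDistribution:
    """Base class illustrating the template pattern."""

    def pdf(self, x):
        # shared boilerplate (input checks, broadcasting, ...) would live here;
        # the public method itself is never overridden by subclasses
        return self._pdf(x)

    def _pdf(self, x):
        # private extension point for concrete distributions
        raise NotImplementedError

class MyDistribution(BaseDistribution):
    """Toy uniform-on-[0, 1] distribution overriding only the private method."""

    def _pdf(self, x):
        return 1.0 if 0 <= x <= 1 else 0.0
```

Calling MyDistribution().pdf(0.5) then routes through the base class's public pdf into the subclass's _pdf.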

@ShreeshaM07
Contributor Author

I hope this will suffice as I am not overriding any public methods anymore.

@ShreeshaM07
Contributor Author

@fkiraly , all checks have passed now that overriding of public methods is avoided as well. Are we waiting on anything for this merge?

@ShreeshaM07
Contributor Author

ShreeshaM07 commented Jun 18, 2024

@fkiraly , I modified the mean and variance to be calculated after vectorizing the bins and bin_mass, and found the efficiency improved by a factor of only around 5 to 6. This excludes the time taken to vectorize the ragged inputs.

These are the outputs (timing screenshots attached in the PR):

  • current mean time: curr_mean
  • vectorized mean time: vectorized_mean
  • current variance time: curr_var
  • vectorized variance time: vectorized_var
This involved changing most of the code just for the mean and variance, which do not have all the if conditions present in pdf, cdf, ppf, etc.; vectorizing those would be somewhat more complex. I am not sure the effort is worth the reward. If we really want it, we could do it in a separate PR, as it will involve a lot of changes to the code.

Thoughts?
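The closed-form mean and variance being vectorized here can be sketched as follows, assuming mass is uniform within each bin (a hypothetical standalone function, not the PR's actual implementation):

```python
import numpy as np

def hist_mean_var(bins, bin_mass):
    """Mean and variance of a histogram distribution, vectorized over bins.

    Treats mass as uniform within each bin, so E[X] weights bin midpoints
    by bin mass, and E[X^2] uses the uniform-per-bin second moment.
    """
    b = np.asarray(bins, dtype=float)
    m = np.asarray(bin_mass, dtype=float)
    lo, hi = b[:-1], b[1:]
    mean = np.sum(m * (lo + hi) / 2)                  # E[X]
    ex2 = np.sum(m * (lo**2 + lo * hi + hi**2) / 3)   # E[X^2]
    return mean, ex2 - mean**2
```

As a sanity check, bins [0, 1] with mass [1.0] is the standard uniform distribution, giving mean 0.5 and variance 1/12.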

@fkiraly
Collaborator

fkiraly commented Jun 20, 2024

A factor of 5 or 6 is considered substantial, so I would advise we check how much time the conversion of arrays requires.

I would then proceed based on the outcome:

  • if overall runtimes are comparable, I would advise we merge whatever functions we have currently vectorized, and open an issue, since this would mean a different parameter representation might eventually make the computations more performant.
  • if overall runtime with conversion is worse, I would advise we pin the matter in the form of an issue and merge the current version.
  • If it still is a substantial speed-up, I would advise vectorizing a few key methods in addition.

@ShreeshaM07
Contributor Author

This includes the vectorization time, with two cases of how it is included: case 1, where inputs are vectorized in separate functions rather than in init, and case 4, where they are vectorized in init and not in separate functions.

  • mean (timing screenshots attached in the PR)

    • vectorization time included in each function: vec_mean_with_vec_time
    • vectorized, vectorization time not included: vectorized_mean
    • no vectorization at all: curr_mean
    • vectorized in init itself, not in separate methods (i.e. time to vectorize + time to run the mean operations after vectorizing): vec_init_mean
  • variance

    • vectorization time included in each function: vec_var_with_vec_time
    • vectorized, vectorization time not included: vectorized_var
    • no vectorization at all: curr_var
    • vectorized in init itself, not in separate methods (i.e. time to vectorize + time to run the variance operations after vectorizing): vec_init_var

As you can see, it actually gets worse when vectorization happens in __init__. Vectorizing these ragged inputs itself takes about 0.0002 secs, after which mean, var, etc. are calculated. So even though calculating variance etc. is optimised by a factor of 5 or 6, the initialization itself takes a lot of time.
This falls under category 2:

if overall runtime with conversion is worse, I would advise we pin the matter in the form of an issue and merge the current version.

Please do give your take based on your inference of the data.
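The conversion cost being measured here is the kind incurred when padding ragged per-row bins into a dense 2D array. A minimal sketch of such a conversion (hypothetical helper, not the PR's code):

```python
import numpy as np

def pad_ragged(rows, fill=np.nan):
    """Pad a ragged list of 1D sequences into a dense 2D float array.

    Rows shorter than the widest row are filled with NaN, so row-wise
    reductions can later use np.nansum, np.nanmean, etc. This one-off
    padding is the conversion overhead discussed above.
    """
    width = max(len(r) for r in rows)
    out = np.full((len(rows), width), fill, dtype=float)
    for i, r in enumerate(rows):
        out[i, : len(r)] = r
    return out
```

For example, pad_ragged([[1, 2], [3, 4, 5]]) yields a (2, 3) array with a NaN in the first row's last slot; whether this up-front cost pays off depends on how many vectorized reductions follow it.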

@fkiraly
Collaborator

fkiraly commented Jun 21, 2024

Hm, I would then agree that the payoff is perhaps not worth the effort at the moment.

I would suggest to summarize the findings and outstanding questions in an issue, and perhaps at a later time point review together:

  • the design of the intermediate base class from this PR
  • the designs and parametrizations of the histogram distribution and two similar distributions: Empirical, and Mixture

@ShreeshaM07
Contributor Author

Hm, I would then agree that the payoff is perhaps not worth the effort at the moment.

I would suggest to summarize the findings and outstanding questions in an issue, and perhaps at a later time point review together:

  • the design of the intermediate base class from this PR
  • the designs and parametrizations of the histogram distribution and two similar distributions: Empirical, and Mixture

Yeah I will summarize the findings of this PR in an issue.

@fkiraly fkiraly merged commit 5afdcec into sktime:main Jun 22, 2024
36 checks passed