-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Histogram distribution #335
Conversation
I have just implemented a minimal code so far. Would these parameters be appropriate where I am taking the parameters as increase the input array to the range
Does this idea seem fine? |
Hm, I do not quite get the parametrization here. We will have to say - let's say, in the scalar case only - where the bins start/end, and how much mass is in them. How is this achieved given the parameters? I'm not sure I understand, explanation would be appreciated! An example might help - let's say I have two bins, one from 0.5 to 2, and one from 2 to 7. The first bin has mass 0.3, the second 0.7. How would I construct the distribution? |
So in this example the bin_width would be passed as [1.5,5] and bin_density = [0.3,0.7] and when I pass the x values which is a 1D array/dataframe. Now for finding the values to split the axis into bins I will be using the min(x) and max(x) . But I think there will be a flaw in this way of doing it. It will be better to implement it using I will work on this tomorrow as I will not be working today hence will not be attending the stand up today please excuse me. |
Agreed, as you do not know where the first bin starts, this way, ie, at 0.5. I also think your new suggestion is better! |
I have done it that way but I still am not sure if we can handle a single
Output:
Does this parameterization make sense? |
Yes, I think it does make sense! I would rename the parameter |
skpro/distributions/histogram.py
Outdated
bin_density = np.array(self.bin_density.copy()) | ||
bins = self.bins | ||
pdf = [] | ||
if isinstance(bins, list): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
main comment, this looks correct, but it is quite inefficient due to the use of loops.
I would strongly advise to use numpy
methods for everything.
For instance, for bin widths, use diff
.
To find the bin in which the x-value falls, you could use cumsum
and np.where
with >
.
skpro/distributions/histogram.py
Outdated
""" | ||
bins = self.bins | ||
# 1 is the cumulative sum of all bin_mass | ||
return 1 / (max(bins) - min(bins)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's not correct? Also, you need to be careful about the different cases of bins
being int or iterable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will take care of the bins
cases. But in the case where bins
has the bin edges then shouldn't this be the mean as mean = sum(bin_width*bin_height)/sum(bin_width)
,the numerator is basically area under the histogram which is = 1
and the sum of bin_width would be the range of the bins
values thus = max(bins)- min(bins)
. Is that incorrect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please review if what I have considered for the mean
and var
is correct or do I have to use E[X] = μ=∫∞−∞x*pdf(x)dx across all the different pdfs for the different bins?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your formula is simply incorrect.
The correct formula for mean is:
Let
The mean of the histogram distirbution is then
which you can obtain by applying np.dot
and a shifted sum.
(this is obtained if you substitute pdf(x) into your formula and carry out the integration correctly)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for "easy" computation of the mean and variance, you can use that the histogram distribution is the same as the two-step conditional where you first sample which bin you are in, with probabilities
Use the conditional formulae for mean and variance on this idea - this also shows why the mean has the above form, as the weighted mean of means of uniform distributions on the bin intervals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes that is correct, I've made a mistake I will correct it now. Thanks for the help.
A suggestion for another way to input |
Yes, I think that would be better than just Though, how would we distinguish this from an iterable of length 3, i.e., the other convention on giving the boundaries? Let's say we have start = 0, end = 1, number of bins = 2. This could also be two bins, 0 to 1 and 1 to 2. |
We can use Also any idea what I should be doing to make the CI tests pass?
without the
but if I run it with the
Is it perhaps related to the size of |
The CI results indicate that you ought to set the |
I have done that but there when I enable
Instead if I disable it and then run
From some debugging I have done it stops working when Any suggestions? |
This is probably because scalar histogram distributions have dimension 1 parameters - so the default broadcasting will not work. We either need to broadcast manually, or separate bin start/step/end into separate parameters. I would recommend going back to the drawing board quickly, and think about the parameterization in the case of array distributions. The closest distribution is perhaps |
From what I have discovered so far on running the
upon changing the
it gives
Upon observing this we can see that nothing is returned/printed when
If we can figure out why this Maybe we should try making a new |
Yes, I think a custom However, I also think |
Can you please give suggestions how I can identify if it is a single array distribution ?
This is technically equivalent to scalar input in other distributions like
it converts it in _get_bc_params_dict and stores it like
Currently I have So can you please suggest how I can efficiently check whether it is a single array distribution( |
Thanks for the quick spotting of the |
@fkiraly , In the |
All the checks have passed except the I will make another PR for |
fixes #323 PR is a new one with updated main merged with #335 as there were some merge conflicts in `__init__.py` in distributions and also `distributions.rst` in `api_reference`. #### What does this implement/fix? Explain your changes. Implements the histogram distribution using bins and bin_mass as the parameters.
Reference Issues/PRs
fixes #323
What does this implement/fix? Explain your changes.
Implements the histogram distribution using bin_width and bin_density as the parameters.
Does your contribution introduce a new dependency? If yes, which one?
No
What should a reviewer concentrate their feedback on?
Whether the chosen parameters are suitable to be used.
Did you add any tests for the change?
No
PR checklist
For all contributions
How to: add yourself to the all-contributors file in the
skpro
root directory (not theCONTRIBUTORS.md
). Common badges:code
- fixing a bug, or adding code logic.doc
- writing or improving documentation or docstrings.bug
- reporting or diagnosing a bug (get this pluscode
if you also fixed the bug in the PR).maintenance
- CI, test framework, release.See here for full badge reference
For new estimators
docs/source/api_reference/taskname.rst
, follow the pattern.Examples
section.python_dependencies
tag and ensureddependency isolation, see the estimator dependencies guide.