Added ufloat_from_sample function #277
base: master
Conversation
uncertainties uses pytest; running pytest will run the tests. See https://docs.pytest.org/en/stable/getting-started.html for a getting-started resource. You'll want to add a test or tests similar to those in uncertainties/tests/test_unumpy.py (line 277 at 969324d), testing some example calculations using assert result_of_your_function == expected_result.

Thank you.
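A minimal sketch of what such a test could look like, in the style of the existing test files. The import path is an assumption based on this PR (which adds the function to core.py), and since the exact std_dev convention is still under discussion below, only the mean is asserted here:

```python
# Hypothetical test sketch for the function proposed in this PR.
# The import path assumes the function lives in uncertainties/core.py.
from uncertainties.core import ufloat_from_sample


def test_ufloat_from_sample_mean():
    samples = [2, 4, 4, 4, 5, 5, 7, 9]
    result = ufloat_from_sample(samples)
    # The mean of these samples is exactly 5.0; the appropriate std_dev
    # (population, sample, or standard error) is debated later in this thread.
    assert result.nominal_value == 5.0
```

Running pytest from the repository root will collect and run tests written this way.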
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master     #277      +/-   ##
==========================================
+ Coverage   96.50%   96.52%   +0.01%
==========================================
  Files          16       16
  Lines        1919     1926       +7
==========================================
+ Hits         1852     1859       +7
  Misses         67       67
```

View full report in Codecov by Sentry.
See my comments at #276. Not sure if the discussion should continue here or there. I'm initially wary about adding this in. I see the utility, but I wonder if…
I like this, and the use of the builtin statistics module when Numpy is not available.

Following the precedent from that statistics package, I recommend ufloat_from_samples over ufloat_from_sample. See https://docs.python.org/3/library/statistics.html#statistics.NormalDist.from_samples

I would say this is OK in core.py .... we could consider moving it somewhere if needed.
```python
    otypes=[object],
)(mean_value, error_on_mean)
else:
    msg={
```
Maybe just an f-string:

```python
msg = f"method='{method}' is not implemented"
```
```python
if numpy is None:
    #if numpy is not present, use pythons statistics functions instead
    mean_value=stats_mean(sample)
    error_on_mean=stats_stdev(sample)/sqrt(len(sample)-1)
```
I believe statistics.stdev already divides by sqrt(len(samples)-1), and it should not be done again here.

But also, I think we will need to discuss/document this, and perhaps have an option for using the "-1" ("Bessel's -1"), perhaps adding a ddof argument (default=0). I think the default should be to match numpy.std(), which divides the variance by len(samples).
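For reference (not code from the PR), a quick comparison of the two libraries' conventions: statistics.pstdev uses the n denominator, statistics.stdev uses n-1, and numpy.std uses n-ddof with ddof defaulting to 0:

```python
import statistics
import numpy as np

samples = [2, 4, 4, 4, 5, 5, 7, 9]

print(statistics.pstdev(samples))   # 2.0       (denominator n)
print(statistics.stdev(samples))    # 2.138...  (denominator n - 1)
print(np.std(samples))              # 2.0       (ddof=0, denominator n)
print(np.std(samples, ddof=1))      # 2.138...  (denominator n - 1)
```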
Yes, you are right, I mistakenly assumed that statistics.stdev behaved the same as numpy.std does without any optional arguments.

Adding ddof=0 as an optional parameter should be simple. For the non-numpy code I can use statistics.pstdev and then include "-ddof" in the denominator.
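A sketch of that idea (not the PR code; sample_std is a hypothetical helper name): since statistics.pstdev divides the sum of squared deviations by n, the ddof-adjusted standard deviation can be recovered by rescaling by sqrt(n / (n - ddof)):

```python
from math import sqrt
from statistics import pstdev


def sample_std(sample, ddof=0):
    # Hypothetical pure-python helper: standard deviation with an arbitrary
    # ddof, matching numpy.std(sample, ddof=ddof). pstdev uses the n
    # denominator, so rescale the result by sqrt(n / (n - ddof)).
    n = len(sample)
    return pstdev(sample) * sqrt(n / (n - ddof))


print(sample_std([2, 4, 4, 4, 5, 5, 7, 9]))          # 2.0
print(sample_std([2, 4, 4, 4, 5, 5, 7, 9], ddof=1))  # 2.138...
```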
```python
else:
    sample_size=numpy.shape(sample)[axis]

    error_on_mean=numpy.std(sample, ddof=1, axis=axis)/numpy.sqrt(sample_size)
```
I think the plain ndarray.std() should be used by default, and ddof should default to 1, and that dividing by the sample size here is not correct.
Sorry, I don't understand what you mean. The error on the mean is the standard deviation of the sample divided by the square root of the sample size.
Sorry, I meant ddof should default to 0, as in numpy.std().
Dividing by sqrt(len(samples)) gives the standard deviation of the mean, not the standard deviation of the value.
The standard deviation of the mean tells you how much you expect the mean value to change if you select a different set of samples/measurements from a larger pool of values.
For propagating uncertainties, you (usually, normally?) want the standard deviation of the value itself.
I think two good rules of thumb are "what does Wikipedia say?" and "what does numpy do?".
Taking the basic example at https://en.wikipedia.org/wiki/Standard_deviation#Basic_examples:
>>> import numpy as np
>>> samples = [2, 4, 4, 4, 5, 5, 7, 9]
>>> print(np.mean(samples), np.std(samples))
5.0 2.0
>>> print(np.mean(samples), np.std(samples, ddof=1))
5.0 2.13808993529939
which agrees with Wikipedia. I would be reluctant to have

>>> ufloat_from_samples([2, 4, 4, 4, 5, 5, 7, 9])

return anything except ufloat(5.0, 2.0) - that would require extra explanation and documentation.

An option to divide by sqrt(len(samples)-ddof) would be OK.
Thank you, I understand now.

I originally intended the function to return the mean and standard error on the mean. This is because if I am measuring a constant value where my measurement apparatus has some random error, then my measurements are a sample from a normal distribution, where the best estimate for the value is the mean of the sample and the uncertainty is the uncertainty on the mean.

Perhaps it could be a method "from measurement". I think this would be clearer anyway.
@Myles244 if your samples are from a normal distribution, and you are measuring noisy values from an experimental measurement, then I think you do want to use the mean and the standard deviation (as from numpy.mean() and numpy.std()). Those give the average (and, for normal distributions, most likely) value and the appropriate measure of the variation observed.
The standard deviation of the mean is a useful quantity, but I think it is not what you want here. Let's take the mass of Christmas trees (and just to keep things jolly, let's pretend those have Gaussian distributions), and say the total sample size in the survey is 1 million trees. A random sample of 100 would be small, but it would give a mean and standard deviation and not be terrible. If you select a different 100 samples, you would get a mean and standard deviation that was pretty consistent with the first sample of 100. If you increase the sample size to 1000, you would not expect the mean or the standard deviation to change by much from the sample of 100. But if you took a second sample of 1000, you would expect that mean value to be closer to the mean for the first 1000 than was the case for the two samples of 100.
Another classic way to think about it is with test scores across many classrooms: give the same test to all students in 10 classrooms of 25 students each. The standard deviation of the 250 scores tells you the variation between students. The standard deviation of the mean of each classroom tells you the variation between classrooms. One is a measure of the students; the other is a measure of the teachers (or a bias in how students are assigned classrooms, so the administration, perhaps) ;).
If you are measuring voltages and see a distribution [2, 4, 4, 4, 5, 5, 7, 9], the standard deviation is 2. If you repeat the measurements 100 times and the values are [[2, 4, 4, 4, 5, 5, 7, 9]*100], the standard deviation is still 2.
OTOH, if those are 100 different samples of a set of 8 different, possibly related, voltages, then the standard deviation of the mean of those 100 observations of 8 voltages is now much smaller than 2 -- the mean value is very consistent (across those 100 "classrooms"). But to know the difference, I think you would need to provide the 100 means and standard deviations.
I think the most natural and common use case would be to have a bunch of samples of a single quantity, in which case the standard deviation is wanted. An option to divide by sqrt(len(samples)-ddof) would be OK.
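To put rough numbers on that picture (an illustration with simulated data, not code from the PR or the library): the spread of the individual noisy measurements stays near the distribution's standard deviation, while the spread of the per-group means shrinks by roughly the square root of the group size:

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 "classrooms" of 8 noisy measurements each, drawn from N(5, 2)
groups = rng.normal(loc=5.0, scale=2.0, size=(100, 8))

print(np.std(groups))               # ~2.0: variation of the individual values
print(np.std(groups.mean(axis=1)))  # ~2/sqrt(8) ~ 0.7: variation of the per-group means
```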
@jagerber48 I think I have to agree with you; clearly, the function is ambiguous, and perhaps it would be better not to include it.

Thanks anyway.

P.S. Should I close the issue or leave it open?
I think you can leave it open for now so we can continue to consider it for at least a few days. While I like the idea of having this as a constructor on UFloat, I think it is too specific and there are too many different things people don't want. I'm more sympathetic to the idea of a module with a variety of useful functions like this one. But there they could be named and documented more specifically, e.g. ufloat_from_sample_sem, ufloat_from_sample_std, etc.

We could include e.g. those two constructor functions on UFloat. But in any case, my opinion is that this "utility" work should be held off until we are able to complete some of the "core" maintenance targets we have on uncertainties (re-architect the core classes, rework numpy integration, eliminate reliance on confusing monkey patching, etc.). In the lifetime of uncertainties this new group of maintainers is very new, so I think it makes sense to make changes, especially API changes, slowly as we learn.
Yes, please leave this open.

The wiki page mentions the ambiguity:

When only a sample of data from a population is available, the term standard deviation of the sample or sample standard deviation can refer to either the above-mentioned quantity as applied to those data, or to a modified quantity that is an unbiased estimate of the population standard deviation (the standard deviation of the entire population).

So I agree with having explicit names like ufloat_from_sample_sem, ufloat_from_sample_std, or a required keyword.
@andrewgsavage to be clear, that snippet is from the wikipedia page on standard deviation and is referring to the distinction between the sample standard deviation and an unbiased estimator for the population/distribution standard deviation.
Neither of these are the standard error on the mean, which is the uncertainty of an unbiased estimator of the population mean calculated using the sample mean.
So there are three possible options for what the std_dev attribute on the returned UFloat should be:

- The sample standard deviation
- An unbiased estimator for the population/distribution standard deviation calculated using ddof != 0
- The standard error on the mean

I agree that these could be selected by using different functions or a required keyword argument.
If we want this stuff to be in uncertainties then here's my proposal:

- We make a new module named utils.py (I don't think it should be named utils.py, we should try to come up with something better, just using that as a placeholder).
- This module has one function called ufloat_from_sample_std_dev which accepts a ddof argument, matching the numpy API.
- This module has another function called ufloat_from_sample_std_err. I don't think any other arguments are necessary other than the samples.
I express a preference for following numpy conventions rather than the built-in statistics module's conventions. This is because numpy is better documented than the statistics module. It looks like the statistics module is doing some guessing under the hood about what ddof should equal, but it's not really explaining it.
These functions should all work as expected when numpy is unavailable. In fact, at first pass, maybe they should be implemented assuming numpy is not available. Then we can do benchmarking to see if numpy makes a measurable performance difference. If so, we can configure the code to use numpy if it is available.
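A minimal pure-python sketch of the two proposed helpers, under the assumptions above (the names come from this proposal, nothing here is final API, and the formulas follow the conventions discussed earlier in the thread):

```python
from math import sqrt
from statistics import mean, pstdev, stdev

from uncertainties import ufloat


def ufloat_from_sample_std_dev(samples, ddof=0):
    # std_dev is the sample standard deviation with a numpy-style ddof:
    # pstdev uses the n denominator, so rescale to n - ddof.
    n = len(samples)
    return ufloat(mean(samples), pstdev(samples) * sqrt(n / (n - ddof)))


def ufloat_from_sample_std_err(samples):
    # std_dev is the standard error on the mean: stdev (ddof=1) / sqrt(n).
    return ufloat(mean(samples), stdev(samples) / sqrt(len(samples)))
```

With the Wikipedia samples, ufloat_from_sample_std_dev([2, 4, 4, 4, 5, 5, 7, 9]) would give 5.0 with std_dev 2.0, matching the earlier numpy example.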
I do NOT think this function should support numpy arrays, at least at first pass. We are still sorting out interoperability with numpy. I think there's a decent chance we'll do something the wrong way and have to redo it if we try to include numpy support for these helper functions. In other words: I could tolerate including helper functions to create UFloat, but not helper functions to create arrays of UFloat or UArray or something.
Other functions can be moved into utils.py (under whatever name it gets):

- The functions to create sequences of UFloat from nominal values and a covariance or correlation matrix
- The functions to calculate the covariance or correlation matrix from a sequence of UFloat
- The ufloat_from_sample_std_dev function
- The ufloat_from_sample_std_err function
- The uncertainty-weighted mean calculation in unumpy.average: init #265
Note that one downside of this proposal is that it misses the "slick" opportunity to include e.g. UFloat.from_sample_std_dev. But this is ok. At this time, we don't construct UFloats with the class constructor directly anyways; it is done with the ufloat() helper function. In the future we can consider moving away from ufloat and make UFloat the direct constructor. At that time we could consider adding helper methods onto the class for alternative construction methods. But right now, since construction is done through a function, I think it makes sense for alternative construction to also be done through a function.
@jagerber48 @Myles244 @andrewgsavage I have to admit that, even after re-visiting these comments, I am extremely uncomfortable using the standard deviation of the mean here. If there is a function to convert a set/list/ndarray of values to a ufloat and no other information is provided, I think that it must use the mean and standard deviation of the values supplied. I would need a lot of convincing. I am OK with having a function like this:

```python
def ufloat_from_samples(vals, ddof=0):
    if np:
        avals = np.array(vals)
        return ufloat(avals.mean(), avals.std(ddof=ddof))
    else:
        std = statistics.pstdev(vals) if ddof == 0 else statistics.stdev(vals)  # only allows ddof=0,1
        return ufloat(statistics.mean(vals), std)
```

A ufloat represents a mean or most likely value from a distribution and the standard deviation of that distribution, so that uncertainties (of the distribution!) can be propagated. Indeed, the docstrings and the docs for this code call the second quantity of a ufloat "the standard deviation" throughout, and the variable is called std_dev.

I do not see a need for a ufloat constructed from the standard error or the standard deviation of the mean. While the standard deviation of the mean is a useful statistic, it is not used for error propagation.
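For reference (not part of the PR), with the Wikipedia samples from earlier in the thread and numpy available, that sketch behaves like this:

```python
>>> v = ufloat_from_samples([2, 4, 4, 4, 5, 5, 7, 9])
>>> print(v.nominal_value, v.std_dev)
5.0 2.0
```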
We're all in agreement now that it would not be appropriate for ufloat_from_sample to return a ufloat whose std_dev is the standard error on the mean by default.

Of course, the option remains to have no function that returns a ufloat whose std_dev is the standard error on the mean. But I say that if we do this then we should also exclude a function to return a ufloat whose std_dev is the sample standard deviation.
I'm unsure how to write or run tests, so if someone could point me in the direction of a good resource, it would be much appreciated.