
Added ufloat_from_sample function #277

Open · wants to merge 21 commits into master
Conversation

@Myles244 commented Dec 16, 2024

I'm unsure how to write or run tests, so if someone could point me in the direction of a good resource, it would be much appreciated.

@andrewgsavage (Contributor) commented:

uncertainties uses pytest.

pip install ".[test]"      # install the test dependencies
pytest .                   # run the tests

See https://docs.pytest.org/en/stable/getting-started.html for a getting-started resource.

You'll want to add a test (or tests) similar to

def test_component_extraction():

testing some example calculations, using assert result_of_your_function == expected_result.
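For instance, a minimal sketch of such a test (the import path and the exact assertion are placeholders for whatever this PR ends up exposing):

from uncertainties.core import ufloat_from_sample  # placeholder import path

def test_ufloat_from_sample():
    # the mean is the uncontroversial half of the result
    result = ufloat_from_sample([2, 4, 4, 4, 5, 5, 7, 9])
    assert result.nominal_value == 5.0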

@Myles244 (Author) commented:

Thank you.
I have now completed the following:

  • checked that it still passes all tests.
  • written an appropriate test for this new function.

codecov bot commented Dec 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.52%. Comparing base (969324d) to head (82842e9).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #277      +/-   ##
==========================================
+ Coverage   96.50%   96.52%   +0.01%     
==========================================
  Files          16       16              
  Lines        1919     1926       +7     
==========================================
+ Hits         1852     1859       +7     
  Misses         67       67              
| Flag | Coverage Δ |
|------|------------|
| macos-latest-3.10 | 94.96% <100.00%> (+0.01%) ⬆️ |
| macos-latest-3.11 | 94.96% <100.00%> (+0.01%) ⬆️ |
| macos-latest-3.12 | 94.96% <100.00%> (+0.01%) ⬆️ |
| macos-latest-3.8 | 94.95% <100.00%> (+0.01%) ⬆️ |
| macos-latest-3.9 | 94.95% <100.00%> (+0.01%) ⬆️ |
| no-numpy | 74.50% <0.00%> (-0.28%) ⬇️ |
| ubuntu-latest-3.10 | 94.96% <100.00%> (+0.01%) ⬆️ |
| ubuntu-latest-3.11 | 94.96% <100.00%> (+0.01%) ⬆️ |
| ubuntu-latest-3.12 | 94.96% <100.00%> (+0.01%) ⬆️ |
| ubuntu-latest-3.8 | 94.95% <100.00%> (+0.01%) ⬆️ |
| ubuntu-latest-3.9 | 94.95% <100.00%> (+0.01%) ⬆️ |
| windows-latest-3.10 | 94.96% <100.00%> (+0.01%) ⬆️ |
| windows-latest-3.11 | 94.96% <100.00%> (+0.01%) ⬆️ |
| windows-latest-3.12 | 94.96% <100.00%> (+0.01%) ⬆️ |
| windows-latest-3.8 | 94.95% <100.00%> (+0.01%) ⬆️ |
| windows-latest-3.9 | 94.95% <100.00%> (+0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

@jagerber48 (Contributor) commented:

See my comments at #276. Not sure if the discussion should continue here or there. I'm initially wary about adding this in. I see the utility, but I wonder if uncertainties wants to get in the business of adding small helper functions like this or not. Also it needs to be made clear (via function names or documentation) that this function should only really be used with data that are known to be normally or approximately normally distributed.

@newville (Member) left a comment:

I like this, and the use of the built-in statistics module when numpy is not available.

Following the precedent from that statistics package, I recommend
ufloat_from_samples over ufloat_from_sample.
See https://docs.python.org/3/library/statistics.html#statistics.NormalDist.from_samples

I would say this is OK in core.py... we could consider moving it somewhere else if needed.

        otypes=[object],
    )(mean_value, error_on_mean)
else:
    msg={
@newville (Member):

Maybe just an f-string:

     msg = f"method='{method}' is not implemented"

if numpy is None:
    # if numpy is not present, use Python's statistics functions instead
    mean_value = stats_mean(sample)
    error_on_mean = stats_stdev(sample) / sqrt(len(sample) - 1)
@newville (Member):

I believe statistics.stdev already applies the 1/(len(samples) - 1) factor inside the square root, and it should not be applied again here.

But also, I think we will need to discuss/document this, and perhaps have an option for using the "-1" (Bessel's correction), perhaps by adding a ddof argument (default=0). I think the default should match numpy.std(), which divides the summed squared deviations by len(samples).
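For reference, a quick check of how the two libraries' defaults differ (output values approximate):

import statistics
import numpy as np

samples = [2, 4, 4, 4, 5, 5, 7, 9]
print(np.std(samples))             # 2.0     (divides summed squared deviations by N)
print(np.std(samples, ddof=1))     # ~2.1381 (Bessel's correction, N - 1)
print(statistics.pstdev(samples))  # 2.0     (matches numpy's default)
print(statistics.stdev(samples))   # ~2.1381 (matches ddof=1)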

@Myles244 (Author):

Yes, you are right. I mistakenly assumed that statistics.stdev behaved the same as numpy.std does without any optional arguments.

Adding ddof=0 as an optional parameter should be simple. For the non-numpy code I can use statistics.pstdev and then include the -ddof term in the denominator.
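A sketch of that approach (the helper name is hypothetical): rescale statistics.pstdev so the summed squared deviations are divided by n - ddof instead of n.

from math import sqrt
from statistics import pstdev

def _stdev_with_ddof(sample, ddof=0):
    # pstdev() divides the summed squared deviations by n;
    # rescale so they are divided by (n - ddof) instead.
    n = len(sample)
    return pstdev(sample) * sqrt(n / (n - ddof))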

else:
    sample_size = numpy.shape(sample)[axis]

    error_on_mean = numpy.std(sample, ddof=1, axis=axis) / numpy.sqrt(sample_size)
@newville (Member):

I think the plain ndarray.std() should be used by default, and ddof should default to 1, and that dividing by the sample size here is not correct.

@Myles244 (Author):

Sorry, I don't understand what you mean. The error on the mean is the standard deviation of the sample divided by the square root of the sample size.

@newville (Member):

Sorry, I meant ddof should default to 0, as in numpy.std().

Dividing by sqrt(len(samples)) gives the standard deviation of the mean, not the standard deviation of the value.
The standard deviation of the mean tells you how much you expect the mean value to change if you select a different set of samples/measurements from a larger pool of values.

For propagating uncertainties, you (usually, normally?) want the standard deviation of the value itself.

I think two good rules of thumb are "what does Wikipedia say?" and "what does numpy do?".
Taking the basic example at https://en.wikipedia.org/wiki/Standard_deviation#Basic_examples:

>>> import numpy as np
>>> samples = [2, 4, 4, 4, 5, 5, 7, 9]
>>> print(np.mean(samples), np.std(samples))
5.0 2.0
>>> print(np.mean(samples), np.std(samples, ddof=1))
5.0 2.13808993529939

which agrees with Wikipedia. I would be reluctant to have

>>> ufloat_from_samples([2, 4, 4, 4, 5, 5, 7, 9])

return anything except ufloat(5.0, 2.0) - that would require extra explanation and documentation.

An option to divide by sqrt(len(samples)-ddof) would be OK.

@Myles244 (Author) commented Dec 22, 2024:

Thank you, I understand now.

I originally intended the function to return the mean and standard error on the mean. This is because if I am measuring a constant value where my measurement apparatus has some random error, then my measurements are a sample from a normal distribution, where the best estimate for the value is the mean of the sample and the uncertainty is the uncertainty on the mean.

Perhaps it could be a method "from measurement". I think this would be clearer anyway.

@newville (Member):

@Myles244 if your samples are from a normal distribution, and you are measuring noisy values from an experimental measurement, then I think you do want to use the mean and the standard deviation (as from numpy.mean() and numpy.std()). Those give the average (and most likely, for normal distributions) and the appropriate measure of the variation observed.

The standard deviation of the mean is a useful quantity, but I think it is not what you want here. Let's take the mass of Christmas trees (and, just to keep things jolly, let's pretend those have Gaussian distributions), where the total population in the survey is 1 million trees. A random sample of 100 would be small, but would give a mean and standard deviation, and not be terrible. If you selected a different 100 samples, you would get a mean and standard deviation pretty consistent with the first sample of 100. If you increased the sample size to 1000, you would not expect the mean or the standard deviation to change by much from the sample of 100. But if you took a second sample of 1000, you would expect that mean value to be closer to the mean of the first 1000 than was the case for the two samples of 100.

Another classic way to think about it is with test scores across many classrooms: give the same test to all students in 10 classrooms of 25 students each. The standard deviation of the 250 scores tells you the variation between students. The standard deviation of the mean of each classroom tells you the variation between classrooms. One is a measure of the students; the other is a measure of the teachers (or a bias in how students are assigned classrooms, so the administration, perhaps) ;).

If you are measuring voltages and see a distribution [2, 4, 4, 4, 5, 5, 7, 9], the standard deviation is 2. If you repeat the measurements 100 times and the values are [2, 4, 4, 4, 5, 5, 7, 9] * 100, the standard deviation is still 2.

OTOH, if those are 100 different samples of a set of 8 different, possibly related, voltages, then the standard deviation of the mean of those 100 observations of 8 voltages is now much smaller than 2 -- the mean value is very consistent (across those 100 "classrooms"). But to know the difference, I think you would need to provide the 100 means and standard deviations.

I think the most natural and common use case would be to have a bunch of samples of a single quantity, in which case the standard deviation is wanted. An option to divide by sqrt(len(samples)-ddof) would be OK.
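To make the distinction concrete, here is a small simulation (invented for illustration, with a Gaussian "population of trees"): the sample standard deviation stays near the population value as n grows, while the standard error of the mean shrinks roughly as 1/sqrt(n).

import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=5.0, scale=2.0, size=1_000_000)  # the "1 million trees"

for n in (100, 1_000, 10_000):
    sample = rng.choice(population, size=n, replace=False)
    std = sample.std()          # stays near 2 regardless of n
    sem = std / np.sqrt(n)      # shrinks as the sample grows
    print(f"n={n:6d}  mean={sample.mean():.3f}  std={std:.3f}  sem={sem:.3f}")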

@Myles244 (Author) commented Dec 23, 2024:

@jagerber48 I think I have to agree with you; clearly, the function is ambiguous, and perhaps it would be better not to include it.

Thanks, anyway.

P.S. Should I close the issue or leave it open?

@jagerber48 (Contributor):

I think you can leave it open for now so we can continue to consider it for at least a few days. While I like the idea of having this as a constructor on UFloat, I think it is too specific, and there are too many different things different users might want. I'm more sympathetic to the idea of a module with a variety of useful functions like this one. There they could be named and documented more specifically, e.g. ufloat_from_sample_sem, ufloat_from_sample_std, etc.

We could include e.g. those two constructor functions on UFloat. But in any case, my opinion is that this "utility" work should be held off until we are able to complete some of the "core" maintenance targets we have on uncertainties (re-architect the core classes, rework numpy integration, eliminate reliance on confusing monkey patching, etc.). In the lifetime of uncertainties this new group of maintainers is very new, so I think it makes sense to make changes, especially API changes, slowly as we learn.

@andrewgsavage (Contributor):

Yes, please leave this open.

The Wikipedia page mentions the ambiguity:

> When only a sample of data from a population is available, the term standard deviation of the sample or sample standard deviation can refer to either the above-mentioned quantity as applied to those data, or to a modified quantity that is an unbiased estimate of the population standard deviation (the standard deviation of the entire population).

So I agree with having explicit names like ufloat_from_sample_sem and ufloat_from_sample_std, or a required keyword.
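A minimal sketch of the required-keyword variant (the argument name and accepted values here are illustrative, not agreed on):

from math import sqrt
import statistics
from uncertainties import ufloat

def ufloat_from_sample(sample, *, statistic):
    mean = statistics.mean(sample)
    if statistic == "std":  # sample standard deviation, no Bessel correction
        return ufloat(mean, statistics.pstdev(sample))
    if statistic == "sem":  # standard error on the mean
        return ufloat(mean, statistics.stdev(sample) / sqrt(len(sample)))
    raise ValueError(f"unknown statistic {statistic!r}")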

@jagerber48 (Contributor):

@andrewgsavage to be clear, that snippet is from the Wikipedia page on standard deviation and is referring to the distinction between the sample standard deviation and an unbiased estimator for the population/distribution standard deviation.

Neither of these is the standard error on the mean, which is the uncertainty of an unbiased estimator of the population mean calculated using the sample mean.

So there are three possible options for what the std_dev attribute on the returned UFloat should be:

  • The sample standard deviation
  • An unbiased estimator for the population/distribution standard deviation calculated using ddof != 0
  • The standard error on the mean

I agree that these could be selected by using different functions or a required keyword argument.
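For the thread's running example, the three options give (values approximate; the SEM is computed here from the ddof=1 estimate, though conventions vary):

from math import sqrt
import statistics

samples = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(samples)

sample_std = statistics.pstdev(samples)   # option 1: 2.0
unbiased_std = statistics.stdev(samples)  # option 2: ~2.1381
std_err = unbiased_std / sqrt(n)          # option 3: ~0.7559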


If we want this stuff to be in uncertainties then here's my proposal:

  • We make a new module named utils.py (I don't think it should be named utils.py, we should try to come up with something better, just using that as a placeholder)
  • This module has one function called ufloat_from_sample_std_dev which accepts a ddof argument, matching the numpy API.
  • This module has another function called ufloat_from_sample_std_err. I don't think any other arguments are necessary other than the samples.

I prefer following numpy conventions rather than the built-in statistics module's conventions, because numpy is better documented than the statistics module. It looks like the statistics module is doing some guessing under the hood about what ddof should equal, but it doesn't really explain it.

These functions should all work as expected when numpy is unavailable. In fact, at first pass, maybe they should be implemented assuming numpy is not available. Then we can do benchmarking to see if numpy makes a measurable performance difference. If so, we can configure the code to use numpy if it is available.
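A rough pure-Python sketch of those two functions under this proposal (module placement and exact signatures are still open):

from math import sqrt
import statistics
from uncertainties import ufloat

def ufloat_from_sample_std_dev(sample, ddof=0):
    # match the numpy convention: divide the summed squared deviations by n - ddof
    n = len(sample)
    std = statistics.pstdev(sample) * sqrt(n / (n - ddof))
    return ufloat(statistics.mean(sample), std)

def ufloat_from_sample_std_err(sample):
    # standard error on the mean; no arguments besides the samples
    # (uses the ddof=1 estimate; conventions vary)
    n = len(sample)
    return ufloat(statistics.mean(sample), statistics.stdev(sample) / sqrt(n))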

I do NOT think this function should support numpy arrays, at least at first pass. We are still sorting out interoperability with numpy. I think there's a decent chance we'll do something the wrong way and have to redo it if we try to include numpy support for these helper functions. In other words: I could tolerate including helper functions to create UFloat, but not helper functions to create arrays of UFloat or UArray or something.


Other functions can be moved into utils.py (under whatever name it gets).

  • The functions to create sequences of UFloat from nominal values and a covariance or correlation matrix
  • The functions to calculate the covariance or correlation matrix from a sequence of UFloat
  • The ufloat_from_sample_std_dev function
  • The ufloat_from_sample_std_err function
  • The uncertainty-weighted mean calculation from "unumpy.average: init" (#265)

@jagerber48 (Contributor):

Note that one downside of this proposal is that it misses the "slick" opportunity to include e.g. UFloat.from_sample_std_dev. But this is OK. At this time we don't construct UFloats with the class constructor directly anyway; it is done with the ufloat() helper function. In the future we could consider moving away from ufloat and making UFloat the direct constructor, and at that point we could consider adding helper methods onto the class for alternative construction. But right now, since construction is done through a function, I think it makes sense for alternative construction to also be done through a function.

@newville (Member) commented:

@jagerber48 @Myles244 @andrewgsavage I have to admit that, even after re-visiting these comments, I am extremely uncomfortable using the standard deviation of the mean here.

If there is a function to convert a set/list/ndarray of values to a ufloat and no other information is provided, I think that it must use the mean and standard deviation of the values supplied. I would need a lot of convincing if ufloat_from_samples([2, 4, 4, 4, 5, 5, 7, 9]) does not give ufloat(5.0, 2.0).

I am OK with having a ddof argument with a default value of 0. The function should be close to:

def ufloat_from_samples(vals, ddof=0):
    if np:
        avals = np.array(vals)
        return ufloat(avals.mean(), avals.std(ddof=ddof))
    else:
        # statistics only supports ddof=0 (pstdev) or ddof=1 (stdev)
        std = statistics.pstdev(vals) if ddof == 0 else statistics.stdev(vals)
        return ufloat(statistics.mean(vals), std)
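For illustration, the intended behavior (repr formatting approximate):

>>> ufloat_from_samples([2, 4, 4, 4, 5, 5, 7, 9])
5.0+/-2.0
>>> ufloat_from_samples([2, 4, 4, 4, 5, 5, 7, 9], ddof=1)
5.0+/-2.1380899352993947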

A ufloat represents a mean or most likely value from a distribution and the standard deviation of that distribution, so that uncertainties (of the distribution!) can be propagated. Indeed, the docstrings and the docs for this code call the second quantity of a ufloat "the standard deviation" throughout, and the variable is called std_dev, an abbreviation of "standard deviation".

I do not see a need for a ufloat constructed from the standard error, or standard deviation of the mean. While the standard deviation of the mean is a useful statistic, it is not used for error propagation.

@jagerber48 (Contributor) commented Dec 31, 2024:

We're all in agreement now that it would not be appropriate for ufloat_from_samples([2, 4, 4, 4, 5, 5, 7, 9]) to return the standard error on the mean. We all agree that if users want the standard error on the mean they either need to use a different, more specifically named function, or they need to pass an obviously named argument. So I don't think you need to worry that ufloat_from_samples, without further qualification, will return the standard error on the mean.

Of course, the option remains to have no function that returns a ufloat whose std_dev is the standard error on the mean. But I say that if we do this then we should also exclude a function to return a ufloat whose std_dev is the sample standard deviation.

Successfully merging this pull request may close these issues: new ufloat_from_sample function