
Added ufloat_from_sample function #277

Open · wants to merge 21 commits into master
Conversation

@Myles244 commented Dec 16, 2024

I'm unsure how to write or run tests, so if someone could point me in the direction of a good resource, it would be much appreciated.

@andrewgsavage (Contributor) commented:

uncertainties uses pytest.

pip install ".[test]"      # install the test dependencies
pytest .                   # run the tests

See https://docs.pytest.org/en/stable/getting-started.html for a getting-started resource.

You'll want to add a test (or tests) similar to

def test_component_extraction():

testing some example calculations, using assert result_of_your_function == expected_result.
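For instance, a minimal sketch of such a test (the import path and the exact assertion are placeholders for whatever this PR ends up exposing):

from uncertainties.core import ufloat_from_sample  # placeholder import path

def test_ufloat_from_sample():
    # the mean is the uncontroversial half of the result
    result = ufloat_from_sample([2, 4, 4, 4, 5, 5, 7, 9])
    assert result.nominal_value == 5.0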

@Myles244 (Author) commented:

Thank you.
I have now completed the following:

  • checked that it still passes all tests.
  • written an appropriate test for this new function.

codecov bot commented Dec 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.52%. Comparing base (969324d) to head (82842e9).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #277      +/-   ##
==========================================
+ Coverage   96.50%   96.52%   +0.01%     
==========================================
  Files          16       16              
  Lines        1919     1926       +7     
==========================================
+ Hits         1852     1859       +7     
  Misses         67       67              
| Flag | Coverage Δ |
|------|------------|
| macos-latest-3.10 | 94.96% <100.00%> (+0.01%) ⬆️ |
| macos-latest-3.11 | 94.96% <100.00%> (+0.01%) ⬆️ |
| macos-latest-3.12 | 94.96% <100.00%> (+0.01%) ⬆️ |
| macos-latest-3.8 | 94.95% <100.00%> (+0.01%) ⬆️ |
| macos-latest-3.9 | 94.95% <100.00%> (+0.01%) ⬆️ |
| no-numpy | 74.50% <0.00%> (-0.28%) ⬇️ |
| ubuntu-latest-3.10 | 94.96% <100.00%> (+0.01%) ⬆️ |
| ubuntu-latest-3.11 | 94.96% <100.00%> (+0.01%) ⬆️ |
| ubuntu-latest-3.12 | 94.96% <100.00%> (+0.01%) ⬆️ |
| ubuntu-latest-3.8 | 94.95% <100.00%> (+0.01%) ⬆️ |
| ubuntu-latest-3.9 | 94.95% <100.00%> (+0.01%) ⬆️ |
| windows-latest-3.10 | 94.96% <100.00%> (+0.01%) ⬆️ |
| windows-latest-3.11 | 94.96% <100.00%> (+0.01%) ⬆️ |
| windows-latest-3.12 | 94.96% <100.00%> (+0.01%) ⬆️ |
| windows-latest-3.8 | 94.95% <100.00%> (+0.01%) ⬆️ |
| windows-latest-3.9 | 94.95% <100.00%> (+0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

@jagerber48 (Contributor) commented:

See my comments at #276. Not sure if the discussion should continue here or there. I'm initially wary about adding this in. I see the utility, but I wonder if uncertainties wants to get in the business of adding small helper functions like this or not. Also it needs to be made clear (via function names or documentation) that this function should only really be used with data that are known to be normally or approximately normally distributed.

@newville (Member) left a comment:

I like this, and the use of the built-in statistics module when numpy is not available.

Following the precedent from that statistics package, I recommend
ufloat_from_samples over ufloat_from_sample.
See https://docs.python.org/3/library/statistics.html#statistics.NormalDist.from_samples

I would say this is OK in core.py... we could consider moving it somewhere else if needed.

        otypes=[object],
    )(mean_value, error_on_mean)
else:
    msg={
@newville (Member):

Maybe just an f-string:

     msg = f"method='{method}' is not implemented"

if numpy is None:
    # if numpy is not present, use Python's statistics functions instead
    mean_value = stats_mean(sample)
    error_on_mean = stats_stdev(sample) / sqrt(len(sample) - 1)
@newville (Member):

I believe statistics.stdev already applies the 1/(len(samples) - 1) factor inside the square root, and it should not be applied again here.

But also, I think we will need to discuss/document this, and perhaps have an option for using the "-1" (Bessel's correction), perhaps by adding a ddof argument (default=0). I think the default should match numpy.std(), which divides the summed squared deviations by len(samples).
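For reference, a quick check of how the two libraries' defaults differ (output values approximate):

import statistics
import numpy as np

samples = [2, 4, 4, 4, 5, 5, 7, 9]
print(np.std(samples))             # 2.0     (divides summed squared deviations by N)
print(np.std(samples, ddof=1))     # ~2.1381 (Bessel's correction, N - 1)
print(statistics.pstdev(samples))  # 2.0     (matches numpy's default)
print(statistics.stdev(samples))   # ~2.1381 (matches ddof=1)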

@Myles244 (Author):

Yes, you are right. I mistakenly assumed that statistics.stdev behaved the same as numpy.std does without any optional arguments.

Adding ddof=0 as an optional parameter should be simple. For the non-numpy code I can use statistics.pstdev and then include the -ddof term in the denominator.
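A sketch of that approach (the helper name is hypothetical): rescale statistics.pstdev so the summed squared deviations are divided by n - ddof instead of n.

from math import sqrt
from statistics import pstdev

def _stdev_with_ddof(sample, ddof=0):
    # pstdev() divides the summed squared deviations by n;
    # rescale so they are divided by (n - ddof) instead.
    n = len(sample)
    return pstdev(sample) * sqrt(n / (n - ddof))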

else:
    sample_size = numpy.shape(sample)[axis]

    error_on_mean = numpy.std(sample, ddof=1, axis=axis) / numpy.sqrt(sample_size)
@newville (Member):

I think the plain ndarray.std() should be used by default, and ddof should default to 1, and that dividing by the sample size here is not correct.

@Myles244 (Author):

Sorry, I don't understand what you mean. The error on the mean is the standard deviation of the sample divided by the square root of the sample size.

@newville (Member):

Sorry, I meant ddof should default to 0, as in numpy.std().

Dividing by sqrt(len(samples)) gives the standard deviation of the mean, not the standard deviation of the value.
The standard deviation of the mean tells you how much you expect the mean value to change if you select a different set of samples/measurements from a larger pool of values.

For propagating uncertainties, you (usually, normally?) want the standard deviation of the value itself.

I think two good rules of thumb are "what does Wikipedia say?" and "what does numpy do?".
Taking the basic example at https://en.wikipedia.org/wiki/Standard_deviation#Basic_examples:

>>> import numpy as np
>>> samples = [2, 4, 4, 4, 5, 5, 7, 9]
>>> print(np.mean(samples), np.std(samples))
5.0 2.0
>>> print(np.mean(samples), np.std(samples, ddof=1))
5.0 2.13808993529939

which agrees with Wikipedia. I would be reluctant to have

>>> ufloat_from_samples([2, 4, 4, 4, 5, 5, 7, 9])

return anything except ufloat(5.0, 2.0) - that would require extra explanation and documentation.

An option to divide by sqrt(len(samples)-ddof) would be OK.

@Myles244 (Author) commented Dec 22, 2024:

Thank you, I understand now.

I originally intended the function to return the mean and standard error on the mean. This is because if I am measuring a constant value where my measurement apparatus has some random error, then my measurements are a sample from a normal distribution, where the best estimate for the value is the mean of the sample and the uncertainty is the uncertainty on the mean.

Perhaps it could be a method "from measurement". I think this would be clearer anyway.

@newville (Member):

@Myles244 if your samples are from a normal distribution, and you are measuring noisy values from an experimental measurement, then I think you do want to use the mean and the standard deviation (as from numpy.mean() and numpy.std()). Those give the average (and most likely, for normal distributions) and the appropriate measure of the variation observed.

The standard deviation of the mean is a useful quantity, but I think it is not what you want here. Let's take the mass of Christmas trees (and, just to keep things jolly, let's pretend those have Gaussian distributions), where the total population in the survey is 1 million trees. A random sample of 100 would be small, but would give a mean and standard deviation, and not be terrible. If you selected a different 100 samples, you would get a mean and standard deviation pretty consistent with the first sample of 100. If you increased the sample size to 1000, you would not expect the mean or the standard deviation to change by much from the sample of 100. But if you took a second sample of 1000, you would expect that mean value to be closer to the mean of the first 1000 than was the case for the two samples of 100.

Another classic way to think about it is with test scores across many classrooms: give the same test to all students in 10 classrooms of 25 students each. The standard deviation of the 250 scores tells you the variation between students. The standard deviation of the mean of each classroom tells you the variation between classrooms. One is a measure of the students; the other is a measure of the teachers (or a bias in how students are assigned classrooms, so the administration, perhaps) ;).

If you are measuring voltages and see a distribution [2, 4, 4, 4, 5, 5, 7, 9], the standard deviation is 2. If you repeat the measurements 100 times and the values are [2, 4, 4, 4, 5, 5, 7, 9] * 100, the standard deviation is still 2.

OTOH, if those are 100 different samples of a set of 8 different, possibly related, voltages, then the standard deviation of the mean of those 100 observations of 8 voltages is now much smaller than 2 -- the mean value is very consistent (across those 100 "classrooms"). But to know the difference, I think you would need to provide the 100 means and standard deviations.

I think the most natural and common use case would be to have a bunch of samples of a single quantity, in which case the standard deviation is wanted. An option to divide by sqrt(len(samples)-ddof) would be OK.
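To make the distinction concrete, here is a small simulation (invented for illustration, with a Gaussian "population of trees"): the sample standard deviation stays near the population value as n grows, while the standard error of the mean shrinks roughly as 1/sqrt(n).

import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=5.0, scale=2.0, size=1_000_000)  # the "1 million trees"

for n in (100, 1_000, 10_000):
    sample = rng.choice(population, size=n, replace=False)
    std = sample.std()          # stays near 2 regardless of n
    sem = std / np.sqrt(n)      # shrinks as the sample grows
    print(f"n={n:6d}  mean={sample.mean():.3f}  std={std:.3f}  sem={sem:.3f}")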

@Myles244 (Author) commented Dec 23, 2024:

@jagerber48 I think I have to agree with you; clearly, the function is ambiguous, and perhaps it would be better not to include it.

Thanks, anyway.

P.S. Should I close the issue or leave it open?

@jagerber48 (Contributor):

I think you can leave it open for now so we can continue to consider it for at least a few days. While I like the idea of having this as a constructor on UFloat, I think it is too specific, and there are too many different things different users might want. I'm more sympathetic to the idea of a module with a variety of useful functions like this one. There they could be named and documented more specifically, e.g. ufloat_from_sample_sem, ufloat_from_sample_std, etc.

We could include e.g. those two constructor functions on UFloat. But in any case, my opinion is that this "utility" work should be held off until we are able to complete some of the "core" maintenance targets we have on uncertainties (re-architect the core classes, rework numpy integration, eliminate reliance on confusing monkey patching, etc.). In the lifetime of uncertainties this new group of maintainers is very new, so I think it makes sense to make changes, especially API changes, slowly as we learn.

@andrewgsavage (Contributor):

Yes, please leave this open.

The Wikipedia page mentions the ambiguity:

> When only a sample of data from a population is available, the term standard deviation of the sample or sample standard deviation can refer to either the above-mentioned quantity as applied to those data, or to a modified quantity that is an unbiased estimate of the population standard deviation (the standard deviation of the entire population).

So I agree with having explicit names like ufloat_from_sample_sem and ufloat_from_sample_std, or a required keyword.
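A minimal sketch of the required-keyword variant (the argument name and accepted values here are illustrative, not agreed on):

from math import sqrt
import statistics
from uncertainties import ufloat

def ufloat_from_sample(sample, *, statistic):
    mean = statistics.mean(sample)
    if statistic == "std":  # sample standard deviation, no Bessel correction
        return ufloat(mean, statistics.pstdev(sample))
    if statistic == "sem":  # standard error on the mean
        return ufloat(mean, statistics.stdev(sample) / sqrt(len(sample)))
    raise ValueError(f"unknown statistic {statistic!r}")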

@jagerber48 (Contributor):

@andrewgsavage to be clear, that snippet is from the Wikipedia page on standard deviation and is referring to the distinction between the sample standard deviation and an unbiased estimator for the population/distribution standard deviation.

Neither of these is the standard error on the mean, which is the uncertainty of an unbiased estimator of the population mean calculated using the sample mean.

So there are three possible options for what the std_dev attribute on the returned UFloat should be:

  • The sample standard deviation
  • An unbiased estimator for the population/distribution standard deviation calculated using ddof != 0
  • The standard error on the mean

I agree that these could be selected by using different functions or a required keyword argument.
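For the thread's running example, the three options give (values approximate; the SEM is computed here from the ddof=1 estimate, though conventions vary):

from math import sqrt
import statistics

samples = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(samples)

sample_std = statistics.pstdev(samples)   # option 1: 2.0
unbiased_std = statistics.stdev(samples)  # option 2: ~2.1381
std_err = unbiased_std / sqrt(n)          # option 3: ~0.7559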


If we want this stuff to be in uncertainties then here's my proposal:

  • We make a new module named utils.py (I don't think it should be named utils.py, we should try to come up with something better, just using that as a placeholder)
  • This module has one function called ufloat_from_sample_std_dev which accepts a ddof argument, matching the numpy API.
  • This module has another function called ufloat_from_sample_std_err. I don't think any other arguments are necessary other than the samples.

I prefer following numpy conventions rather than the built-in statistics module's conventions, because numpy is better documented than the statistics module. It looks like the statistics module is doing some guessing under the hood about what ddof should equal, but it doesn't really explain it.

These functions should all work as expected when numpy is unavailable. In fact, at first pass, maybe they should be implemented assuming numpy is not available. Then we can do benchmarking to see if numpy makes a measurable performance difference. If so, we can configure the code to use numpy if it is available.
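A rough pure-Python sketch of those two functions under this proposal (module placement and exact signatures are still open):

from math import sqrt
import statistics
from uncertainties import ufloat

def ufloat_from_sample_std_dev(sample, ddof=0):
    # match the numpy convention: divide the summed squared deviations by n - ddof
    n = len(sample)
    std = statistics.pstdev(sample) * sqrt(n / (n - ddof))
    return ufloat(statistics.mean(sample), std)

def ufloat_from_sample_std_err(sample):
    # standard error on the mean; no arguments besides the samples
    # (uses the ddof=1 estimate; conventions vary)
    n = len(sample)
    return ufloat(statistics.mean(sample), statistics.stdev(sample) / sqrt(n))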

I do NOT think this function should support numpy arrays, at least at first pass. We are still sorting out interoperability with numpy. I think there's a decent chance we'll do something the wrong way and have to redo it if we try to include numpy support for these helper functions. In other words: I could tolerate including helper functions to create UFloat, but not helper functions to create arrays of UFloat or UArray or something.


Other functions can be moved into utils.py (under whatever name it gets).

  • The functions to create sequences of UFloat from nominal values and a covariance or correlation matrix
  • The functions to calculate the covariance or correlation matrix from a sequence of UFloat
  • The ufloat_from_sample_std_dev function
  • The ufloat_from_sample_std_err function
  • The uncertainty-weighted mean calculation from "unumpy.average: init" (#265)

@jagerber48 (Contributor):

Note that one downside of this proposal is that it misses the "slick" opportunity to include e.g. UFloat.from_sample_std_dev. But this is OK. At this time we don't construct UFloats with the class constructor directly anyway; it is done with the ufloat() helper function. In the future we could consider moving away from ufloat and making UFloat the direct constructor, and at that point we could consider adding helper methods onto the class for alternative construction. But right now, since construction is done through a function, I think it makes sense for alternative construction to also be done through a function.

@newville (Member) commented:

@jagerber48 @Myles244 @andrewgsavage I have to admit that, even after re-visiting these comments, I am extremely uncomfortable using the standard deviation of the mean here.

If there is a function to convert a set/list/ndarray of values to a ufloat and no other information is provided, I think that it must use the mean and standard deviation of the values supplied. I would need a lot of convincing if ufloat_from_samples([2, 4, 4, 4, 5, 5, 7, 9]) does not give ufloat(5.0, 2.0).

I am OK with having a ddof argument with a default value of 0. The function should be close to:

def ufloat_from_samples(vals, ddof=0):
    if np:
        avals = np.array(vals)
        return ufloat(avals.mean(), avals.std(ddof=ddof))
    else:
        # statistics only supports ddof=0 (pstdev) or ddof=1 (stdev)
        std = statistics.pstdev(vals) if ddof == 0 else statistics.stdev(vals)
        return ufloat(statistics.mean(vals), std)
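For illustration, the intended behavior (repr formatting approximate):

>>> ufloat_from_samples([2, 4, 4, 4, 5, 5, 7, 9])
5.0+/-2.0
>>> ufloat_from_samples([2, 4, 4, 4, 5, 5, 7, 9], ddof=1)
5.0+/-2.1380899352993947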

A ufloat represents a mean or most likely value from a distribution and the standard deviation of that distribution, so that uncertainties (of the distribution!) can be propagated. Indeed, the docstrings and the docs for this code call the second quantity of a ufloat "the standard deviation" throughout, and the variable is called std_dev, an abbreviation of "standard deviation".

I do not see a need for a ufloat constructed from the standard error, or standard deviation of the mean. While the standard deviation of the mean is a useful statistic, it is not used for error propagation.

@jagerber48 (Contributor) commented Dec 31, 2024:

We're all in agreement now that it would not be appropriate for ufloat_from_samples([2, 4, 4, 4, 5, 5, 7, 9]) to return the standard error on the mean. We all agree that if users want the standard error on the mean they either need to use a different, more specifically named function, or they need to pass an obviously named argument. So I don't think you need to worry that ufloat_from_samples, without further qualification, will return the standard error on the mean.

Of course, the option remains to have no function that returns a ufloat whose std_dev is the standard error on the mean. But I say that if we do this then we should also exclude a function to return a ufloat whose std_dev is the sample standard deviation.

Successfully merging this pull request may close these issues: new ufloat_from_sample function