Variance calculation gives biased results for samples #2

Open
ORBAT opened this issue Sep 7, 2015 · 3 comments

@ORBAT

ORBAT commented Sep 7, 2015

The current method of calculating variance (and, by extension, standard deviation) is intended for sets that form the whole population. When dealing with a sample, i.e. you pick n elements out of k and you don't know the mean of the whole population, you need to apply Bessel's correction: divide the sum of squared deviations by n-1 instead of n.
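
For reference, the two quantities being contrasted can be written as below. The notation is an assumption made for illustration and does not come from stats-lite: the x_i are the n values passed in and x̄ is their mean.

```latex
\sigma_n^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
\qquad\text{(population form, divide by } n\text{)}

s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
    = \frac{n}{n-1}\,\sigma_n^2
\qquad\text{(sample form with Bessel's correction)}
```

The second line also shows that the corrected value is just the uncorrected one rescaled by n/(n-1), so the correction is undefined for n = 1.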

ORBAT changed the title from "Variance and standard deviation for samples" to "Variance calculation gives biased results for samples" on Sep 7, 2015
@brycebaril
Owner

I wasn't familiar with Bessel's correction, though reading the Wikipedia article I saw:

  1. it looks like it only applies to a subset of variance calculations (i.e. those dealing with samples), and the input isn't necessarily a sample
  2. it comes with three caveats

How do you suggest we handle this? By adding additional methods for subsets, or perhaps by creating a subset-only version of this library?

@ORBAT
Author

ORBAT commented Sep 8, 2015

Yeah, you only need to apply the correction if you're dealing with a sample out of a larger population and you don't know the mean.

One of the caveats is that Bessel's correction will give you an unbiased variance when you have samples, but it won't give you an unbiased standard deviation: there is no general method for calculating an unbiased standard deviation in the first place. It does, however, correct some of the bias. There's also the question of which correction factor to use, but n-1 is good enough for most cases (and if someone needs something more sophisticated, it'll probably fall outside the scope of stats-lite anyhow).
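
The caveat above can be checked on a toy case. The sketch below is only an illustration: `mean` and `variance` are stand-in helpers written for this example (not the stats-lite implementations), and the population `[1, 2, 3, 4]` is made up. It enumerates every ordered sample of size 2 drawn with replacement and averages the estimates.

```javascript
// Stand-in helper for this example, not the stats-lite implementation.
function mean(vals) {
  var sum = 0
  for (var i = 0; i < vals.length; i++) sum += vals[i]
  return sum / vals.length
}

// Divides by n - 1 when sample is true, by n otherwise.
function variance(vals, sample) {
  var avg = mean(vals)
  var sumSq = 0
  for (var i = 0; i < vals.length; i++) sumSq += Math.pow(vals[i] - avg, 2)
  return sumSq / (sample ? vals.length - 1 : vals.length)
}

var population = [1, 2, 3, 4] // made-up data
var correctedVars = []
var uncorrectedVars = []
var correctedSds = []

// Enumerate every ordered sample of size 2 drawn with replacement.
for (var a = 0; a < population.length; a++) {
  for (var b = 0; b < population.length; b++) {
    var s = [population[a], population[b]]
    correctedVars.push(variance(s, true))
    uncorrectedVars.push(variance(s, false))
    correctedSds.push(Math.sqrt(variance(s, true)))
  }
}

var popVariance = variance(population, false)  // 1.25
var avgCorrectedVar = mean(correctedVars)      // 1.25, matches the population variance
var avgUncorrectedVar = mean(uncorrectedVars)  // 0.625, too small on average
var avgCorrectedSd = mean(correctedSds)        // about 0.884, below sqrt(1.25), about 1.118
```

In this toy case the n-1 variance averages out to the population variance exactly, while the square root of it still comes out low on average.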

A simple, backwards-compatible way of implementing this could be to have variance and stdev take an optional parameter sample (or bessel or whatever):

// Variance = average squared deviation from mean.
// If sample is true, vals represents a sample of a population, so Bessel's correction will be applied.
function variance(vals, sample) {
  vals = numbers(vals)
  var avg = mean(vals)
  var diffs = []
  for (var i = 0; i < vals.length; i++) {
    diffs.push(Math.pow((vals[i] - avg), 2))
  }
  var res = mean(diffs)
  if (sample) {
    // Rescale from dividing by n to dividing by n - 1.
    res *= vals.length / (vals.length - 1)
  }
  return res
}

// Standard Deviation = sqrt of variance.
// If sample is true, vals represents a sample of a population, so Bessel's correction will be applied
function stdev(vals, sample) {
  return Math.sqrt(variance(vals, sample))
}
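
To see how the optional parameter would behave from the caller's side, here is a self-contained sketch. It assumes a stand-in `mean` helper and leaves out the `numbers()` filtering step, so it is not the library's code; the data array is made up.

```javascript
// Stand-in for the library's mean; assumes vals is already an array of numbers.
function mean(vals) {
  var sum = 0
  for (var i = 0; i < vals.length; i++) sum += vals[i]
  return sum / vals.length
}

// Same logic as the proposal above, minus the numbers() filtering step.
function variance(vals, sample) {
  var avg = mean(vals)
  var diffs = []
  for (var i = 0; i < vals.length; i++) {
    diffs.push(Math.pow(vals[i] - avg, 2))
  }
  var res = mean(diffs)
  if (sample) {
    res *= vals.length / (vals.length - 1)
  }
  return res
}

function stdev(vals, sample) {
  return Math.sqrt(variance(vals, sample))
}

var data = [2, 4, 4, 4, 5, 5, 7, 9] // made-up data, mean 5

var populationVariance = variance(data)    // 4, existing callers are unaffected
var sampleVariance = variance(data, true)  // 32 / 7, about 4.571
var populationSd = stdev(data)             // 2
var singleElement = variance([3], true)    // NaN: 0 * (1 / 0)
```

Omitting the second argument keeps the current behaviour, which is what makes the change backwards-compatible. A one-element sample yields NaN rather than 0, since the corrected variance is undefined for n = 1.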

@brycebaril
Owner

I'm usually not a huge fan of polymorphic functions in Node where optimization matters, because of the way V8 deoptimizes them.

That said, I don't know how much of a concern it is in this case, because the same application would have to call it both as variance(vals) and as variance(vals, true) to cause a deopt. I don't know how likely that is to happen, and even then the user could avoid the penalty by calling variance(vals, false) in the first case...
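
For comparison, the "additional methods" route mentioned earlier in the thread would keep every function at a fixed arity. The sketch below is one hypothetical shape for it: `sampleVariance`, `sampleStdev` and `sumOfSquaredDeviations` are names invented here, and `mean` is a stand-in helper, so none of this is existing stats-lite API.

```javascript
// Stand-in helper for this example.
function mean(vals) {
  var sum = 0
  for (var i = 0; i < vals.length; i++) sum += vals[i]
  return sum / vals.length
}

// Hypothetical shared helper: sum of squared deviations from the mean.
function sumOfSquaredDeviations(vals) {
  var avg = mean(vals)
  var total = 0
  for (var i = 0; i < vals.length; i++) total += Math.pow(vals[i] - avg, 2)
  return total
}

// Population variance: divide by n (the existing behaviour).
function variance(vals) {
  return sumOfSquaredDeviations(vals) / vals.length
}

// Hypothetical name: sample variance with Bessel's correction, divide by n - 1.
function sampleVariance(vals) {
  return sumOfSquaredDeviations(vals) / (vals.length - 1)
}

function stdev(vals) {
  return Math.sqrt(variance(vals))
}

// Hypothetical name: square root of the corrected variance.
function sampleStdev(vals) {
  return Math.sqrt(sampleVariance(vals))
}

var data = [2, 4, 4, 4, 5, 5, 7, 9] // made-up data, mean 5
var popVar = variance(data)         // 4
var sampVar = sampleVariance(data)  // 32 / 7, about 4.571
```

The trade-off is a larger API surface in exchange for every function always being called with the same argument shape.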

Will think about it.

In other news, I just published v2.0.0 of this module with support for multi-modal distributions in mode, but at the same time made it require Node.js v4.0.0+ (for ES6 Sets), so that might affect your ability to use a modified variance function right away.
