Fit a Sample To a Distribution

David Wright edited this page Apr 28, 2018 · 3 revisions

Suppose you have a bunch of numbers (heights, failure times, whatever) that you believe are distributed according to a probability distribution, but you don't know the parameters of that distribution. How do you go about finding those parameters?

Here's one thing not to do (and that I, as a non-statistician scientist, did in the past without knowing better): bin the values into a histogram, estimate the error bar on each bin count as sqrt(n), use the parameterized PDF at each bin mid-point as a model prediction for the bin count, and do a least-squares fit to find the parameters that best reproduce the bin counts. Given enough data, this procedure will converge to the correct parameter values, but it is onerous, inefficient, dependent on binning choices, and mis-weights some bins.

The right thing to do is use maximum likelihood estimation to find the fit parameters directly from the sample data, without binning. Meta.Numerics can do this for you.

Let's generate some synthetic Weibull-distributed data to work with:

using System;
using System.Linq;
using System.Collections.Generic;
using Meta.Numerics.Distributions;

// Draw 500 values from a Weibull distribution with scale 3.0 and shape 1.5.
Random rng = new Random(7);
WeibullDistribution distribution = new WeibullDistribution(3.0, 1.5);
List<double> sample = distribution.GetRandomValues(rng, 500).ToList();

How do I find the parameters that fit my distribution?

Here's some code that finds the best fit Weibull and lognormal parameters for our data:

using Meta.Numerics.Statistics;

WeibullFitResult weibull = sample.FitToWeibull();
Console.WriteLine($"Best fit scale: {weibull.Scale}");
Console.WriteLine($"Best fit shape: {weibull.Shape}");
Console.WriteLine($"Probability of fit: {weibull.GoodnessOfFit.Probability}");

LognormalFitResult lognormal = sample.FitToLognormal();
Console.WriteLine($"Best fit mu: {lognormal.Mu}");
Console.WriteLine($"Best fit sigma: {lognormal.Sigma}");
Console.WriteLine($"Probability of fit: {lognormal.GoodnessOfFit.Probability}");

Notice that the Weibull fit parameters agree, within uncertainties, with the Weibull parameters which were used to generate the data. Notice also that, even though the Weibull and lognormal distribution shapes are similar, our goodness-of-fit tests indicate that the Weibull fits much better.
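If you want to check this agreement numerically rather than by eye, you can pull the central value and the uncertainty of an estimate apart. This is a sketch, assuming each fitted parameter is exposed as an UncertainValue with Value and Uncertainty properties:

```csharp
using System;
using System.Linq;
using System.Collections.Generic;
using Meta.Numerics.Distributions;
using Meta.Numerics.Statistics;

Random rng = new Random(7);
WeibullDistribution distribution = new WeibullDistribution(3.0, 1.5);
List<double> sample = distribution.GetRandomValues(rng, 500).ToList();

WeibullFitResult weibull = sample.FitToWeibull();

// Check whether the true shape (1.5) lies within two standard errors
// of the fitted shape.
double residual = Math.Abs(weibull.Shape.Value - 1.5);
bool consistent = residual < 2.0 * weibull.Shape.Uncertainty;
Console.WriteLine($"Shape consistent with 1.5: {consistent}");
```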

For which distributions do dedicated methods exist?

Here is the list as of the writing of this tutorial: Bernoulli, Beta, exponential, Gamma, Gumbel, lognormal, normal, Rayleigh, Wald (aka inverse normal), Weibull.
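Each of these fits follows the same pattern as the Weibull fit above. For example, assuming a FitToNormal extension method whose result exposes Mean and StandardDeviation, a normal fit would look like this:

```csharp
using System;
using System.Linq;
using System.Collections.Generic;
using Meta.Numerics.Distributions;
using Meta.Numerics.Statistics;

// Generate a normal sample and recover its parameters.
Random rng = new Random(7);
NormalDistribution normal = new NormalDistribution(2.0, 0.5);
List<double> data = normal.GetRandomValues(rng, 500).ToList();

NormalFitResult fit = data.FitToNormal();
Console.WriteLine($"Best fit mean: {fit.Mean}");
Console.WriteLine($"Best fit standard deviation: {fit.StandardDeviation}");
```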

What if there isn't a dedicated method for my distribution?

No problem. As long as you can write a factory method that, given a parameter dictionary, produces a ContinuousDistribution, Meta.Numerics can do maximum likelihood estimation to find the best-fit parameters. Pretend for a moment that there isn't a dedicated Weibull fit method. Here is the code that you would write to do a Weibull fit:

var result = sample.MaximumLikelihoodFit(
    parameters => new WeibullDistribution(parameters["Scale"], parameters["Shape"]),
    new Dictionary<string, double>() { { "Scale", 1.0 }, { "Shape", 1.0 } }
);
foreach(Parameter parameter in result.Parameters) {
    Console.WriteLine($"{parameter.Name} = {parameter.Estimate}");
}
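For instance, the Cauchy distribution is not in the list of dedicated methods above. Assuming Meta.Numerics' CauchyDistribution constructor takes a location and a scale, you could fit one with the same factory pattern:

```csharp
using System;
using System.Linq;
using System.Collections.Generic;
using Meta.Numerics.Distributions;
using Meta.Numerics.Statistics;

Random rng = new Random(7);
List<double> data = new CauchyDistribution(1.0, 2.0).GetRandomValues(rng, 500).ToList();

// The dictionary supplies both the parameter names and their starting guesses.
var cauchyFit = data.MaximumLikelihoodFit(
    parameters => new CauchyDistribution(parameters["Location"], parameters["Scale"]),
    new Dictionary<string, double>() { { "Location", 0.0 }, { "Scale", 1.0 } }
);
foreach (Parameter parameter in cauchyFit.Parameters) {
    Console.WriteLine($"{parameter.Name} = {parameter.Estimate}");
}
```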

What exactly do the error estimates mean?

They are estimates of the standard deviation of the distribution of estimates that would be obtained if many independent samples of the same size were produced and fit.
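One way to see this concretely is to simulate it: generate many samples of the same size, fit each one, and compare the spread of the estimates to the uncertainty reported by any single fit. A minimal sketch, assuming Shape is an UncertainValue with a Value property:

```csharp
using System;
using System.Linq;
using System.Collections.Generic;
using Meta.Numerics.Distributions;
using Meta.Numerics.Statistics;

Random rng = new Random(1);
WeibullDistribution generator = new WeibullDistribution(3.0, 1.5);

// Fit 100 independent samples of 500 points each and collect the
// shape estimates.
List<double> shapes = new List<double>();
for (int i = 0; i < 100; i++) {
    List<double> s = generator.GetRandomValues(rng, 500).ToList();
    shapes.Add(s.FitToWeibull().Shape.Value);
}

// The standard deviation of these estimates should be close to the
// uncertainty reported by a single fit at this sample size.
double mean = shapes.Average();
double spread = Math.Sqrt(shapes.Sum(x => (x - mean) * (x - mean)) / shapes.Count);
Console.WriteLine($"Observed spread of shape estimates: {spread}");
```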

What exactly do the goodness-of-fit tests mean?

They are Kolmogorov-Smirnov tests of the best-fit distribution against the input data. They are not corrected for the fact that the distributions were produced from the same data, so the P-values are likely to be inflated for small samples. If you get a barely acceptable P-value with a small sample, it's likely that the fitted distribution does not, in fact, perfectly describe your data.
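For comparison, you can run a Kolmogorov-Smirnov test against a distribution whose parameters were fixed in advance rather than fitted, in which case no correction is needed. This sketch assumes a KolmogorovSmirnovTest extension method that takes a ContinuousDistribution and returns a test result with a Probability property:

```csharp
using System;
using System.Linq;
using System.Collections.Generic;
using Meta.Numerics.Distributions;
using Meta.Numerics.Statistics;

Random rng = new Random(7);
WeibullDistribution truth = new WeibullDistribution(3.0, 1.5);
List<double> sample = truth.GetRandomValues(rng, 500).ToList();

// Test against the true generating distribution; since its parameters
// were not derived from the data, this P-value is not inflated.
TestResult ks = sample.KolmogorovSmirnovTest(truth);
Console.WriteLine($"P-value: {ks.Probability}");
```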

Are the parameter estimates unbiased and efficient?

Since they are maximum likelihood estimates, they are at least asymptotically unbiased and efficient.

In cases where the finite-sample-size bias is known, we typically correct for it in our dedicated methods.