Association

David Wright edited this page Apr 16, 2018 · 3 revisions

Suppose you have measured values of two variables for each data point, and you want to know whether there is some statistically significant association between the two variables. Does y tend to increase (or decrease) when x increases, or can the observed pattern be attributed to random chance? Tests of association answer this question.

We will work with the following example data:

double[] x = new double[] {-0.58, 0.92, 1.41, 1.62, 2.72, 3.14 };
double[] y = new double[] {1.00, 0.00, 2.00, 16.00, 18.0, 20.0 };

Note that these data points are paired, e.g. the x=1.41 measurement goes with the y=2.00 measurement, even though they are in separate collections.

Pearson (Linear) Correlation

The Pearson test of linear correlation measures how well a line fits the data.
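For reference, the statistic behind this test is the usual sample correlation coefficient. The formula below is the standard textbook definition, given here as background rather than taken from the library's source; N is the number of paired points, and x̄ and ȳ denote the sample means.

```latex
r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}},
\qquad -1 \le r \le 1
```

Values of r near +1 or -1 mean the points lie close to a straight line; values near 0 mean there is no linear trend.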

using System;
using Meta.Numerics.Statistics;

TestResult pearson = Bivariate.PearsonRTest(x, y);
Console.WriteLine($"Pearson {pearson.Statistic.Name} = {pearson.Statistic.Value}");
Console.WriteLine($"{pearson.Type} P = {pearson.Probability}");

If a linear association does exist in the underlying population, the Pearson test is very good at finding it (i.e. its P-value will drop below a critical threshold at a smaller sample size than for other tests of association). If an association exists but is non-linear, the Pearson test is less effective at finding it (i.e. it gives a higher P-value than other tests of association).

The P-value computation for the null hypothesis of no association depends on the assumption that the data are normally distributed. If your data are non-normal and you want a reliable P-value, you should use a different test of association.

The Pearson test is the fastest test of association, executing in O(N) time and requiring no auxiliary memory.

Spearman (Rank-Order) Correlation

The Spearman test measures the linear correlation of the ranks of the values, rather than the values themselves.
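As a hand check, the standard no-ties shortcut for ρ can be applied to the example data. This is the textbook formula, not necessarily how the library computes the statistic; d_i is the difference between the rank of x_i and the rank of y_i. The x values are already in increasing order, so their ranks are 1 through 6, while the y values have ranks 2, 1, 3, 4, 5, 6.

```latex
\rho = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N (N^2 - 1)}
% Example data: d = (-1, 1, 0, 0, 0, 0), so \sum d_i^2 = 2 and N = 6.
\rho = 1 - \frac{6 \cdot 2}{6 \cdot 35} = \frac{33}{35} \approx 0.943
```

Only the first two points are out of rank order, which is why ρ is close to 1 even though the relationship is far from linear.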

TestResult spearman = Bivariate.SpearmanRhoTest(x, y);
Console.WriteLine($"Spearman {spearman.Statistic.Name} = {spearman.Statistic.Value}");
Console.WriteLine($"{spearman.Type} P = {spearman.Probability}");

Notice that the Spearman (and Kendall) tests do a much better job (i.e. show a much lower P-value) at detecting the highly non-linear association in our example data than the Pearson linear test.

Computing ρ requires O(N ln N) operations and O(N) auxiliary memory. For small sample sizes, computing the null distribution (and hence the P-value) is also quite expensive. In exchange, the Spearman test has a significant advantage over the Pearson test: its P-value does not depend on the distribution of the data (i.e. the test is non-parametric).

Kendall (Concordance) Association

The Kendall test measures how often a change in one variable is associated with a same-sign (vs. opposite-sign) change in the other.
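As a hand check, the statistic can be worked out for the example data. The formula below is the standard textbook definition for data without ties, not taken from the library's source; C is the number of concordant pairs (both variables change in the same direction) and D is the number of discordant pairs.

```latex
\tau = \frac{C - D}{N (N - 1) / 2}
% Example data: N = 6 gives 15 pairs. Only the pair formed by the first two
% points is discordant (x rises from -0.58 to 0.92 while y falls from 1.00 to 0.00).
\tau = \frac{14 - 1}{15} = \frac{13}{15} \approx 0.867
```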

TestResult kendall = Bivariate.KendallTauTest(x, y);
Console.WriteLine($"Kendall {kendall.Statistic.Name} = {kendall.Statistic.Value}");
Console.WriteLine($"{kendall.Type} P = {kendall.Probability}");

Notice that the Kendall (and Spearman) tests do a much better job (i.e. show a much lower P-value) at detecting the highly non-linear association in our example data than the Pearson linear test.

Computing τ requires O(N^2) operations. For small sample sizes, computing the null distribution (and hence the P-value) is also quite expensive. In exchange, the Kendall test has a significant advantage over the Pearson test: its P-value does not depend on the distribution of the data (i.e. the test is non-parametric).

What if my data are not in arrays?

No problem. These methods all accept IReadOnlyList<double>, so you can give them arrays, lists, and other data structures such as the frame columns of the Meta.Numerics.Data framework.
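As a minimal sketch, here is the Spearman test run on the same example data held in List<double> instead of arrays. The variable names xs and ys are illustrative only, and the snippet assumes the Meta.Numerics package is referenced by the project.

```csharp
using System;
using System.Collections.Generic;
using Meta.Numerics.Statistics;

// Same example data as above, held in lists instead of arrays.
List<double> xs = new List<double> { -0.58, 0.92, 1.41, 1.62, 2.72, 3.14 };
List<double> ys = new List<double> { 1.00, 0.00, 2.00, 16.00, 18.00, 20.00 };

// List<double> implements IReadOnlyList<double>, so it can be passed directly.
TestResult spearman = Bivariate.SpearmanRhoTest(xs, ys);
Console.WriteLine($"Spearman {spearman.Statistic.Name} = {spearman.Statistic.Value}");
Console.WriteLine($"{spearman.Type} P = {spearman.Probability}");
```

The values in the two collections are still paired by index, so both must list the data points in the same order.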

Remind me again to graph my data

[Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet) is a great illustration of the fact that the many possible varieties of association cannot all be captured by a single test statistic.
