Association
Suppose you have measured values of two variables for each data point, and you want to know whether there is some statistically significant association between the two variables. Does y tend to increase (or decrease) when x increases, or can the observed pattern be attributed to random chance? Tests of association answer this question.
We will work with the following example data:
```csharp
double[] x = new double[] { -0.58, 0.92, 1.41, 1.62, 2.72, 3.14 };
double[] y = new double[] { 1.00, 0.00, 2.00, 16.00, 18.0, 20.0 };
```
Note that these data points are paired, e.g. the x=1.41 measurement goes with the y=2.00 measurement, even though they are in separate collections.
The Pearson test of linear correlation measures how well a line fits the data.
```csharp
using System;
using Meta.Numerics.Statistics;

TestResult pearson = Bivariate.PearsonRTest(x, y);
Console.WriteLine($"Pearson {pearson.Statistic.Name} = {pearson.Statistic.Value}");
Console.WriteLine($"{pearson.Type} P = {pearson.Probability}");
```
If a linear association does exist in the underlying population, the Pearson test is very good at finding it (i.e. the P-value will drop below a critical threshold at a smaller sample size than for other tests of association). If an association exists but is non-linear, the Pearson test is less effective at finding it (i.e. it tends to give a higher P-value than other tests of association).
The P-value computation for the null hypothesis of no association depends on the assumption that the data are normally distributed. If your data are non-normal and you want a reliable P-value, you should use a different test of association.
The Pearson test is the fastest test of association, executing in O(N) time and requiring no auxiliary memory.
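To make the O(N) claim concrete, here is a minimal sketch of a single-pass computation of the correlation coefficient r on the example data. It is written as a C# top-level program and does not use Meta.Numerics; the helper name `PearsonR` is hypothetical, and this is not necessarily how the library computes the statistic. It produces only r, not the P-value.

```csharp
using System;
using System.Collections.Generic;

double[] x = { -0.58, 0.92, 1.41, 1.62, 2.72, 3.14 };
double[] y = { 1.00, 0.00, 2.00, 16.00, 18.0, 20.0 };

// Hypothetical helper (not part of Meta.Numerics): one pass over the
// paired data, keeping only five running quantities, so it needs O(N)
// time and no auxiliary arrays.
static double PearsonR(IReadOnlyList<double> xs, IReadOnlyList<double> ys)
{
    if (xs.Count != ys.Count)
        throw new ArgumentException("xs and ys must have the same length.");

    double meanX = 0.0, meanY = 0.0;       // running means
    double sxx = 0.0, syy = 0.0, sxy = 0.0; // running sums of squared/cross deviations

    for (int i = 0; i < xs.Count; i++)
    {
        // Deviations from the means *before* this point is included.
        double dx = xs[i] - meanX;
        double dy = ys[i] - meanY;

        meanX += dx / (i + 1);
        meanY += dy / (i + 1);

        // Welford-style updates: old deviation times new deviation.
        sxx += dx * (xs[i] - meanX);
        syy += dy * (ys[i] - meanY);
        sxy += dx * (ys[i] - meanY);
    }

    return sxy / Math.Sqrt(sxx * syy);
}

double r = PearsonR(x, y);
Console.WriteLine($"r = {r:F4}");
```

For the example data this gives r of about 0.83. The running-mean updates avoid subtracting two large sums, which is the usual reason to prefer them over accumulating raw sums of squares.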
The Spearman test measures the linear correlation of the ranks of the values, rather than the values themselves.
```csharp
TestResult spearman = Bivariate.SpearmanRhoTest(x, y);
Console.WriteLine($"Spearman {spearman.Statistic.Name} = {spearman.Statistic.Value}");
Console.WriteLine($"{spearman.Type} P = {spearman.Probability}");
```
Notice that the Spearman (and Kendall) tests do a much better job (i.e. show a much lower P-value) at detecting the highly non-linear association in our example data than the Pearson linear test.
The computation of ρ requires O(N ln N) operations and O(N) auxiliary memory. For small sample sizes, the computation of the null distribution (i.e. the P-value) is also quite onerous, but the Spearman test has the significant advantage over the Pearson test that its null distribution does not depend on the distribution of the data (i.e. it is non-parametric).
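The following sketch shows where the O(N ln N) cost and O(N) memory come from: each variable is sorted once to assign ranks, and ρ is then computed from the rank differences. It is a standalone C# top-level program with hypothetical helper names (`Ranks`, `SpearmanRho`), it assumes there are no tied values, and it is an illustration rather than the library's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

double[] x = { -0.58, 0.92, 1.41, 1.62, 2.72, 3.14 };
double[] y = { 1.00, 0.00, 2.00, 16.00, 18.0, 20.0 };

// Hypothetical helper: rank of each value, 1 = smallest.
// Assumes no ties. Sorting an index array costs O(N ln N) time
// and O(N) extra memory.
static int[] Ranks(IReadOnlyList<double> values)
{
    int[] order = Enumerable.Range(0, values.Count).ToArray();
    Array.Sort(order, (a, b) => values[a].CompareTo(values[b]));

    int[] ranks = new int[values.Count];
    for (int position = 0; position < order.Length; position++)
    {
        ranks[order[position]] = position + 1;
    }
    return ranks;
}

// Hypothetical helper: Spearman's rho from rank differences.
// The closed form 1 - 6 * sum(d^2) / (n (n^2 - 1)) is valid only
// when there are no ties.
static double SpearmanRho(IReadOnlyList<double> xs, IReadOnlyList<double> ys)
{
    if (xs.Count != ys.Count)
        throw new ArgumentException("xs and ys must have the same length.");

    int n = xs.Count;
    int[] rankX = Ranks(xs);
    int[] rankY = Ranks(ys);

    double sumSquaredDifferences = 0.0;
    for (int i = 0; i < n; i++)
    {
        double d = rankX[i] - rankY[i];
        sumSquaredDifferences += d * d;
    }
    return 1.0 - 6.0 * sumSquaredDifferences / (n * ((double)n * n - 1.0));
}

double rho = SpearmanRho(x, y);
Console.WriteLine($"rho = {rho:F4}");
```

In the example data only the first two points are out of rank order, so the sum of squared rank differences is 2 and ρ = 1 - 12/210 = 33/35.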
The Kendall test measures how often a change in one variable is associated with a same-sign (vs. opposite-sign) change in the other.
```csharp
TestResult kendall = Bivariate.KendallTauTest(x, y);
Console.WriteLine($"Kendall {kendall.Statistic.Name} = {kendall.Statistic.Value}");
Console.WriteLine($"{kendall.Type} P = {kendall.Probability}");
```
Notice that the Kendall (and Spearman) tests do a much better job (i.e. show a much lower P-value) at detecting the highly non-linear association in our example data than the Pearson linear test.
The computation of τ requires O(N^2) operations. For small sample sizes, the computation of the null distribution (i.e. the P-value) is also quite onerous, but the Kendall test has the significant advantage over the Pearson test that its null distribution does not depend on the distribution of the data (i.e. it is non-parametric).
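The O(N^2) cost comes from examining every pair of data points. A minimal sketch of that pair counting, as a standalone C# top-level program, is given here. The helper name `KendallTau` is hypothetical, the sketch computes the simple tau-a variant in which tied pairs count as neither concordant nor discordant, and it is not claimed to match the library's implementation.

```csharp
using System;
using System.Collections.Generic;

double[] x = { -0.58, 0.92, 1.41, 1.62, 2.72, 3.14 };
double[] y = { 1.00, 0.00, 2.00, 16.00, 18.0, 20.0 };

// Hypothetical helper: Kendall's tau-a by direct pair counting.
// Every one of the N (N - 1) / 2 pairs is visited once.
static double KendallTau(IReadOnlyList<double> xs, IReadOnlyList<double> ys)
{
    if (xs.Count != ys.Count)
        throw new ArgumentException("xs and ys must have the same length.");

    int n = xs.Count;
    int concordant = 0;
    int discordant = 0;

    for (int i = 0; i < n; i++)
    {
        for (int j = i + 1; j < n; j++)
        {
            // +1 if x and y change in the same direction between the
            // two points, -1 if in opposite directions, 0 if tied.
            int sign = Math.Sign(xs[j] - xs[i]) * Math.Sign(ys[j] - ys[i]);
            if (sign > 0) concordant++;
            else if (sign < 0) discordant++;
        }
    }

    double pairCount = 0.5 * n * (n - 1);
    return (concordant - discordant) / pairCount;
}

double tau = KendallTau(x, y);
Console.WriteLine($"tau = {tau:F4}");
```

Of the 15 pairs in the example data, 14 are concordant and 1 is discordant (the first two points), so τ = 13/15.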
If your data are not stored in arrays, that is no problem. These APIs all accept IReadOnlyList<double>, so you can give them arrays, lists, and other data structures such as the frame columns of the Meta.Numerics.Data framework.
[Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet) is a great illustration of the fact that the many possible varieties of association cannot all be captured by a single test statistic.