From 34e6f034502df34cb697cecb2f73f6e790327f3c Mon Sep 17 00:00:00 2001 From: Mirko Bunse Date: Mon, 12 Dec 2022 11:10:16 +0100 Subject: [PATCH] Discuss Holm and Bonferroni in the documentation --- docs/src/index.md | 22 +++++++++++++++++----- docs/src/python-wrapper.md | 4 +++- 2 files changed, 20 insertions(+), 6 deletions(-) diff --git a/docs/src/index.md b/docs/src/index.md index 1da22d9..0bca4f9 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -1,17 +1,25 @@ # [CriticalDifferenceDiagrams.jl](@id Home) -Critical difference (CD) diagrams are a powerful tool to compare outcomes of multiple treatments over multiple observations. For instance, in machine learning research we often compare the performance (outcome) of multiple methods (treatments) over multiple data sets (observations). This Julia package generates Tikz code to produce publication-ready vector graphics. A wrapper for Python is also available. +Critical difference (CD) diagrams are a powerful tool to compare outcomes of multiple treatments over multiple observations. For instance, in machine learning research we often compare the performance (i.e., outcome) of multiple methods (i.e., treatments) over multiple data sets (i.e., observations). This Julia package generates Tikz code to produce publication-ready vector graphics. A [wrapper for Python](python-wrapper/) is also available. ## Reading a CD diagram -Let's take a look at the treatments `clf1` to `clf5`. Their position represents their mean ranks across all outcomes of the observations, where low ranks indicate that a treatment wins more often than its competitors with higher ranks. Two or more treatments are connected with each other if we can not tell their outcomes apart, in the sense of statistical significance. For the above example, we can not tell from the data whether `clf3` and `clf5` are actually different from each other. We can tell, however, that both of them are different from all of the other treatments. 
This example above is adapted from https://github.com/hfawaz/cd-diagram. +Let's take a look at the treatments `clf1` to `clf5`. Their positions represent their mean ranks across all outcomes of the observations, where low ranks indicate that a treatment wins more often than its competitors with higher ranks. Two or more treatments are connected with each other if we can not tell their outcomes apart, in the sense of statistical significance. For the above example, we can not tell from the data whether `clf3` and `clf5` are actually different from each other. We can tell, however, that both of them are different from all of the other treatments. The example above is adapted from [github.com/hfawaz/cd-diagram](https://github.com/hfawaz/cd-diagram). ```@raw html assets/example.svg ``` -A diagram like the one above concisely represents multiple hypothesis tests that are conducted over the observed outcomes. Before anything is plotted at all, the Friedman test tells us whether there are significant differences at all. If this test fails, we have not sufficient data to tell any of the treatments apart and we must abort. If, however, the test sucessfully rejects this possibility we can proceed with the post-hoc analysis. In this second step, a Wilcoxon signed-rank test tells us whether each pair of treatments exhibits a significant difference. Since we are testing multiple hypotheses, we must adjust the Wilcoxon test with Holm's method. For each group of treatments which we can not distinguish from the Holm-adjusted Wilcoxon test, we add a thick line to the diagram. +### Hypothesis testing + +A diagram like the one above concisely represents multiple hypothesis tests that are conducted over the observed outcomes. Before anything is plotted, the *Friedman test* tells us whether there are any significant differences at all. If this test fails, we do not have sufficient data to tell any of the treatments apart and we must abort. 
If, however, the test successfully rejects this possibility, we can proceed with the post-hoc analysis. In this second step, a *Wilcoxon signed-rank test* tells us whether each pair of treatments exhibits a significant difference. + +### Multiple testing + +Since we are testing multiple hypotheses, we must *adjust* the Wilcoxon test with Holm's method or with Bonferroni's method. For each group of treatments that we can not distinguish with the Holm-adjusted (or Bonferroni-adjusted) Wilcoxon test, we add a thick line to the diagram. + +Whether we choose Holm's method or Bonferroni's method for the adjustment depends on our personal requirements. Holm's method has the advantage of greater statistical power than Bonferroni's method, i.e., this adjustment is capable of rejecting more null hypotheses that indeed should be rejected. However, its disadvantage is that the rejection of each null hypothesis depends on the outcomes of the other hypothesis tests. If this property is not desired, one should instead use Bonferroni's method, which ensures that each pairwise hypothesis test is independent of all others. ## Getting started @@ -37,7 +45,9 @@ plot = CriticalDifferenceDiagrams.plot( :dataset_name, # the name of the observation column :accuracy; # the name of the outcome column maximize_outcome=true, # compute ranks for minimization (default) or maximization - title="CriticalDifferenceDiagrams.jl" # give an optional title + title="CriticalDifferenceDiagrams.jl", # give an optional title + alpha=0.05, # the significance level (default: 0.05) + adjustment=:holm # :holm (default) or :bonferroni ) # configure the preamble of PGFPlots.jl (optional) @@ -52,7 +62,9 @@ PGFPlots.save("example.svg", plot) ## Cautions -The hypothesis tests underneath the CD diagram do not account for variances of the outcomes. It is therefore important that these outcomes are "reliable" in the sense that each of them is obtained from a sufficiently large sample. 
Ideally, they come from a cross validation or from a repeated stratified split. Moreover, all treatments must have been evaluated on the same set of observations. +The hypothesis tests underneath the CD diagram do not account for variances of the outcomes. It is therefore important that these outcomes are *reliable* in the sense that each of them is obtained from a sufficiently large sample. Ideally, they come from a cross validation or from a repeated stratified split. Moreover, all treatments must have been evaluated on the same set of observations. + +The adjustments by Holm and Bonferroni can lead to different cliques. For more information, see the [Multiple testing](#multiple-testing) section above. ## 2-dimensional sequences of CD diagrams diff --git a/docs/src/python-wrapper.md b/docs/src/python-wrapper.md index a60dee0..3e39c52 100644 --- a/docs/src/python-wrapper.md +++ b/docs/src/python-wrapper.md @@ -31,7 +31,9 @@ plot = cdd.plot( "dataset_name", # the name of the observation column "accuracy", # the name of the outcome column maximize_outcome=True, # compute ranks for minimization (default) or maximization - title="CriticalDifferenceDiagrams.jl" # give an optional title + title="CriticalDifferenceDiagrams.jl", # give an optional title + alpha=0.05, # the significance level (default: 0.05) + adjustment="holm" # "holm" (default) or "bonferroni" ) # configure the preamble of PGFPlots.jl (optional)