diff --git a/docs/articles/Tutorials.html b/docs/articles/Tutorials.html
index e54b419..46c3929 100644
--- a/docs/articles/Tutorials.html
+++ b/docs/articles/Tutorials.html
@@ -83,7 +83,7 @@
-We first generate data according to \(\mathbf{X} \sim {MN}_{n\times q}(\boldsymbol{\mu},
+We first generate data according to \(\mathbf{X} \sim MN_{n\times q}(\boldsymbol{\mu},
\textbf{I}_n, \sigma^2 \textbf{I}_q)\) with \(n=150,q=2,\sigma=1,\) and \[\begin{align}
\label{eq:power_model}
\boldsymbol{\mu}_1 =\ldots = \boldsymbol{\mu}_{50} = \begin{bmatrix}
@@ -95,9 +95,9 @@ Tutorials for k-means clustering inference
\boldsymbol{\mu}_{101}=\ldots = \boldsymbol{\mu}_{150} = \begin{bmatrix}
\delta/2 \\ 0_{q-1}
\end{bmatrix}.
-\end{align}\] Here, we can think of \({C}_1 = \{1,\ldots,50\},{C}_2 =
-\{51,\ldots,100\},{C}_3 = \{101,\ldots,150\}\) as the “true
-clusters”. In the figure below, we display one such simulation \(\mathbf{x}\in\mathbb{R}^{100\times 2}\)
+\end{align}\] Here, we can think of \(C_1 = \{1,\ldots,50\},C_2 =
+\{51,\ldots,100\},C_3 = \{101,\ldots,150\}\) as the “true
+clusters”. In the figure below, we display one such simulation \(\mathbf{x}\in\mathbb{R}^{100\times 2}\)
set.seed(2022)
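As a sketch of the data-generating step described in this hunk: the means of the first two true clusters are elided by the hunk boundary, so `mu1` and `mu2` below are placeholders; only `mu3` matches the displayed \(\boldsymbol{\mu}_{101}=\ldots=\boldsymbol{\mu}_{150}\). A minimal base-R simulation under those assumptions:

```r
set.seed(2022)
n <- 150; q <- 2; sigma <- 1; delta <- 10
mu1 <- rep(0, q)                    # placeholder: elided in the hunk above
mu2 <- rep(0, q)                    # placeholder: elided in the hunk above
mu3 <- c(delta / 2, rep(0, q - 1))  # matches \mu_{101} = ... = \mu_{150}
mu <- rbind(mu1, mu2, mu3)[rep(1:3, each = n / 3), ]
# X ~ MN(mu, I_n, sigma^2 I_q): rows are independent N(mu_i, sigma^2 I_q)
X <- mu + matrix(rnorm(n * q, sd = sigma), n, q)
```

The row-wise construction works because the row covariance \(\textbf{I}_n\) makes the \(n\) observations independent.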
@@ -187,9 +187,8 @@ Inference
#> cluster_1 cluster_2 test_stat p_selective p_naive
#> 1 2 3 4.464756 8.514513e-29 2.171388e-110
In the summary, we have the empirical difference in means of the
-second feature between the two clusters, i.e.,\(\sum_{i\in
-{\hat{{G}}}}\mathbf{x}_{i,2}/|\hat{{G}}| - \sum_{i\in
-\hat{G}'}\mathbf{x}_{i,2}/|\hat{G}'|\)
+second feature between the two clusters, i.e.,\(\sum_{i\in \hat{G}}\mathbf{x}_{i,2}/|\hat{{G}}| -
+\sum_{i\in \hat{G}'}\mathbf{x}_{i,2}/|\hat{G}'|\)
(test_stats), the naive p-value based on a z-test (p_naive), and the
selective \(p\)-value (p_selective). In this case, the test based on
\(p_{\text{selective}}\) can reject this null
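For concreteness, the naive z-test behind `p_naive` can be sketched in base R. This is a hypothetical helper, not the package's implementation: it plugs in the known \(\sigma\) from the model and treats the estimated cluster labels as fixed, which is exactly what makes it "naive" post-clustering.

```r
# Naive two-sample z-test for feature j between estimated clusters k1 and k2.
naive_z_test <- function(X, cl, k1, k2, j, sigma = 1) {
  G  <- which(cl == k1)
  Gp <- which(cl == k2)
  test_stat <- mean(X[G, j]) - mean(X[Gp, j])
  # Under H0 with fixed groups, the difference is N(0, sigma^2 (1/|G| + 1/|G'|))
  se <- sigma * sqrt(1 / length(G) + 1 / length(Gp))
  2 * pnorm(abs(test_stat) / se, lower.tail = FALSE)  # two-sided p-value
}
```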
@@ -221,13 +220,12 @@
By inspection, we see that the blue clusters (labeled as cluster 1)
-and the grey clusters (labeled as cluster 4) have the same mean. Now
-\(p_{\text{selective}}\) yields a much
+and the grey clusters (labeled as cluster 4) have the same mean. Now the
+selective \(p\)-value yields a much
more moderate \(p\)-value, and the test
-based on \(p_{2,\text{selective}}\)
-cannot reject the null hypothesis when it holds. By contrast, the naive
-\(p\)-value is tiny and leads to an
-anti-conservative test.
+based on it cannot reject the null hypothesis when it holds. By
+contrast, the naive \(p\)-value is tiny
+and leads to an anti-conservative test.
cluster_1 <- 1
cluster_2 <- 4
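The anti-conservativeness noted in this hunk is easy to reproduce: cluster pure noise, then run the naive z-test on the estimated clusters, and the test rejects far more often than the nominal level. A minimal Monte Carlo sketch using base `stats::kmeans` (an illustration only, not the package's workflow):

```r
set.seed(1)
n <- 60; reps <- 200
naive_p <- replicate(reps, {
  x <- rnorm(n)                         # one feature, no true clusters
  cl <- kmeans(x, centers = 2)$cluster  # clusters estimated from the same data
  g1 <- x[cl == 1]; g2 <- x[cl == 2]
  se <- sqrt(1 / length(g1) + 1 / length(g2))  # known sigma = 1
  2 * pnorm(abs(mean(g1) - mean(g2)) / se, lower.tail = FALSE)
})
mean(naive_p < 0.05)  # rejection rate under the null, far above 0.05
```

The rejection rate is inflated because k-means deliberately splits the data into well-separated groups, so the post-clustering difference in means is large even when no true clusters exist.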
diff --git a/docs/articles/Tutorials_hier.html b/docs/articles/Tutorials_hier.html
index 59677d3..4ac46b2 100644
--- a/docs/articles/Tutorials_hier.html
+++ b/docs/articles/Tutorials_hier.html
@@ -156,10 +156,10 @@ Infer
cluster_1 <- 1
cluster_2 <- 3
-cl_1_2_inference_demo <- test_hier_clusters_exact_1f(X=X, link="average", hcl=hcl, K=3, k1=1, k2=2, feat=1)
-summary(cl_1_2_inference_demo)
+cl_inference_demo <- test_hier_clusters_exact_1f(X=X, link="average", hcl=hcl, K=3, k1=cluster_1, k2=cluster_2, feat=1)
+summary(cl_inference_demo)
#> cluster_1 cluster_2 test_stat p_selective p_naive
-#> 1 1 2 4.464756 8.870985e-08 1.774197e-07
+#> 1 1 3 9.910708 2.868783e-26 4.596766e-31
In the summary, we have the empirical difference in means of the
first feature between the two clusters, i.e.,\(\sum_{i\in
{\hat{{G}}}}\mathbf{x}_{i,1}/|\hat{{G}}| - \sum_{i\in
@@ -179,15 +179,17 @@
-\(p_{2,\text{selective}}\)
-yields a much more moderate \(p\)-value, and the test based on \(p_{2,\text{selective}}\) cannot reject the
-null hypothesis when it holds. By contrast, the naive \(p\)-value is tiny and leads to an
-anti-conservative test.
+Now the selective \(p\)-value yields
+a much more moderate \(p\)-value, and
+the test based on it cannot reject the null hypothesis when it holds. By
+contrast, the naive \(p\)-value is tiny
+and leads to an anti-conservative test.
cluster_1 <- 1
-cluster_2 <- 4
-cl_1_2_inference_demo <- test_hier_clusters_exact_1f(X=X, link="average", hcl=hcl, K=3, k1=1, k2=3, feat=2)
-summary(cl_1_2_inference_demo)
+cluster_2 <- 3
+cl_inference_demo <- test_hier_clusters_exact_1f(X=X, link="average",
+ hcl=hcl, K=3, k1=cluster_1, k2=cluster_2, feat=2)
+summary(cl_inference_demo)
#> cluster_1 cluster_2 test_stat p_selective p_naive
#> 1 1 3 -0.1766818 0.8362984 0.8362984
We show that this \(p\)-value for testing \(\hat{H}_{0j}\) can be written
@@ -169,21 +169,22 @@
While the notation in the last paragraph might seem daunting, the
intuition is simple: since \(p_{j, \text{selective}}\) can be rewritten into sums of CDFs of
diff --git a/docs/pkgdown.yml b/docs/pkgdown.yml
index 7ba3443..dc3b4ec 100644
--- a/docs/pkgdown.yml
+++ b/docs/pkgdown.yml
@@ -5,7 +5,7 @@ articles:
  Tutorials: Tutorials.html
  Tutorials_hier: Tutorials_hier.html
  technical_details: technical_details.html
-last_built: 2023-11-26T01:22Z
+last_built: 2023-11-26T01:39Z
urls:
  reference: https://yiqunchen.github.io/CADET/reference
  article: https://yiqunchen.github.io/CADET/articles
diff --git a/vignettes/Tutorials.Rmd b/vignettes/Tutorials.Rmd
index 273aaca..52b9c6c 100644
--- a/vignettes/Tutorials.Rmd
+++ b/vignettes/Tutorials.Rmd
@@ -23,7 +23,7 @@
library(CADET)
library(ggplot2)
```
-We first generate data according to $\mathbf{X} \sim {MN}_{n\times q}(\boldsymbol{\mu}, \textbf{I}_n, \sigma^2 \textbf{I}_q)$ with $n=150,q=2,\sigma=1,$ and
+We first generate data according to $\mathbf{X} \sim MN_{n\times q}(\boldsymbol{\mu}, \textbf{I}_n, \sigma^2 \textbf{I}_q)$ with $n=150,q=2,\sigma=1,$ and
\begin{align}
\label{eq:power_model}
\boldsymbol{\mu}_1 =\ldots = \boldsymbol{\mu}_{50} = \begin{bmatrix}
@@ -35,7 +35,7 @@ We first generate data according to $\mathbf{X} \sim {MN}_{n\times q}(\boldsymbo
\delta/2 \\ 0_{q-1}
\end{bmatrix}.
\end{align}
-Here, we can think of ${C}_1 = \{1,\ldots,50\},{C}_2 = \{51,\ldots,100\},{C}_3 = \{101,\ldots,150\}$ as the "true clusters".
+Here, we can think of $C_1 = \{1,\ldots,50\},C_2 = \{51,\ldots,100\},C_3 = \{101,\ldots,150\}$ as the "true clusters".
In the figure below, we display one such simulation $\mathbf{x}\in\mathbb{R}^{100\times 2}$ with $\delta=10$.

```{r fig.align="center", fig.height = 5, fig.width = 5}
@@ -98,7 +98,7 @@ cl_inference_demo <- kmeans_inference_1f(X, k=3, cluster_1, cluster_2,
summary(cl_inference_demo)
```
-In the summary, we have the empirical difference in means of the second feature between the two clusters, i.e.,$\sum_{i\in {\hat{{G}}}}\mathbf{x}_{i,2}/|\hat{{G}}| - \sum_{i\in \hat{G}'}\mathbf{x}_{i,2}/|\hat{G}'|$ (`test_stats`), the naive p-value based on a z-test (`p_naive`), and the selective $p$-value (`p_selective`). In this case, the test based on $p_{\text{selective}}$ can reject this null hypothesis that the blue and pink clusters have the same mean in the first feature ($p_{2,\text{selective}}<0.001$).
+In the summary, we have the empirical difference in means of the second feature between the two clusters, i.e.,$\sum_{i\in \hat{G}}\mathbf{x}_{i,2}/|\hat{{G}}| - \sum_{i\in \hat{G}'}\mathbf{x}_{i,2}/|\hat{G}'|$ (`test_stats`), the naive p-value based on a z-test (`p_naive`), and the selective $p$-value (`p_selective`). In this case, the test based on $p_{\text{selective}}$ can reject this null hypothesis that the blue and pink clusters have the same mean in the first feature ($p_{2,\text{selective}}<0.001$).

### Inference for k-means clustering when the null hypothesis holds
diff --git a/vignettes/technical_details.Rmd b/vignettes/technical_details.Rmd
index c73ae4d..cf1a0dd 100644
--- a/vignettes/technical_details.Rmd
+++ b/vignettes/technical_details.Rmd
@@ -19,7 +19,7 @@ knitr::opts_chunk$set(