diff --git a/docs/articles/Tutorials.html b/docs/articles/Tutorials.html
index e54b419..46c3929 100644
--- a/docs/articles/Tutorials.html
+++ b/docs/articles/Tutorials.html
@@ -83,7 +83,7 @@

Tutorials for k-means clustering inference

 library(CADET)
 library(ggplot2)
-We first generate data according to \(\mathbf{X} \sim {MN}_{n\times q}(\boldsymbol{\mu},
+We first generate data according to \(\mathbf{X} \sim MN_{n\times q}(\boldsymbol{\mu},
 \textbf{I}_n, \sigma^2 \textbf{I}_q)\) with \(n=150,q=2,\sigma=1,\) and
 \[\begin{align}
 \label{eq:power_model}
 \boldsymbol{\mu}_1 =\ldots = \boldsymbol{\mu}_{50} = \begin{bmatrix}
@@ -95,9 +95,9 @@

Tutorials for k-means clustering inference

 \boldsymbol{\mu}_{101}=\ldots = \boldsymbol{\mu}_{150} = \begin{bmatrix}
 \delta/2 \\ 0_{q-1} \end{bmatrix}.
-\end{align}\] Here, we can think of \({C}_1 = \{1,\ldots,50\},{C}_2 =
-\{51,\ldots,100\},{C}_3 = \{101,\ldots,150\}\) as the “true
-clusters”. In the figure below, we display one such simulation \(\mathbf{x}\in\mathbb{R}^{100\times 2}\)
+\end{align}\] Here, we can think of \(C_1 = \{1,\ldots,50\},C_2 = \{51,\ldots,100\},C_3
+= \{101,\ldots,150\}\) as the “true clusters”. In the figure
+below, we display one such simulation \(\mathbf{x}\in\mathbb{R}^{150\times 2}\)
 with \(\delta=10\).

 set.seed(2022)
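Although the vignette's simulation chunk is truncated in this diff, the model is easy to sketch. In the sketch below, only the third cluster mean \((\delta/2, 0_{q-1})\) is taken from the text; the first two cluster means are cut off above, so the values used for them are assumptions chosen to give three well-separated clusters.

```r
# Hedged sketch of X ~ MN_{n x q}(mu, I_n, sigma^2 I_q): independent rows
# x_i ~ N(mu_i, sigma^2 I_q). Means for C_1 and C_2 are ASSUMED values;
# only the mean for C_3 is the one shown in the hunk above.
n <- 150; q <- 2; sigma <- 1; delta <- 10
mu <- matrix(0, n, q)
mu[1:50, 1]    <- -delta / 2            # assumed mean, C_1
mu[51:100, 2]  <- sqrt(3) * delta / 2   # assumed mean, C_2
mu[101:150, 1] <- delta / 2             # mean for C_3, as in the text
X <- mu + sigma * matrix(rnorm(n * q), n, q)
```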
@@ -187,9 +187,8 @@ Inference
 #>   cluster_1 cluster_2 test_stat  p_selective       p_naive
 #> 1         2         3  4.464756 8.514513e-29 2.171388e-110
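For concreteness, the `test_stat` and `p_naive` columns in this output can be computed by hand. Below is a hedged sketch (not package code): the object name `km` for the k-means fit and the known noise level \(\sigma = 1\) are assumptions, and unlike `kmeans_inference_1f`, this sketch does not account for the fact that the clusters were estimated from the data.

```r
# Test statistic: difference in cluster means of feature 2; p_naive: a
# two-sided z-test that ignores that the clusters came from X itself.
feat <- 2
G  <- which(km$cluster == 2)                    # \hat{G}  (assumed fit `km`)
Gp <- which(km$cluster == 3)                    # \hat{G}'
test_stat <- mean(X[G, feat]) - mean(X[Gp, feat])
se <- sqrt(1 / length(G) + 1 / length(Gp))      # sigma = 1 assumed known
p_naive <- 2 * pnorm(abs(test_stat) / se, lower.tail = FALSE)
```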

 In the summary, we have the empirical difference in means of the
-second feature between the two clusters, i.e.,\(\sum_{i\in
-{\hat{{G}}}}\mathbf{x}_{i,2}/|\hat{{G}}| - \sum_{i\in
-\hat{G}'}\mathbf{x}_{i,2}/|\hat{G}'|\)
+second feature between the two clusters, i.e., \(\sum_{i\in \hat{G}}\mathbf{x}_{i,2}/|\hat{G}| -
+\sum_{i\in \hat{G}'}\mathbf{x}_{i,2}/|\hat{G}'|\)
 (test_stats), the naive p-value based on a z-test
 (p_naive), and the selective \(p\)-value
 (p_selective). In this case, the test based on \(p_{\text{selective}}\) can reject this null
@@ -221,13 +220,12 @@ Inferen

 #> 2  0 0 50  0
 #> 3 25 0  0 25

 By inspection, we see that the blue clusters (labeled as cluster 1)
-and the grey clusters (labeled as cluster 4) have the same mean. Now
-\(p_{\text{selective}}\) yields a much
+and the grey clusters (labeled as cluster 4) have the same mean. Now the
+selective \(p\)-value yields a much
 more moderate \(p\)-value, and the test
-based on \(p_{2,\text{selective}}\)
-cannot reject the null hypothesis when it holds. By contrast, the naive
-\(p\)-value is tiny and leads to an
-anti-conservative test.
+based on it cannot reject the null hypothesis when it holds. By
+contrast, the naive \(p\)-value is tiny
+and leads to an anti-conservative test.

 cluster_1 <- 1
 cluster_2 <- 4
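The anti-conservativeness of the naive test is easy to see in a toy experiment, independent of the vignette's code: under a global null with no cluster structure at all, naive z-test p-values computed after k-means clustering concentrate near zero rather than following Uniform(0, 1).

```r
# Cluster first, then test naively: the rejection rate far exceeds the
# nominal 5% level even though no true clusters exist.
set.seed(1)
p_naive_null <- replicate(2000, {
  x  <- rnorm(40)                        # one feature, no cluster structure
  cl <- kmeans(x, centers = 2)$cluster
  d  <- mean(x[cl == 1]) - mean(x[cl == 2])
  se <- sqrt(1 / sum(cl == 1) + 1 / sum(cl == 2))
  2 * pnorm(abs(d) / se, lower.tail = FALSE)
})
mean(p_naive_null < 0.05)                # far above the nominal 0.05
```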
diff --git a/docs/articles/Tutorials_hier.html b/docs/articles/Tutorials_hier.html
index 59677d3..4ac46b2 100644
--- a/docs/articles/Tutorials_hier.html
+++ b/docs/articles/Tutorials_hier.html
@@ -156,10 +156,10 @@ Infer
 cluster_1 <- 1
 cluster_2 <- 3
-cl_1_2_inference_demo <- test_hier_clusters_exact_1f(X=X, link="average", hcl=hcl, K=3, k1=1, k2=2, feat=1)
-summary(cl_1_2_inference_demo)
+cl_inference_demo <- test_hier_clusters_exact_1f(X=X, link="average", hcl=hcl, K=3, k1=cluster_1, k2=cluster_2, feat=1)
+summary(cl_inference_demo)
 #>   cluster_1 cluster_2 test_stat  p_selective      p_naive
-#> 1         1         2  4.464756 8.870985e-08 1.774197e-07
+#> 1         1         3  9.910708 2.868783e-26 4.596766e-31

 In the summary, we have the empirical difference in means of the
 first feature between the two clusters, i.e.,\(\sum_{i\in
 {\hat{{G}}}}\mathbf{x}_{i,2}/|\hat{{G}}| - \sum_{i\in
@@ -179,15 +179,17 @@

-\(p_{2,\text{selective}}\)
-yields a much more moderate \(p\)-value, and the test based on \(p_{2,\text{selective}}\) cannot reject the
-null hypothesis when it holds. By contrast, the naive \(p\)-value is tiny and leads to an
-anti-conservative test.
+
+Now the selective \(p\)-value yields
+a much more moderate \(p\)-value, and
+the test based on it cannot reject the null hypothesis when it holds. By
+contrast, the naive \(p\)-value is tiny
+and leads to an anti-conservative test.

 cluster_1 <- 1
-cluster_2 <- 4
-cl_1_2_inference_demo <-  test_hier_clusters_exact_1f(X=X, link="average", hcl=hcl, K=3, k1=1, k2=3, feat=2)
-summary(cl_1_2_inference_demo)
+cluster_2 <- 3
+cl_inference_demo <- test_hier_clusters_exact_1f(X=X, link="average",
+                                                 hcl=hcl, K=3, k1=cluster_1, k2=cluster_2, feat=2)
+summary(cl_inference_demo)
 #>   cluster_1 cluster_2  test_stat p_selective   p_naive
 #> 1         1         3 -0.1766818   0.8362984 0.8362984
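The calls to `test_hier_clusters_exact_1f` above take a precomputed dendrogram `hcl` cut into `K = 3` clusters. A minimal sketch of that preprocessing follows; the use of plain Euclidean distance is an assumption here (see the package documentation for the exact dissimilarity the test expects).

```r
# Average-linkage hierarchical clustering of the rows of X, cut into K = 3
# clusters; cutree's labels are the k1/k2 indices used in the tests above.
hcl <- hclust(dist(X, method = "euclidean"), method = "average")
clusters <- cutree(hcl, k = 3)
table(clusters)
```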
diff --git a/docs/articles/Tutorials_hier_files/figure-html/unnamed-chunk-6-1.png b/docs/articles/Tutorials_hier_files/figure-html/unnamed-chunk-6-1.png
deleted file mode 100644
index e8a089a..0000000
Binary files a/docs/articles/Tutorials_hier_files/figure-html/unnamed-chunk-6-1.png and /dev/null differ
diff --git a/docs/articles/technical_details.html b/docs/articles/technical_details.html
index db8c5e5..629707c 100644
--- a/docs/articles/technical_details.html
+++ b/docs/articles/technical_details.html
@@ -78,7 +78,7 @@

Technical details

-Figure 1: We simulated one dataset according to \({MN}_{100\times 10}(\mu, \textbf{I}_{100},
+Figure 1: We simulated one dataset according to \(MN_{100\times 10}(\mu, \textbf{I}_{100},
 \Sigma)\), where \(\mu_i = (1,0_9)^T\)
 for \(i=1,\ldots, 50\) and \(\mu_i = (0_9,1)^T\)
@@ -160,8 +160,8 @@

 \(p_{j, \text{selective}}\), we have
-conditioned on (i) the estimated clusters \(\mathcal{C}(x)\) to account for the
-data-driven nature of the null hypothesis; and (ii) \(U(x)\) to eliminate the unknown nuisance
+conditioned on (i) the estimated clusters \({C}(x)\) to account for the data-driven
+nature of the null hypothesis; and (ii) \(U(x)\) to eliminate the unknown nuisance
 parameters under the null.

 We show that this \(p\)-value for testing \(\hat{H}_{0j}\) can be written
@@ -169,21 +169,22 @@

 \(\mathbb{F}(t;
-\mu, \sigma, \mathcal{S})\) denotes the cumulative distribution
-function (CDF) of a \(N(\mu,
-\sigma^2)\) random variable truncated to the set \(\mathcal{S}\), \(x'(\phi,j) = x + (\phi - ( \bar{x}_{\hat{G}j}
-- \bar{x}_{\hat{G'}j})) \left ( \frac{ \hat{\nu} }{
-\|\hat{\nu}\|_2^2 } \right ) \left ( \frac{\Sigma_j}{\Sigma_{jj}} \right
-)^T,\) and {\[\begin{equation}
+\mu, \sigma, {S})\) denotes the cumulative distribution function
+(CDF) of a \(N(\mu, \sigma^2)\) random
+variable truncated to the set \({S}\),
+\(x'(\phi,j) = x + (\phi - (
+\bar{x}_{\hat{G}j} - \bar{x}_{\hat{G'}j})) \left ( \frac{ \hat{\nu}
+}{ \|\hat{\nu}\|_2^2 } \right ) \left ( \frac{\Sigma_j}{\Sigma_{jj}}
+\right )^T,\) and \[\begin{equation}
 \hat S_j =
-\left \{ \phi \in \mathbb{R}: C(x) =
-\mathcal{C}\left(x'(\phi,j)\right ) \right \}. \label{eq:defS}
-\end{equation}\] }.
+\left \{ \phi \in \mathbb{R}: C(x) = {C}\left(x'(\phi,j)\right )
+\right \}. \label{eq:defS}
+\end{equation}\]
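Numerically, the selective \(p\)-value reduces to evaluating the CDF of a normal distribution truncated to \(\hat{S}_j\). A toy sketch of that evaluation when the truncation set is a union of disjoint intervals (the set `S`, the standard deviation, and the one-tailed final line are placeholders; the package derives \(\hat{S}_j\) analytically and combines both tails):

```r
# CDF at t of a N(0, sd^2) random variable truncated to S, where S is a
# two-column (lo, hi) matrix of disjoint intervals.
ptrunc_set <- function(t, sd, S) {
  num <- sum(pmax(0, pnorm(pmin(t, S[, 2]), sd = sd) - pnorm(S[, 1], sd = sd)))
  den <- sum(pnorm(S[, 2], sd = sd) - pnorm(S[, 1], sd = sd))
  num / den
}
S <- rbind(c(-8, -2), c(2, 8))       # toy truncation set, NOT a real S-hat_j
1 - ptrunc_set(4.46, sd = 1, S)      # one tail only, for illustration
```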

 While the notation in the last paragraph might seem daunting, the
 intuition is simple: since \(p_{j, \text{selective}}\) can be rewritten into sums of CDFs of
diff --git a/docs/pkgdown.yml b/docs/pkgdown.yml
index 7ba3443..dc3b4ec 100644
--- a/docs/pkgdown.yml
+++ b/docs/pkgdown.yml
@@ -5,7 +5,7 @@ articles:
   Tutorials: Tutorials.html
   Tutorials_hier: Tutorials_hier.html
   technical_details: technical_details.html
-last_built: 2023-11-26T01:22Z
+last_built: 2023-11-26T01:39Z
 urls:
   reference: https://yiqunchen.github.io/CADET/reference
   article: https://yiqunchen.github.io/CADET/articles
diff --git a/vignettes/Tutorials.Rmd b/vignettes/Tutorials.Rmd
index 273aaca..52b9c6c 100644
--- a/vignettes/Tutorials.Rmd
+++ b/vignettes/Tutorials.Rmd
@@ -23,7 +23,7 @@
 library(CADET)
 library(ggplot2)
 ```
-We first generate data according to $\mathbf{X} \sim {MN}_{n\times q}(\boldsymbol{\mu}, \textbf{I}_n, \sigma^2 \textbf{I}_q)$ with $n=150,q=2,\sigma=1,$ and
+We first generate data according to $\mathbf{X} \sim MN_{n\times q}(\boldsymbol{\mu}, \textbf{I}_n, \sigma^2 \textbf{I}_q)$ with $n=150,q=2,\sigma=1,$ and
 \begin{align}
 \label{eq:power_model}
 \boldsymbol{\mu}_1 =\ldots = \boldsymbol{\mu}_{50} = \begin{bmatrix}
@@ -35,7 +35,7 @@ We first generate data according to $\mathbf{X} \sim {MN}_{n\times q}(\boldsymbo
 \delta/2 \\ 0_{q-1} \end{bmatrix}.
 \end{align}
-Here, we can think of ${C}_1 = \{1,\ldots,50\},{C}_2 = \{51,\ldots,100\},{C}_3 = \{101,\ldots,150\}$ as the "true clusters".
+Here, we can think of $C_1 = \{1,\ldots,50\},C_2 = \{51,\ldots,100\},C_3 = \{101,\ldots,150\}$ as the "true clusters".
 In the figure below, we display one such simulation $\mathbf{x}\in\mathbb{R}^{150\times 2}$ with $\delta=10$.
 ```{r fig.align="center", fig.height = 5, fig.width = 5}
@@ -98,7 +98,7 @@
 cl_inference_demo <- kmeans_inference_1f(X, k=3, cluster_1, cluster_2,
 summary(cl_inference_demo)
 ```
-In the summary, we have the empirical difference in means of the second feature between the two clusters, i.e.,$\sum_{i\in {\hat{{G}}}}\mathbf{x}_{i,2}/|\hat{{G}}| - \sum_{i\in \hat{G}'}\mathbf{x}_{i,2}/|\hat{G}'|$ (`test_stats`), the naive p-value based on a z-test (`p_naive`), and the selective $p$-value (`p_selective`). In this case, the test based on $p_{\text{selective}}$ can reject this null hypothesis that the blue and pink clusters have the same mean in the first feature ($p_{2,\text{selective}}<0.001$).
+In the summary, we have the empirical difference in means of the second feature between the two clusters, i.e., $\sum_{i\in \hat{G}}\mathbf{x}_{i,2}/|\hat{G}| - \sum_{i\in \hat{G}'}\mathbf{x}_{i,2}/|\hat{G}'|$ (`test_stat`), the naive $p$-value based on a z-test (`p_naive`), and the selective $p$-value (`p_selective`). In this case, the test based on $p_{\text{selective}}$ can reject the null hypothesis that the blue and pink clusters have the same mean in the second feature ($p_{2,\text{selective}}<0.001$).
 ### Inference for k-means clustering when the null hypothesis holds
diff --git a/vignettes/technical_details.Rmd b/vignettes/technical_details.Rmd
index c73ae4d..cf1a0dd 100644
--- a/vignettes/technical_details.Rmd
+++ b/vignettes/technical_details.Rmd
@@ -19,7 +19,7 @@ knitr::opts_chunk$set(

 ![](../man/figures/fig_1.png){width=90%}
-Figure 1: We simulated one dataset according to ${MN}_{100\times 10}(\mu, \textbf{I}_{100}, \Sigma)$, where $\mu_i = (1,0_9)^T$ for $i=1,\ldots, 50$ and $\mu_i = (0_9,1)^T$ for $i=51,\ldots, 100$, and $\Sigma_{ij} = 1\{i=j\}+0.4\cdot 1\{i\neq j\}$. *(a)*: Empirical distribution of feature 2 based on the simulated data set. In this case, all observations have the same mean for feature 2. *(b)*: We apply k-means clustering to obtain two clusters and plot the empirical distribution of feature 2 stratified by the clusters. *(c)*: Quantile-quantile plot of naive z-test (black) our proposed p-values (orange) applied to the simulated data sets for testing the null hypotheses for a difference in means for features 2--8.
+Figure 1: We simulated one dataset according to $MN_{100\times 10}(\mu, \textbf{I}_{100}, \Sigma)$, where $\mu_i = (1,0_9)^T$ for $i=1,\ldots, 50$ and $\mu_i = (0_9,1)^T$ for $i=51,\ldots, 100$, and $\Sigma_{ij} = 1\{i=j\}+0.4\cdot 1\{i\neq j\}$. *(a)*: Empirical distribution of feature 2 based on the simulated data set. In this case, all observations have the same mean for feature 2. *(b)*: We apply k-means clustering to obtain two clusters and plot the empirical distribution of feature 2 stratified by the clusters. *(c)*: Quantile-quantile plot of the naive z-test $p$-values (black) and our proposed $p$-values (orange) applied to the simulated data sets for testing the null hypotheses of no difference in means for features 2--8.
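For completeness, a hedged sketch of the Figure 1 generative model (not the code used to produce the figure): independent rows with a compound-symmetric feature covariance.

```r
# MN_{100 x 10}(mu, I_100, Sigma): independent rows, feature covariance
# Sigma_ij = 1{i = j} + 0.4 * 1{i != j}.
n <- 100; q <- 10; rho <- 0.4
Sigma <- matrix(rho, q, q); diag(Sigma) <- 1
mu <- matrix(0, n, q)
mu[1:50, 1]   <- 1     # mu_i = (1, 0_9)^T for i = 1, ..., 50
mu[51:100, q] <- 1     # mu_i = (0_9, 1)^T for i = 51, ..., 100
X <- mu + matrix(rnorm(n * q), n, q) %*% chol(Sigma)
```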