```{r echo=FALSE}
library(tidyverse)
theme_set(theme_classic(base_size = 20))
```
# Visualizing multiple regression
First, we explore the data and plot a few associations.
```{r echo=FALSE}
library(skimr)
library(ggbeeswarm)
smoking = read_csv("smoking_data.csv")
smoking %>% skim()
smoking %>%
  ggplot(aes(x = smoke, y = lung_capacity)) +
  geom_beeswarm()
```
Does being exposed to smoking increase lung capacity?
```{r}
smoking %>%
  ggplot(aes(x = age, y = lung_capacity)) +
  geom_point()
smoking %>%
  ggplot(aes(x = age, y = lung_capacity, col = smoke)) +
  geom_beeswarm()
smoking %>%
  ggplot(aes(x = age, y = lung_capacity, col = smoke)) +
  geom_beeswarm() +
  stat_smooth(method = "lm")
```
Smoking has a negative effect once we control for age.
```{r}
library(broom)
model1 = lm(lung_capacity ~ smoke, data = smoking)
model2 = lm(lung_capacity ~ age + smoke, data = smoking)
tidy(model1)
tidy(model2)
```
Two ways to visualize this (what's another way? one more option is sketched after the next chunk):
```{r}
## 1. Plot the smoking variable against the residuals of a model that excludes smoking.
##    This doesn't always work well for visualization, but at least the x-axis is interpretable.
y_given_x1 = resid(lm(lung_capacity ~ age, data = smoking))
tibble(smoke = smoking$smoke, y_given_x1 = y_given_x1) %>%
  ggplot(aes(x = smoke, y = y_given_x1)) +
  geom_beeswarm(alpha = 0.1)
## 2. Partial residuals (an added-variable plot): this usually works, but the
##    x-axis can sometimes be less interpretable.
smoking$smoke_integer = as.numeric(as.factor(smoking$smoke))
head(smoking$smoke_integer)
x2_given_x1 = resid(lm(smoke_integer ~ age, data = smoking))
y_given_x1 = resid(lm(lung_capacity ~ age, data = smoking))
tibble(x2_given_x1, y_given_x1) %>%
  ggplot(aes(x = x2_given_x1, y = y_given_x1)) +
  geom_point() +
  stat_smooth(method = "lm")
```
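One more option, answering "what's another way?": compare the estimated smoke coefficient (with its 95% CI) across the two models. A sketch using `broom::tidy()`; the smoke term is matched by name with `grepl()` because the exact coefficient label (e.g. the factor-level suffix) depends on the data.

```{r}
## Sketch: the smoke coefficient flips sign once age is in the model
bind_rows(
  tidy(model1, conf.int = TRUE) %>% mutate(model = "smoke only"),
  tidy(model2, conf.int = TRUE) %>% mutate(model = "age + smoke")
) %>%
  filter(grepl("smoke", term)) %>%
  ggplot(aes(x = model, y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(y = "estimated effect of smoke on lung capacity")
```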
# Multiple hypothesis testing
## Why it is important
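With $m$ independent tests, each at level $\alpha = 0.05$, the chance of at least one false positive grows quickly with $m$:

$$P(\text{at least one false positive}) = 1 - (1 - \alpha)^m$$

The next chunk plots this curve.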
```{r}
m = 1:100
one_error = 1 - (1 - 0.05)^m
qplot(m, one_error, xlab = "m", ylab = "P(at least 1 false positive)")
```
A single simulation: correlate two independent random variables and test the correlation.
```{r}
set.seed(10)
x = rnorm(100)
y = rnorm(100)
cor.test(x, y)
cor.test(x, y)$p.value
```
Now repeat the simulation 10,000 times and look at the distribution of p-values; by chance alone, about 5% should fall below 0.05.
```{r}
set.seed(10)
## collect the p-values (not the correlations themselves) from 10,000 tests
random_pvals = Reduce(c, lapply(1:10000, function(iter) {
  x = rnorm(100)
  y = rnorm(100)
  cor.test(x, y)$p.value
}))
hist(random_pvals, 100)
## what proportion have p < 0.05?
sum(random_pvals < 0.05) / 10000
```
## p-value corrections
Make a vector, x, of length 1000. The first 900 entries are draws from a standard normal distribution; the last 100 are draws from a normal distribution with mean 3 and sd 1.
```{r}
set.seed(10)
x <- c(rnorm(900), rnorm(100, mean = 3))
```
For each entry of x, we test the null hypothesis that it was drawn from a standard normal distribution (mean 0), against the one-sided alternative that its mean is greater than 0.
```{r}
p <- pnorm(x, lower.tail = F)
test = p > 0.05  ## TRUE = fail to reject the null
summary(test[1:900])     ## the 900 true nulls
summary(test[901:1000])  ## the 100 real effects
```
What is the Type I error rate?
What is the Type II error rate?
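One way to compute these rates directly from the logical vector (recall that `test` is TRUE when we fail to reject):

```{r}
## Type I error rate: fraction of the 900 true nulls that are (wrongly) rejected
mean(!test[1:900])
## Type II error rate: fraction of the 100 real effects we fail to reject
mean(test[901:1000])
```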
### Bonferroni
```{r}
bonftest = p > 0.05/1000  ## Bonferroni: test at alpha / (number of tests)
summary(bonftest[1:900])
summary(bonftest[901:1000])
```
Now what are the Type I and Type II error rates?
### Benjamini-Hochberg FDR
```{r}
prank <- rank(p)  ## rank the p-values
m = 1000          ## number of tests
bh_fdr = p * m / prank
bhtest = bh_fdr > 0.05
summary(bhtest[1:900])
summary(bhtest[901:1000])
## we can get this from the p.adjust function as well
bh_fdr[990:1000]
p.adjust(p, method = "BH")[990:1000]
plot(bh_fdr[990:1000], p.adjust(p, method = "BH")[990:1000],
     xlab = "BH by hand", ylab = "BH by p.adjust")
abline(0, 1)
```
Now what are the Type I and Type II error rates?
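A caveat on the comparison above: `bh_fdr` is only the raw BH ratio, while `p.adjust()` additionally enforces monotonicity (a step-up pass from the largest p-value) and caps values at 1, so the two can disagree for some entries. A sketch of the full calculation by hand:

```{r}
## Full BH adjustment: suffix minima of the raw ratios (in order of increasing p), capped at 1
o = order(p)
bh_full = pmin(1, rev(cummin(rev(p[o] * m / seq_len(m)))))[order(o)]
all.equal(bh_full, p.adjust(p, method = "BH"))
```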
A new simulation to look at the p-value distribution underlying the q-value (Storey & Tibshirani, PNAS, 2003).
```{r}
set.seed(100)
x = c(rnorm(9000), rnorm(1000, mean = 1))
pvalues = pnorm(x, lower.tail = F)
p_tibble = tibble(pvalue = pvalues,
                  outcome = c(rep("null", 9000), rep("alternative", 1000)))
```
Under the null, the p-value distribution is uniform.
```{r}
hist(pvalues[1:9000],50)
```
Under the alternative, the p-value distribution is skewed toward 0.
```{r}
hist(pvalues[9001:10000],50)
```
Combined, the distribution is a mixture of p-values from the null and the alternative hypotheses.
```{r}
hist(pvalues,50)
```
The mixture of the two distributions, colored by outcome:
```{r}
p_tibble %>%
  ggplot(aes(x = pvalue, fill = outcome)) +
  geom_histogram(bins = 25)
```