```{r echo=FALSE}
library(tidyverse)
theme_set(theme_classic(base_size = 20))
```
# Visualizing multiple regression
First, we explore the data and plot a few associations.
```{r echo=FALSE}
library(skimr)
library(ggbeeswarm)
smoking = read_csv("smoking_data.csv")
smoking %>% skim()
smoking %>%
  ggplot(aes(x = smoke, y = lung_capacity)) +
  geom_beeswarm()
```
Does being exposed to smoking increase lung capacity?
```{r}
smoking %>%
  ggplot(aes(x = age, y = lung_capacity)) +
  geom_point()
smoking %>%
  ggplot(aes(x = age, y = lung_capacity, col = smoke)) +
  geom_beeswarm()
smoking %>%
  ggplot(aes(x = age, y = lung_capacity, col = smoke)) +
  geom_beeswarm() +
  stat_smooth(method = "lm")
```
Smoking has a negative effect once we control for age.
```{r}
library(broom)
model1 = lm(lung_capacity ~ smoke, data = smoking)
model2 = lm(lung_capacity ~ age + smoke, data = smoking)
tidy(model1)
tidy(model2)
```
Two ways to visualize this (what's another way? one more option is sketched after the next chunk):
```{r}
## 1. Plot the smoking variable against the residuals of a model that excludes smoking.
##    This doesn't always work well for visualization, but at least the x-axis is interpretable.
y_given_x1 = resid(lm(lung_capacity ~ age, data = smoking))
tibble(smoke = smoking$smoke, y_given_x1 = y_given_x1) %>%
  ggplot(aes(x = smoke, y = y_given_x1)) +
  geom_beeswarm(alpha = 0.1)
## 2. Partial residuals (an added-variable plot): this usually works, but the
##    x-axis can sometimes be less interpretable.
smoking$smoke_integer = as.numeric(as.factor(smoking$smoke))
head(smoking$smoke_integer)
x2_given_x1 = resid(lm(smoke_integer ~ age, data = smoking))
y_given_x1 = resid(lm(lung_capacity ~ age, data = smoking))
tibble(x2_given_x1, y_given_x1) %>%
  ggplot(aes(x = x2_given_x1, y = y_given_x1)) +
  geom_point() +
  stat_smooth(method = "lm")
```
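One more option, answering "what's another way?": compare the estimated smoke coefficient (with its 95% CI) across the two models. A sketch using `broom::tidy()`; the smoke term is matched by name with `grepl()` because the exact coefficient label (e.g. the factor-level suffix) depends on the data.

```{r}
## Sketch: the smoke coefficient flips sign once age is in the model
bind_rows(
  tidy(model1, conf.int = TRUE) %>% mutate(model = "smoke only"),
  tidy(model2, conf.int = TRUE) %>% mutate(model = "age + smoke")
) %>%
  filter(grepl("smoke", term)) %>%
  ggplot(aes(x = model, y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(y = "estimated effect of smoke on lung capacity")
```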
# Multiple hypothesis testing
## Why it is important
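With $m$ independent tests, each at level $\alpha = 0.05$, the chance of at least one false positive grows quickly with $m$:

$$P(\text{at least one false positive}) = 1 - (1 - \alpha)^m$$

The next chunk plots this curve.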
```{r}
m = 1:100
one_error = 1 - (1 - 0.05)^m
qplot(m, one_error, xlab = "m", ylab = "P(at least 1 false positive)")
```
A single simulation: correlate two independent random variables and test the correlation.
```{r}
set.seed(10)
x = rnorm(100)
y = rnorm(100)
cor.test(x, y)
cor.test(x, y)$p.value
```
Now repeat the simulation 10,000 times and look at the distribution of p-values; by chance alone, about 5% should fall below 0.05.
```{r}
set.seed(10)
## collect the p-values (not the correlations themselves) from 10,000 tests
random_pvals = Reduce(c, lapply(1:10000, function(iter) {
  x = rnorm(100)
  y = rnorm(100)
  cor.test(x, y)$p.value
}))
hist(random_pvals, 100)
## what proportion have p < 0.05?
sum(random_pvals < 0.05) / 10000
```
## p-value corrections
Make a vector, x, of length 1000. The first 900 entries are draws from a standard normal distribution; the last 100 are draws from a normal distribution with mean 3 and sd 1.
```{r}
set.seed(10)
x <- c(rnorm(900), rnorm(100, mean = 3))
```
For each entry of x, we test the null hypothesis that it was drawn from a standard normal distribution (mean 0), against the one-sided alternative that its mean is greater than 0.
```{r}
p <- pnorm(x, lower.tail = F)
test = p > 0.05  ## TRUE = fail to reject the null
summary(test[1:900])     ## the 900 true nulls
summary(test[901:1000])  ## the 100 real effects
```
What is the Type I error rate?
What is the Type II error rate?
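One way to compute these rates directly from the logical vector (recall that `test` is TRUE when we fail to reject):

```{r}
## Type I error rate: fraction of the 900 true nulls that are (wrongly) rejected
mean(!test[1:900])
## Type II error rate: fraction of the 100 real effects we fail to reject
mean(test[901:1000])
```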
### Bonferroni
```{r}
bonftest = p > 0.05/1000  ## Bonferroni: test at alpha / (number of tests)
summary(bonftest[1:900])
summary(bonftest[901:1000])
```
Now what are the Type I and Type II error rates?
### Benjamini-Hochberg FDR
```{r}
prank <- rank(p)  ## rank the p-values
m = 1000          ## number of tests
bh_fdr = p * m / prank
bhtest = bh_fdr > 0.05
summary(bhtest[1:900])
summary(bhtest[901:1000])
## we can get this from the p.adjust function as well
bh_fdr[990:1000]
p.adjust(p, method = "BH")[990:1000]
plot(bh_fdr[990:1000], p.adjust(p, method = "BH")[990:1000],
     xlab = "BH by hand", ylab = "BH by p.adjust")
abline(0, 1)
```
Now what are the Type I and Type II error rates?
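A caveat on the comparison above: `bh_fdr` is only the raw BH ratio, while `p.adjust()` additionally enforces monotonicity (a step-up pass from the largest p-value) and caps values at 1, so the two can disagree for some entries. A sketch of the full calculation by hand:

```{r}
## Full BH adjustment: suffix minima of the raw ratios (in order of increasing p), capped at 1
o = order(p)
bh_full = pmin(1, rev(cummin(rev(p[o] * m / seq_len(m)))))[order(o)]
all.equal(bh_full, p.adjust(p, method = "BH"))
```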
A new simulation to look at the p-value distribution underlying the q-value (Storey & Tibshirani, PNAS, 2003).
```{r}
set.seed(100)
x = c(rnorm(9000), rnorm(1000, mean = 1))
pvalues = pnorm(x, lower.tail = F)
p_tibble = tibble(pvalue = pvalues,
                  outcome = c(rep("null", 9000), rep("alternative", 1000)))
```
Under the null, the p-value distribution is uniform.
```{r}
hist(pvalues[1:9000],50)
```
Under the alternative, the p-value distribution is skewed toward 0.
```{r}
hist(pvalues[9001:10000],50)
```
Combined, the distribution is a mixture of p-values from the null and the alternative hypotheses.
```{r}
hist(pvalues,50)
```
The mixture of the two distributions, colored by outcome:
```{r}
p_tibble %>%
  ggplot(aes(x = pvalue, fill = outcome)) +
  geom_histogram(bins = 25)
```