diff --git a/linear-models/association-not-causation.qmd b/linear-models/association-not-causation.qmd
index f822bbe..4205fe9 100644
--- a/linear-models/association-not-causation.qmd
+++ b/linear-models/association-not-causation.qmd
@@ -1,12 +1,12 @@
# Association is not causation

-*Association is not causation* is perhaps the most important lesson one learns in a statistics class. *Correlation is not causation* is another way to say this. Throughout the Statistics part of the book, we have described tools useful for quantifying associations between variables. However, we must be careful not to over-interpret these associations.
+*Association is not causation* is perhaps the most important lesson one can learn in a statistics class. *Correlation is not causation* is another way to say this. Throughout the statistics part of the book, we have described tools useful for quantifying associations between variables. However, we must be careful not to over-interpret these associations.

-There are many reasons that a variable $X$ can be correlated with a variable $Y$ without having any direct effect on $Y$. Here we examine four common ways that can lead to misinterpreting data.
+There are many reasons that a variable $X$ can be correlated with a variable $Y$ without having any direct effect on $Y$. Below we examine four common situations that can lead to misinterpreting data.

## Spurious correlation

-The following comical example underscores that correlation is not causation. It shows a very strong correlation between divorce rates and margarine consumption.
+The following comical example underscores the concept that correlation is not causation. It shows a very strong correlation between divorce rates and margarine consumption.

```{r divorce-versus-margarine, echo=FALSE, message=FALSE, warning=FALSE, cache=FALSE}
library(tidyverse)
@@ -24,13 +24,13 @@ divorce_margarine |>
  ylab("Divorce rate in Maine (per 1000)")
```

-Does this mean that margarine causes divorces? Or do divorces cause people to eat more margarine? Of course the answer to both these questions is no. This is just an example of what we call a *spurious correlation*.
+Does this mean that margarine causes divorces? Or do divorces cause people to eat more margarine? Of course, the answer to both these questions is no. This is just an example of what we call a *spurious correlation*.

You can see many more absurd examples on the Spurious Correlations website[^association-not-causation-1].

[^association-not-causation-1]: http://tylervigen.com/spurious-correlations

-The cases presented in the spurious correlation site are all instances of what is generally called *data dredging*, *data fishing*, or *data snooping*. It's basically a form of what in the US they call *cherry picking*. An example of data dredging would be if you look through many results produced by a random process and pick the one that shows a relationship that supports a theory you want to defend.
+The cases presented on the website are all instances of what is generally called *data dredging*, *data fishing*, or *data snooping*. It's basically a form of what in the US they call *cherry picking*. An example of data dredging would be if you look through many results produced by a random process and pick the one that shows a relationship that supports a theory you want to defend.

A Monte Carlo simulation can be used to show how data dredging can result in finding high correlations among uncorrelated variables.
We will save the results of our simulation into a tibble:

@@ -43,7 +43,7 @@ sim_data <- tibble(group = rep(1:g, each = N),
                   y = rnorm(N*g))
```

-The first column denotes group. We created `r cat(prettyNum(g, big.mark=",",scientific=FALSE))` groups and for each one we generated a pair of independent vectors, $X$ and $Y$, with `r N` observations each, stored in the second and third columns. Because we constructed the simulation, we know that $X$ and $Y$ are not correlated.
+The first column denotes group. We created `r prettyNum(g, big.mark = ",", scientific = FALSE)` groups. For each group, we generated a pair of independent vectors, $X$ and $Y$, with `r N` observations each, stored in the second and third columns. Because we constructed the simulation, we know that $X$ and $Y$ are not correlated.

Next, we compute the correlation between `X` and `Y` for each group and look at the max:

@@ -55,7 +55,7 @@ res <- sim_data |>
res
```

-We see a maximum correlation of `r max(res$r)` and if you just plot the data from the group achieving this correlation, it shows a convincing plot that $X$ and $Y$ are in fact correlated:
+We see a maximum correlation of `r max(res$r)`. If we plot the data from the group that achieved this correlation, the plot makes a convincing case that $X$ and $Y$ are in fact correlated:

```{r dredging}
sim_data |> filter(group == res$group[which.max(res$r)]) |>
@@ -70,7 +70,7 @@ Remember that the correlation summary is a random variable. Here is the distribu
res |> ggplot(aes(x=r)) + geom_histogram(binwidth = 0.1, color = "black")
```

-It's just a mathematical fact that if we observe `r cat(prettyNum(g, big.mark=",",scientific=FALSE))` random correlations that are expected to be 0, but have a standard error of `r sd(res$r)`, the largest one will be close to 1.
+It's simply a mathematical fact that if we observe `r prettyNum(g, big.mark = ",", scientific = FALSE)` random correlations that are expected to be 0, but have a standard error of `r sd(res$r)`, the largest one will be close to 1.

If we performed regression on this group and interpreted the p-value, we would incorrectly claim this was a statistically significant relation:

@@ -82,7 +82,7 @@ sim_data |>
  filter(term == "x")
```

-This particular form of data dredging is referred to as *p-hacking*. P-hacking is a topic of much discussion because it is a problem in scientific publications. Because publishers tend to reward statistically significant results over negative results, there is an incentive to report significant results. In epidemiology and the social sciences, for example, researchers may look for associations between an adverse outcome and several exposures and report only the one exposure that resulted in a small p-value. Furthermore, they might try fitting several different models to account for confounding and pick the one that yields the smallest p-value. In experimental disciplines, an experiment might be repeated more than once, yet only the results of the one experiment with a small p-value reported. This does not necessarily happen due to unethical behavior, but rather as a result of statistical ignorance or wishful thinking. In advanced statistics courses, you can learn methods to adjust for these multiple comparisons.
+This particular form of data dredging is referred to as *p-hacking*. P-hacking is a topic of much discussion because it poses a problem in scientific publications.
Since publishers tend to reward statistically significant results over negative results, there is an incentive to report significant results. In epidemiology and the social sciences, for example, researchers may look for associations between an adverse outcome and several exposures, and report only the one exposure that resulted in a small p-value. Furthermore, they might try fitting several different models to account for confounding and choose the one that yields the smallest p-value. In experimental disciplines, an experiment might be repeated more than once, yet only the results of the one experiment with a small p-value are reported. This does not necessarily happen due to unethical behavior, but rather as a result of statistical ignorance or wishful thinking. In advanced statistics courses, you can learn methods to adjust for these multiple comparisons.

## Outliers

@@ -120,7 +120,7 @@ There is an alternative to the sample correlation for estimating the population
qplot(rank(x), rank(y))
```

-The outlier is no longer associated with a very large value and the correlation comes way down:
+The outlier is no longer associated with a very large value, and the correlation decreases substantially:

```{r}
cor(rank(x), rank(y))
```

@@ -150,7 +150,7 @@ We can easily construct an example of cause and effect reversal using the father

$$X_i = \beta_0 + \beta_1 y_i + \varepsilon_i, i=1, \dots, N$$

-to the father and son height data, with $X_i$ the father height and $y_i$ the son height, we do get a statistically significant result. We use the `galton_heights` dataset defined in Chapter @sec-regression:
+to the father and son height data, where $X_i$ is the father's height and $y_i$ is the son's height, we do get a statistically significant result. We use the `galton_heights` dataset defined in @sec-regression:

```{r, echo=FALSE, cache=FALSE}
library(HistData)
@@ -169,13 +169,13 @@ galton_heights <- GaltonFamilies |>
galton_heights |> summarize(tidy(lm(father ~ son)))
```

-The model fits the data very well. If we look at the mathematical formulation of the model above, it could easily be incorrectly interpreted so as to suggest that the son being tall caused the father to be tall. But given what we know about genetics and biology, we know it's the other way around. The model is technically correct. The estimates and p-values were obtained correctly as well. What is wrong here is the interpretation.
+The model fits the data very well. If we examine the mathematical formulation of the model above, it could easily be misinterpreted so as to suggest that the son being tall caused the father to be tall. However, based on our understanding of genetics and biology, we know it's the other way around. The model is technically correct. The estimates and p-values were obtained correctly as well. What is wrong here is the interpretation.

## Confounders

Confounders are perhaps the most common reason that leads to associations being misinterpreted.

-If $X$ and $Y$ are correlated, we call $Z$ a *confounder* if changes in $Z$ causes changes in both $X$ and $Y$. Earlier, when studying baseball data, we saw how Home Runs was a confounder that resulted in a higher correlation than expected when studying the relationship between Bases on Balls and Runs. In some cases, we can use linear models to account for confounders. However, this is not always the case.
+If $X$ and $Y$ are correlated, we call $Z$ a *confounder* if changes in $Z$ cause changes in both $X$ and $Y$.
Earlier, when studying baseball data, we saw how Home Runs were a confounder that resulted in a higher correlation than expected when studying the relationship between Bases on Balls and Runs. In some cases, we can use linear models to account for confounders. However, this is not always the case.

Incorrect interpretation due to confounders is ubiquitous in the lay press, and confounders are often hard to detect. Here, we present a widely used example related to college admissions.

@@ -204,7 +204,7 @@ Four out of the six majors favor women. More importantly, all the differences ar

The paradox is that analyzing the totals suggests a dependence between admission and gender, but when the data is grouped by major, this dependence seems to disappear. What's going on? This actually can happen if an uncounted confounder is driving most of the variability.

-So let's define three variables: $X$ is 1 for men and 0 for women, $Y$ is 1 for those admitted and 0 otherwise, and $Z$ quantifies the selectivity of the major. A gender bias claim would be based on the fact that $\mbox{Pr}(Y=1 | X = x)$ is higher for $x=1$ than $x=0$. However, $Z$ is an important confounder to consider. Clearly $Z$ is associated with $Y$, as the more selective a major, the lower $\mbox{Pr}(Y=1 | Z = z)$. But is major selectivity $Z$ associated with gender $X$?
+So let's define three variables: $X$ is 1 for men and 0 for women, $Y$ is 1 for those admitted and 0 otherwise, and $Z$ quantifies the selectivity of the major. A gender bias claim would be based on the fact that $\mbox{Pr}(Y=1 | X = x)$ is higher for $x=1$ than for $x=0$. However, $Z$ is an important confounder to consider. Clearly, $Z$ is associated with $Y$, as the more selective a major, the lower $\mbox{Pr}(Y=1 | Z = z)$. But is major selectivity $Z$ associated with gender $X$?

One way to see this is to plot the total percent admitted to a major versus the percent of women that made up the applicants:

@@ -218,11 +218,11 @@ admissions |>
  geom_text()
```

-There seems to be association. The plot suggests that women were much more likely to apply to the two "hard" majors: gender and major's selectivity are confounded. Compare, for example, major B and major E. Major E is much harder to enter than major B and over 60% of applicants to major E were women, while less than 30% of the applicants of major B were women.
+There seems to be an association. The plot suggests that women were much more likely to apply to the two "hard" majors, indicating that gender and major selectivity are confounded. Compare, for example, major B and major E. Major E is much harder to enter than major B, and over 60% of applicants for major E were women, while less than 30% of the applicants for major B were women.

### Confounding explained graphically

-The following plot shows the number of applicants that were admitted and those that were not by:
+The following plot shows the number of applicants that were admitted, and those that were not, by gender:

```{r confounding, echo=FALSE}
admissions |>
@@ -235,7 +235,7 @@ admissions |>
```

-It also breaks down the acceptances by major. This breakdown allows us to see that the majority of accepted men came from two majors: A and B. It also lets us see that few women applied to these majors.
+It also breaks down the acceptances by major. This breakdown allows us to see that the majority of accepted men came from two majors, A and B. It also reveals that few women applied to these majors.
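+As a complementary numeric check, we can compute each major's overall selectivity together with the share of women among its applicants. This is a sketch that assumes the `admissions` table used above has columns `major`, `gender` (taking the values `men` and `women`), `admitted` (a percent), and `applicants`:
+
+```{r}
+# For each major: overall percent admitted (selectivity) and the
+# percent of applicants who were women.
+admissions |>
+  group_by(major) |>
+  summarize(percent_admitted = sum(admitted*applicants)/sum(applicants),
+            percent_women = 100*sum(applicants*(gender == "women"))/sum(applicants))
+```
+
+The less selective majors should show the smallest shares of women applicants, consistent with what the plot suggests.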
### Average after stratifying

@@ -257,7 +257,7 @@ admissions |> group_by(gender) |> summarize(average = mean(admitted))

## Simpson's paradox

-The case we have just covered is an example of Simpson's paradox. It is called a paradox because we see the sign of the correlation flip when comparing the entire publication and specific strata. As an illustrative example, suppose you have three random variables $X$, $Y$, and $Z$ and that we observe realizations of these. Here is a plot of simulated observations for $X$ and $Y$ along with the sample correlation:
+The case we have just covered is an example of Simpson's paradox. It is called a paradox because we see the sign of the correlation flip when comparing the entire population to specific strata. As an illustrative example, suppose you have three random variables $X$, $Y$, and $Z$, and we observe realizations of these. Here is a plot of simulated observations for $X$ and $Y$ along with the sample correlation:

```{r simpsons-paradox, echo=FALSE}
N <- 100
@@ -281,7 +281,7 @@ dat |> ggplot(aes(x, y)) +
  geom_point(alpha = 0.5) +
  ggtitle(paste("Correlation = ", round(cor(dat$x, dat$y), 2)))
```

-You can see that $X$ and $Y$ are negatively correlated. However, once we stratify by $Z$ (shown in different colors below) another pattern emerges:
+You can see that $X$ and $Y$ are negatively correlated. However, once we stratify by $Z$ (shown in different colors below), another pattern emerges:

```{r simpsons-paradox-explained, echo=FALSE}
means <- do.call(rbind, means) |>
@@ -296,7 +296,7 @@ dat |> ggplot(aes(x, y, color = z)) +
  annotate("text", x = means$x, y = means$y, label = paste("z =", means$z), cex = 5)
```

-It is really $Z$ that is negatively correlated with $X$. If we stratify by $Z$, the $X$ and $Y$ are actually positively correlated as seen in the plot above.
+It is really $Z$ that is negatively correlated with $X$. If we stratify by $Z$, then $X$ and $Y$ are actually positively correlated, as seen in the plot above.

## Exercises

@@ -312,9 +312,9 @@ A response[^association-not-causation-4] was published a few months later titled

> However, the overall gender effect borders on statistical significance, despite the large sample. Moreover, their conclusion could be a prime example of Simpson's paradox; if a higher percentage of women apply for grants in more competitive scientific disciplines (i.e., with low application success rates for both men and women), then an analysis across all disciplines could incorrectly show "evidence" of gender inequality.

-Who is right here? The original paper or the response? Here, you will examine the data and come to your own conclusion.
+Who is correct here, the original paper or the response? Below, you will examine the data and come to your own conclusion.

-1\. The main evidence for the conclusion of the original paper comes down to a comparison of the percentages. Table S1 in the paper includes the information we need:
+1\. The primary evidence for the conclusion of the original paper relies on a comparison of the percentages. Table S1 in the paper includes the information we need:

```{r,eval=FALSE}
library(dslabs)
@@ -327,12 +327,12 @@ Construct the two-by-two table used for the conclusion about differences in awar

3\. In the previous exercise, we noticed that the success rate is lower for women. But is it significant? Compute a p-value using a Chi-square test.

-4\. We see that the p-value is about 0.05. So there appears to be some evidence of an association. But can we infer causation here?
Is gender bias causing this observed difference? The response to the original paper claims that what we see here is similar to the UC Berkeley admissions example. Specifically they state that this "could be a prime example of Simpson's paradox; if a higher percentage of women apply for grants in more competitive scientific disciplines, then an analysis across all disciplines could incorrectly show 'evidence' of gender inequality." To settle this dispute, create a dataset with number of applications, awards, and success rate for each gender. Re-order the disciplines by their overall success rate. Hint: use the `reorder` function to re-order the disciplines in a first step, then use `pivot_longer`, `separate`, and `pivot_wider` to create the desired table. +4\. We see that the p-value is about 0.05. So there appears to be some evidence of an association. But can we infer causation here? Is gender bias causing this observed difference? The response to the original paper claims that what we see here is similar to the UC Berkeley admissions example. Specifically, they state that this "could be a prime example of Simpson's paradox; if a higher percentage of women apply for grants in more competitive scientific disciplines, then an analysis across all disciplines could incorrectly show 'evidence' of gender inequality." To settle this dispute, create a dataset with number of applications, awards, and success rate for each gender. Re-order the disciplines by their overall success rate. Hint: use the `reorder` function to re-order the disciplines in a first step, then use `pivot_longer`, `separate`, and `pivot_wider` to create the desired table. 5\. To check if this is a case of Simpson's paradox, plot the success rates versus disciplines, which have been ordered by overall success, with colors to denote the genders and size to denote the number of applications. -6\. We definitely do not see the same level of confounding as in the UC Berkeley example. It is hard to say there is a confounder here. However, we do see that, based on the observed rates, some fields favor men and others favor women and we do see that the two fields with the largest difference favoring men are also the fields with the most applications. But, unlike the UC Berkeley example, women are not more likely to apply for the harder subjects. So perhaps some of the selection committees are biased and others are not. +6\. We definitely do not see the same level of confounding as in the UC Berkeley example. It is hard to say that there is a confounder here. However, we do see that, based on the observed rates, some fields favor men and others favor women. We also see that the two fields with the largest difference favoring men are also the fields with the most applications. But, unlike the UC Berkeley example, women are not more likely to apply for the harder subjects. So perhaps some of the selection committees are biased and others are not. -But, before we conclude this, we must check if these differences are any different than what we get by chance. Are any of the differences seen above statistically significant? Keep in mind that even when there is no bias, we will see differences due to random variability in the review process as well as random variability across candidates. Perform a Chi-square test for each discipline. Hint: define a function that receives the total of a two-by-two table and returns a data frame with the p-value. Use the 0.5 correction. Then use the `summarize` function. 
+However, before concluding this, we must check whether any of the differences seen above are statistically significant or could simply be due to chance. Keep in mind that even when there is no bias, we will see differences due to random variability in the review process as well as random variability across candidates. Perform a Chi-square test for each discipline. Hint: define a function that receives the total of a two-by-two table and returns a data frame with the p-value. Use the 0.5 correction. Then use the `summarize` function.

-7\. For the medical sciences, there appears to be a statistically significant difference. But is this a spurious correlation? We performed 9 tests. Reporting only the one case with a p-value less than 0.05 might be considered an example of cherry picking. Repeat the exercise above, but instead of a p-value, compute a log odds ratio divided by their standard error. Then use qq-plot to see how much these log odds ratios deviate from the normal distribution we would expect: a standard normal distribution.
+7\. In the medical sciences, there appears to be a statistically significant difference, but could this be a spurious correlation? We performed 9 tests. Reporting only the one case with a p-value less than 0.05 might be considered an example of cherry picking. Repeat the exercise above, but instead of a p-value, compute the log odds ratios divided by their standard errors. Then use a qq-plot to see how much these log odds ratios deviate from the distribution we would expect under the null: a standard normal distribution.

diff --git a/linear-models/association-tests.qmd b/linear-models/association-tests.qmd
index 3faacad..11869cb 100644
--- a/linear-models/association-tests.qmd
+++ b/linear-models/association-tests.qmd
@@ -4,9 +4,9 @@ set.seed(1984)
```

-The statistical models studied up to now are appropriate for continuous outcomes. We have not yet discussed inference for binary, categorical, and ordinal data. To give a very specific example, we will consider a case study examining funding success rates in the Netherlands, by gender.
+The statistical models studied up to now are appropriate for continuous outcomes. We have not yet discussed inference for binary, categorical, and ordinal data. To give a very specific example, we will consider a case study examining funding success rates in the Netherlands, categorized by gender.

-## Case study: funding success rates
+## Case study: Funding success rates

A 2014 PNAS paper[^association-tests-1] analyzed success rates from funding agencies in the Netherlands and concluded that their:

> results reveal gender bias favoring male applicants over female applicants in the prioritization of their "quality of researcher" (but not "quality of proposal") evaluations and success rates, as well as in the language use in instructional and evaluation materials.

-The main evidence for this conclusion comes down to a comparison of the percentages. Table S1 in the paper includes the information we need. Here are the three columns showing the overall outcomes:
+The main evidence supporting this conclusion is based on a comparison of the percentages. Table S1 in the paper includes the information we need. Here are the three columns showing the overall outcomes:

```{r, message=FALSE, warning=FALSE, echo=FALSE, cache=FALSE}
library(tidyverse)
@@ -62,11 +62,11 @@ R.A.
Fisher[^association-tests-2] was one of the first to formalize hypothesis t [^association-tests-2]: https://en.wikipedia.org/wiki/Ronald_Fisher -The story is as follows: an acquaintance of Fisher's claimed that she could tell if milk was added before or after tea was poured. Fisher was skeptical. He designed an experiment to test this claim. He gave her four pairs of cups of tea: one with milk poured first, the other after. The order was randomized. The null hypothesis here is that she is guessing. Fisher derived the distribution for the number of correct picks on the assumption that the choices were random and independent. +The story is as follows: an acquaintance of Fisher's claimed that she could tell if milk was added before or after tea was poured. Fisher was skeptical and, consequently, designed an experiment to test this claim. He gave her four pairs of cups of tea: one with milk poured first, the other after. The order was randomized. The null hypothesis here is that she is guessing. Fisher derived the distribution for the number of correct picks on the assumption that the choices were random and independent. -As an example, suppose she picked 3 out of 4 correctly. Do we believe she has a special ability? The basic question we ask is: if the tester is actually guessing, what are the chances that she gets 3 or more correct? Just as we have done before, we can compute a probability under the null hypothesis that she is guessing 4 of each. Under this null hypothesis, we can think of this particular example as picking 4 balls out of an urn with 4 blue (correct answer) and 4 red (incorrect answer) balls. Remember, she knows that there are four before tea and four after. +As an example, suppose she identified 3 out of 4 correctly. Do we believe she has a special ability? The basic question we ask is: if the tester is actually guessing, what are the chances that she gets 3 or more correct? Just as we have done before, we can compute a probability under the null hypothesis that she is guessing for all 4. Under this null hypothesis, we can think of this particular example as picking 4 balls out of an urn with 4 blue (correct answer) and 4 red (incorrect answer) balls. Remember, she knows that there are four before tea and four after. -Under the null hypothesis that she is simply guessing, each ball has the same chance of being picked. We can then use combinations to figure out each probability. The probability of picking 3 is $\binom{4}{3} \binom{4}{1} / \binom{8}{4} = 16/70$. The probability of picking all 4 correct is $\binom{4}{4} \binom{4}{0} / \binom{8}{4}= 1/70$. Thus, the chance of observing a 3 or something more extreme, under the null hypothesis, is $\approx 0.24$. This is the p-value. The procedure that produced this p-value is called *Fisher's exact test* and it uses the *hypergeometric distribution*. +Under the null hypothesis that she is simply guessing, each ball has the same chance of being picked. We can then use combinations to determine each probability. The probability of picking 3 is $\binom{4}{3} \binom{4}{1} / \binom{8}{4} = 16/70$. The probability of picking all 4 correct is $\binom{4}{4} \binom{4}{0} / \binom{8}{4}= 1/70$. Thus, the chance of observing a 3 or something more extreme, under the null hypothesis, is $\approx 0.24$. This is the p-value. The procedure that produced this p-value is called *Fisher's exact test* and it uses the *hypergeometric distribution*. 
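+As a quick check, this tail probability is easy to compute in R. The following sketch uses the combinatorics above and, equivalently, the built-in hypergeometric distribution:
+
+```{r}
+# Probability of guessing 3 or 4 cups correctly out of 4, when drawing 4 cups
+# from an urn with 4 "milk first" and 4 "tea first" cups under random guessing:
+sum(choose(4, 3:4)*choose(4, 1:0))/choose(8, 4)
+# The same tail probability using the hypergeometric distribution:
+sum(dhyper(3:4, m = 4, n = 4, k = 4))
+```
+
+Both return $17/70 \approx 0.24$, the p-value derived above.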
## Two-by-two tables

@@ -79,7 +79,7 @@ colnames(tab) <- c("Guessed before", "Guessed after")
tab
```

-These are referred to as a two-by-two table. For each of the four combinations one can get with a pair of binary variables, they show the observed counts for each occurrence.
+This is referred to as a two-by-two table. For each of the four combinations that can result from a pair of binary variables, it displays the observed counts.

The function `fisher.test` performs the inference calculations above:

```{r}
fisher.test(tab, alternative = "greater")$p.value
```

## Chi-square Test

-Notice that, in a way, our funding rates example is similar to the Lady Tasting Tea. However, in the Lady Tasting Tea example, the number of blue and red beads is experimentally fixed and the number of answers given for each category is also fixed. This is because Fisher made sure there were four cups with milk poured before tea and four cups with milk poured after and the lady knew this, so the answers would also have to include four before and four afters. If this is the case, the sum of the rows and the sum of the columns are fixed. This defines constraints on the possible ways we can fill the two by two table and also permits us to use the hypergeometric distribution. In general, this is not the case. Nonetheless, there is another approach, the Chi-squared test, which is described below.
+Notice that, in a sense, our funding rates example is similar to the Lady Tasting Tea. However, in the Lady Tasting Tea example, the number of blue and red balls is experimentally fixed and the number of answers given for each category is also fixed. This is because Fisher ensured there were four cups with milk poured before tea and four cups with milk poured after, and the lady knew this. Therefore, the answers would also have to include four "before" and four "after" guesses. In this case, the sum of the rows and the sum of the columns are fixed. This defines constraints on the possible ways we can fill the two by two table and also allows us to use the hypergeometric distribution. In general, this is not the case. Nonetheless, there is another approach, the Chi-squared test, which is described below.

-Imagine we have `r prettyNum(totals, ,big.mark=",")` applicants, some are men and some are women and some get funded, whereas others don't. We saw that the success rates for men and woman were:
+Imagine we have a total of `r prettyNum(sum(totals), big.mark = ",")` applicants: some are men and some are women, and some get funded while others do not. We saw that the success rates for men and women respectively were:

```{r}
totals |> summarize(percent_men = yes_men/(yes_men + no_men), percent_women = yes_women/(yes_women + no_women))
```

-respectively. Would we see this again if we randomly assign funding at the overall rate:
+Would we see this difference again if we randomly assigned funding at the overall rate:

```{r}
rate <- with(totals, (yes_men + yes_women))/sum(totals)
@@ -124,7 +124,7 @@ e <- with(totals, data.frame(men = (no_men + yes_men) * c(1 - rate, rate),
e
```

-We can see that more men than expected and fewer women than expected received funding. However, under the null hypothesis these observations are random variables. The Chi-square statistic quantifies how much the observed tables deviates from the expected by
+We can see that more men than expected and fewer women than expected received funding. However, under the null hypothesis these observations are random variables.
The Chi-square statistic quantifies how much the observed table deviates from the expected one by:

1. Taking the difference between each observed and expected cell value.
2. Squaring this difference.
@@ -148,7 +148,7 @@ chisq_test$p.value
```

:::{.callout-note}
-By default the `chisq.test` function applies a _continuity correction_. This correction is particularly useful when a cell in the table has values close to 0 as it avoid low observed values inflating the statistics. This achieved by subtracting 0.5 in the following way:
+By default, the `chisq.test` function applies a _continuity correction_. This correction is particularly useful when a cell in the table has values close to 0, as it prevents low observed values from inflating the statistic. This is achieved by subtracting 0.5 in the following way:

```{r}
sum((abs(o - e) - 0.5)^2/e)
```
@@ -166,7 +166,7 @@ chisq.test(o)$statistic
```

## Generalized linear models {#sec-glm}

-We presented a way to perform hypothesis testing for determining if there is association between two binary outcome. But we have not yet described how to quantify effects. Can we estimate the effect of being a woman in funding success in the Netherlands? Note that if our outcomes are binary, then the linear models presented in the Chapter @sec-treatment-effect-models are not appropriate because the $\beta$s and $\varepsilon$ are continuous. However, an adaptation of these methods, that is widely used in, for example, medical studies, gives us a way to estimate effects along with their standard errors.
+We presented a way to perform hypothesis testing for determining if there is an association between two binary outcomes. But we have not yet described how to quantify effects. Can we estimate the effect of being a woman on funding success in the Netherlands? Note that if our outcomes are binary, then the linear models presented in @sec-treatment-effect-models are not appropriate because the $\beta$s and $\varepsilon$ are continuous. However, an adaptation of these methods, widely used in medical studies, for example, gives us a way to estimate effects along with their standard errors.

The idea is to model a transformation of the expected value of the outcomes with a linear model. The transformation is selected so that any continuous value is possible. The mathematical equation for a model with one variable looks like this:

$$
g\{\mbox{E}(Y_i)\} = \beta_0 + \beta_1 x_i
$$

-To finish describing the model we impose a distribution on $Y$ such as binomial or Poisson. These are referred to as _generalized linear models_.
+To finish describing the model, we impose a distribution on $Y$, such as binomial or Poisson. These are referred to as _generalized linear models_.

-We demonstrate with the funding rates example. We define $Y_i$ to be 1 if person $i$ received funding and 0 otherwise and $x_i$ to be 1 for person $i$ is a women and 0 for men. For this data the expected value of $Y_i$ is the probability of funding for person $i$ $\mbox{Pr}(Y_i=1)$. We assume the outcomes $Y_i$ are binomial with $N=1$ and probability $p_i$. For binomial data, the most widely used transformation is the logit function $g(p) = \log \{p / (1-p)\}$ which takes numbers between 0 and 1 to any continuous number. The model looks like this:
+We illustrate this with the funding rates example. We define $Y_i$ to be 1 if person $i$ received funding and 0 otherwise, and $x_i$ to be 1 if person $i$ is a woman and 0 if they are a man.
For this data, the expected value of $Y_i$ is the probability of funding for person $i$, $\mbox{Pr}(Y_i=1)$. We assume the outcomes $Y_i$ are binomial, with $N=1$ and probability $p_i$. For binomial data, the most widely used transformation is the logit function $g(p) = \log \{p / (1-p)\}$, which takes numbers between 0 and 1 to any continuous number. The model looks like this:

$$
@@ -185,7 +185,7 @@ $$

### The odds ratio {#sec-odds-ratio}

-To understand how $\beta_1$ can be used to quantify the effect of being a woman on success rates, first note that $\mbox{Pr}(Y_i=1)/\{1-\mbox{Pr}(Y_i=1)\} = \mbox{Pr}(Y_i=1)/\mbox{Pr}(Y_i=0)$ is the _odds_ of person $i$ getting funding: the ratio of the probability of success and probability of failure. This implies that $e^{\beta_0}$ is the odds for men and $e^{\beta_0}e^{\beta_1}$ is the odds for women, which implies $e^{\beta_1}$ is the odds for women divided by the odds for men. This quantity is called the _odds ratio_. To see this not that if use $p_1$ and $p_0$ to denote the probability of success for women and men, respectively, then $e^\{beta_1$ can be rewritten as
+To understand how $\beta_1$ can be used to quantify the effect of being a woman on success rates, first note that $\mbox{Pr}(Y_i=1)/\{1-\mbox{Pr}(Y_i=1)\} = \mbox{Pr}(Y_i=1)/\mbox{Pr}(Y_i=0)$ is the _odds_ of person $i$ getting funding: the ratio of the probability of success and probability of failure. This implies that $e^{\beta_0}$ is the odds for men and $e^{\beta_0}e^{\beta_1}$ is the odds for women, which implies $e^{\beta_1}$ is the odds for women divided by the odds for men. This quantity is called the _odds ratio_. To see this, note that if we use $p_1$ and $p_0$ to denote the probability of success for women and men, respectively, then $e^{\beta_1}$ can be rewritten as:

$$
e^{\beta_1} = \frac{p_1}{1-p_1} \, / \, \frac{p_0}{1-p_0}
$$
@@ -194,12 +194,12 @@ $$

$\beta_1$ therefore quantifies the _log odds ratio_.

-Now how do we estimate these parameters? Although the details are not described in this book, least squares is no longer an optimal way of estimating the parameters and instead we use an approach called _maximum likelihood estimation_ (MLE). More advanced mathematical derivations show that a version of the central limit theorem applies and the estimates obtained this way are approximately normal when th number of observations is large. The theory also provides a way to calculate standard errors for the estimates of the $\beta$s.
+Now how do we estimate these parameters? Although the details are not described in this book, least squares is no longer an optimal way of estimating the parameters and instead we use an approach called _maximum likelihood estimation_ (MLE). More advanced mathematical derivations show that a version of the central limit theorem applies, and the estimates obtained this way are approximately normal when the number of observations is large. The theory also provides a way to calculate standard errors for the estimates of the $\beta$s.

### Fitting the model

-To obtain the maximum likelihood estimates using R we can use the `glm` function with the `family` argument set to `binomial`. This defaults to using the logit transformation.
Note that we do not have the individual level data, but because we our model assumes the probability of success is the same for all women and all men, then the number of success can be modeled as binomial with $N_1$ trials and probability $p_1$ for women and binomial with $N_0$ trials and probability $p_0$ for men, with $N_1$ and $N_0$ the total number of women and men. In this case the `glm` function is used like this:
+To obtain the maximum likelihood estimates using R, we can use the `glm` function with the `family` argument set to `binomial`. This defaults to using the logit transformation. Note that we do not have the individual level data, but because our model assumes the probability of success is the same for all women and all men, the number of successes can be modeled as binomial with $N_1$ trials and probability $p_1$ for women and binomial with $N_0$ trials and probability $p_0$ for men, where $N_1$ and $N_0$ are the total number of women and men. In this case, the `glm` function is used as follows:

```{r}
success <- with(totals, c(yes_men, yes_women))
@@ -209,7 +209,7 @@ fit <- glm(cbind(success, failure) ~ gender, family = "binomial")
coefficients(summary(fit))
```

-The estimate of the odds ratio is `r exp(fit$coef[2])` which is interpreted as the odds being lowered by 20% for women as compared to men. But is this due to chance? We already noted that the p-value is about 0.05, but the GLM approach also permits us to compute confidence intervals using the `confint` function. To show the interval for the more interpretable odds ratio we simply exponentiate:
+The estimate of the odds ratio is `r exp(fit$coef[2])`, interpreted as the odds being lowered by 20% for women compared to men. But is this due to chance? We already noted that the p-value is about 0.05, but the GLM approach also permits us to compute confidence intervals using the `confint` function. To show the interval for the more interpretable odds ratio, we simply exponentiate:

```{r}
#| message: false
exp(confint(fit, 2))
```

:::{.callout-note}
-We have used a simple version of GLMs in which the only variable is binary. However, the method can be expanded to use multiple variables including continuous ones. However, in these contexts the log odds ratio interpretation becomes more complex. Also note that we have shown just one version of GLM appropriate for binomial data using a logit transformation. This version is referred to often referred to as _logistic regression_. However, GLM can be used with other transformation and distributions. You can learn more by consulting a GLM text book.
+We have used a simple version of GLMs in which the only variable is binary. However, the method can be expanded to incorporate multiple variables, including continuous ones. In these contexts, the log odds ratio interpretation becomes more complex. Also, note that we have shown just one version of GLM appropriate for binomial data using a logit transformation. This version is often referred to as _logistic regression_. Nevertheless, GLM can be used with other transformations and distributions. You can learn more by consulting a GLM textbook.
:::

### Simple standard error approximation for two-by-two table odds ratio

-Using `glm` we can obtain estimates, standard errors, and confidence intervals for a wide range of models. To do this we use a rather complex algorithms. In the case of two-by-two tables we can obtain a standard error for the log odds ratio using a simple approximation.
+Using `glm`, we can obtain estimates, standard errors, and confidence intervals for a wide range of models. To do this, we use rather complex algorithms. In the case of two-by-two tables, we can obtain a standard error for the log odds ratio using a simple approximation.

-If our two-by-two tables has the following entries:
+Suppose our two-by-two table has the following entries:

```{r, echo=FALSE}
mat <- cbind(c(" a "," c "), c(" b "," d "))
@@ -239,7 +239,7 @@ if(knitr::is_html_output()){
}
```

-In this case, the odds ratio is simply $\frac{a/c}{b/d} = \frac{ad}{bc}$. We can confirm we obtain the same estimate as when using `glm`:
+In this case, the odds ratio is simply $\frac{a/c}{b/d} = \frac{ad}{bc}$. We can confirm that we obtain the same estimate as when using `glm`:

```{r}
two_by_two <- with(totals, data.frame(awarded = c("no", "yes"),
@@ -251,7 +251,7 @@
c(log(or), fit$coef[2])
```

-Statistical theory tells us that when all four entries of the two-by-two table are large enough, then the log odds ratio is approximately normal with standard error
+Statistical theory tells us that when all four entries of the two-by-two table are large enough, the log odds ratio is approximately normal with standard error:

$$
\sqrt{1/a + 1/b + 1/c + 1/d}
$$
@@ -263,9 +263,9 @@

$$
\log\left(\frac{ad}{bc}\right) \pm 1.96 \sqrt{1/a + 1/b + 1/c + 1/d}
$$

-By exponentiating these two numbers we can construct a confidence interval of the odds ratio.
+By exponentiating these two numbers, we can construct a confidence interval for the odds ratio.

-Using R we can compute this confidence interval as follows:
+Using R, we can compute this confidence interval as follows:

@@ -274,22 +274,22 @@ se <- two_by_two |> select(-awarded) |>
exp(log(or) + c(-1,1) * qnorm(0.975) * se)
```

-Note that 1 is not included in the confidence interval which must mean that the p-value is smaller than 0.05. We can confirm this using:
+Note that 1 is not included in the confidence interval, implying that the p-value is smaller than 0.05. We can confirm this using:

```{r}
2*(1 - pnorm(abs(log(or)), 0, se))
```

:::{.callout-warning}
-Note that the p-values obtained with `chisq.test`, `glm` and this simple approximation are all slightly different.
+Keep in mind that the p-values obtained with `chisq.test`, `glm`, and this simple approximation are all slightly different. This is because they are based on different approximations.
:::

## Large samples, small p-values

-As mentioned earlier, reporting only p-values is not an appropriate way to report the results of data analysis. In scientific journals, for example, some studies seem to overemphasize p-values. Some of these studies have large sample sizes and report impressively small p-values. Yet when one looks closely at the results, we realize odds ratios are quite modest: barely bigger than 1. In this case the difference may not be *practically significant* or *scientifically significant*.
+As mentioned earlier, reporting only p-values is not an appropriate way to report the results of data analysis. In scientific journals, for example, some studies seem to overemphasize p-values. Some of these studies have large sample sizes and report impressively small p-values. Yet by looking closely at the results, we realize that the odds ratios are quite modest: barely bigger than 1. In this case, the difference may not be *practically significant* or *scientifically significant*.
-Note that the relationship between odds ratio and p-value is not one-to-one. It depends on the sample size. So a very small p-value does not necessarily mean a very large odds ratio. Notice what happens to the p-value if we multiply our two-by-two table by 10, which does not change the odds ratio:
+Note that the relationship between odds ratio and p-value is not one-to-one; it depends on the sample size. Therefore, a very small p-value does not necessarily mean a very large odds ratio. Observe what happens to the p-value if we multiply our two-by-two table by 10, which does not change the odds ratio:

```{r}
two_by_two_x_10 <- two_by_two |>
@@ -300,12 +300,12 @@
chisq.test(two_by_two_x_10)$p.value
```

:::{.callout-note}
-Note that the log odds ratio is not defined if any of the cells of the two-by-two table is 0. This is because if $a$, $b$, $c$, or $d$ is 0, the $\log(\frac{ad}{bc})$ is either the log of 0 or has a 0 in the denominator. For this situation, it is common practice to avoid 0s by adding 0.5 to each cell. This is referred to as the *Haldane--Anscombe correction* and has been shown, both in practice and theory, to work well.
+Also, note that the log odds ratio is not defined if any of the cells of the two-by-two table is 0. This is because if $a$, $b$, $c$, or $d$ is 0, then $\log(\frac{ad}{bc})$ is either the log of 0 or has a 0 in the denominator. For this situation, it is common practice to avoid 0s by adding 0.5 to each cell. This is referred to as the *Haldane--Anscombe correction* and has been shown, both in practice and theory, to work well.
:::

## Exercises

-1\. A famous athlete has an impressive career, winning 70% of her 500 career matches. However, this athlete gets criticized because in important events, such as the Olympics, she has a losing record of 8 wins and 9 losses. Perform a Chi-square test to determine if this losing record can be simply due to chance as opposed to not performing well under pressure.
+1\. A famous athlete boasts an impressive career, winning 70% of her 500 career matches. Nevertheless, this athlete is criticized because in important events, such as the Olympics, she has a losing record of 8 wins and 9 losses. Perform a Chi-square test to determine if this losing record can be simply due to chance as opposed to not performing well under pressure.

2\. Why did we use the Chi-square test instead of Fisher's exact test in the previous exercise?

@@ -316,18 +316,18 @@ d. Because the Chi-square test runs faster.

3\. Compute the odds ratio of "losing under pressure" along with a confidence interval.

-4\. Notice that the p-value is larger than 0.05 but the 95% confidence interval does not include 1. What explains this?
+4\. Notice that the p-value is larger than 0.05, but the 95% confidence interval does not include 1. What explains this?

a. We made a mistake in our code.
b. These are based on t-statistics so the connection between p-value and confidence intervals does not apply.
-c. Different approximations are used for the p-value and the confidence interval calculation. If we had a larger sample size the match would be better.
+c. Different approximations are used for the p-value and the confidence interval calculation. If we had a larger sample size, the match would be better.
d. We should use the Fisher exact test to get confidence intervals.

5\. Multiply the two-by-two table by 2 and see if the p-value and confidence interval are a better match.

-6\.
Use the `research_funding_rates` data to estimate the log odds ratio along and standard errors comparing women to men for each discipline. Compute a confidence interval and report all the disciplines for which one gender appears to be favored over the other.
+6\. Use the `research_funding_rates` data to estimate the log odds ratios, along with their standard errors, comparing women to men for each discipline. Compute a confidence interval and report all the disciplines for which one gender appears to be favored over the other.

-7\. Divide the log odds ratio estimates by their respective standard errors and generate a qq-plot comparing these to a standard normal. Do any of the discplines deviate from what is expected by chance?
+7\. Divide the log odds ratio estimates by their respective standard errors and generate a qq-plot comparing these to a standard normal. Do any of the disciplines deviate from what is expected by chance?

8\. During the 2016 US presidential election, then-candidate Donald J. Trump used his twitter account as a way to communicate with potential voters.

Compute an odds ratio comparing Android to iPhone for each sentiment and add it

10\. Generate a plot showing the estimated odds ratios along with their confidence intervals.

-11\. Test the null hypothesis that there is no difference between tweets from Android and iPhone and report the sentiments with p-values less than 0.05 and more likely to come from Android.
+11\. Test the null hypothesis that there is no difference in sentiment between tweets from Android and iPhone, and report the sentiments with p-values less than 0.05 that are more likely to come from Android.

12\. For each sentiment, find the words assigned to that sentiment, keep words that appear at least 25 times, compute the odds ratio for each, and show a barplot for those with odds ratio larger than 2 or smaller than 1/2.

diff --git a/linear-models/measurement-error-models.qmd b/linear-models/measurement-error-models.qmd
index 95a43b2..d35c784 100644
--- a/linear-models/measurement-error-models.qmd
+++ b/linear-models/measurement-error-models.qmd
@@ -1,6 +1,6 @@
# Measurement error models

-Another major application of linear models arises in measurement errors models. In these scenarios, it is common to have a non-random covariate, such as time, and randomness is introduced from measurement error rather than sampling or natural variability.
+Another major application of linear models occurs in measurement error models. In these situations, non-random covariates, such as time, are frequently encountered, with randomness often arising from measurement errors rather than from sampling or inherent natural variability.

To understand these models, imagine you are Galileo in the 16th century trying to describe the velocity of a falling object. An assistant climbs the Tower of Pisa and drops a ball, while several other assistants record the position at different times. Let's simulate some data using the equations we currently know and adding some measurement error. The **dslabs** function `rfalling_object` generates these simulations:

@@ -33,7 +33,7 @@
$$
Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i, i=1,\dots,n
$$

-with $Y_i$ representing distance in meters, $x_i$ representing time in seconds, and $\varepsilon$ accounting for measurement error. The measurement error is assumed to be random, independent from each other, and having the same distribution for each $i$.
FIX We also assume that there is no bias, which means the expected value $\mbox{E}[\varepsilon] = 0$.
+with $Y_i$ representing distance in meters, $x_i$ representing time in seconds, and $\varepsilon$ accounting for measurement error. The measurement errors are assumed to be random, independent of each other, and to have the same distribution for each $i$. We also assume that there is no bias, which means the expected value $\mbox{E}[\varepsilon] = 0$.

Note that this is a linear model because it is a linear combination of known quantities ($x$ and $x^2$ are known) and unknown parameters (the $\beta$s are unknown parameters to Galileo). Unlike our previous examples, here $x$ is a fixed quantity; we are not conditioning.

diff --git a/linear-models/treatment-effect-models.qmd b/linear-models/treatment-effect-models.qmd
index 46bd31f..6d5754f 100644
--- a/linear-models/treatment-effect-models.qmd
+++ b/linear-models/treatment-effect-models.qmd
@@ -10,12 +10,13 @@ library(broom)
library(dslabs)
```

-Up to now, all our linear models have been applied to two or more continuous random variables. We assume the random variables are multivariate normal and use this to motivate a linear model. This approach covers many real-life examples of linear regression. However, linear models have many other applications. One of the most popular is to quantify treatment effects in randomized and controlled experiments. One of the first applications was in agriculture, where different plots of lands were treated with different combinations of fertilizers to try to determine if they were effective. In fact the use of $Y$ for the outcome in Statistics, is due to the mathematical theory being developed for crop _yield_ as the outcome.
+Up to now, all our linear models have been applied to two or more continuous random variables. We assume the random variables are multivariate normal and use this to motivate a linear model. This approach covers many real-life examples of linear regression. However, linear models have many other applications. One of the most popular is to quantify treatment effects in randomized and controlled experiments. One of the first applications was in agriculture, where different plots of land were treated with different combinations of fertilizers to try to determine if they were effective. In fact, the use of $Y$ for the outcome in statistics is due to the mathematical theory being developed for crop _yield_ as the outcome.

-Since, the same ideas have been applied in other areas, such as randomized trials developed to determine if drugs cure or prevent a diseases or if policies have an effect on social or educational outcomes. In the latter example, we think of the policy intervention as a _treatment_ and follow the same mathematical procedure. The analyses used in _A/B testing_, widely used today by internet companies, are based on treatment effects models.
-Furthermore, the use of these models has been extended to observational studies where analysts attempt to use linear models to estimate effects of interest while accounting for potential confounders. For example, to estimate the effect of a diet high on fruits and vegetables on blood pleasure, we would have to adjust for _factors_ such as age, sex, and smoking status.

+Since then, the same ideas have been applied in other areas, such as randomized trials developed to determine if drugs cure or prevent diseases or if policies have an effect on social or educational outcomes.
In the latter example, we think of the policy intervention as a _treatment_ and follow the same mathematical procedure. The analyses used in _A/B testing_, widely used today by internet companies, are based on treatment effects models. -In this chapter we consider an experiment testing for the effect of a high-fat diet on mouse physiology. Mice were selected and divided at random into two groups, one group receiving a high-fat diet, considered the _treatment_ and the other group left as control and receiving the usual _chow_ diet. The data is included in the **dslabs** package: +Moreover, these models have been applied in observational studies where analysts attempt to use linear models to estimate effects of interest while accounting for potential confounders. For example, to estimate the effect of a diet high in fruits and vegetables on blood pressure, we would have to adjust for _factors_ such as age, sex, and smoking status. + +In this chapter, we consider an experiment designed to test for the effects of a high-fat diet on mouse physiology. Mice were randomly selected and divided into two groups: one group receiving a high-fat diet, considered the _treatment_, while the other group served as the control and received the usual _chow_ diet. The data is included in the **dslabs** package: ```{r} library(dslabs) @@ -28,7 +29,7 @@ A boxplot shows that the high fat diet mice are, on average, heavier. with(mice_weights, boxplot(body_weight ~ diet)) ``` -But, given that we divided the mice at random, is it possible the observed difference is simply due to chance? Here, we can compute the sample average and standard deviation of each group and perform statistical inference on the difference of these means, similar to what we did for election forecasting in Chapters @sec-hypothesis-testing and @sec-models. +However, given that we divided the mice randomly, is it possible that the observed difference is simply due to chance? Here, we can compute the sample average and standard deviation of each group and perform statistical inference on the difference of these means, similar to our approach for election forecasting in @sec-hypothesis-testing and @sec-models. ## Comparing group means @@ -39,16 +40,16 @@ library(tidyverse) mice_weights |> group_by(diet) |> summarize(average = mean(body_weight)) ``` -But this is a random sample of mice and the assignment to the diet group is also random. So is this difference due to chance? We will use hypothesis testing, first described in Chapter @sec-hypothesis-testing, to answer this question. +However, this is a random sample of mice, and the assignment to the diet group is also done randomly. So is this difference due to chance? We will use hypothesis testing, first described in @sec-hypothesis-testing, to answer this question. -Denote with $\mu_1$ and $\sigma_1$ the weight average and standard deviation we would observe if the entire population of mice were on the high-fat diet. Define $\mu_0$ and $\sigma_0$ similarly for the chow diet. Define $N_1$ and $N_0$ as the sample sizes, let's call them $\bar{X}_1$ and $\bar{X}_0$ as the sample averages, and $s_1$ and $s_0$ the sample standard deviations for the for the high-fat and chow diets, respectively. Because this is a random sample the central limit theorem tells us that the difference in averages $bar{X}_1 - \bar{X}_0$ follows a normal distribution with expected value $\mu_1-\mu_0$ and standard error $\sqrt{\frac{s_1^2}{N_1} + \frac{s_0^2}{N_0}}$. 
-Denote with $\mu_1$ and $\sigma_1$ the weight average and standard deviation we would observe if the entire population of mice were on the high-fat diet. Define $\mu_0$ and $\sigma_0$ similarly for the chow diet. Define $N_1$ and $N_0$ as the sample sizes, let's call them $\bar{X}_1$ and $\bar{X}_0$ as the sample averages, and $s_1$ and $s_0$ the sample standard deviations for the for the high-fat and chow diets, respectively. Because this is a random sample the central limit theorem tells us that the difference in averages $bar{X}_1 - \bar{X}_0$ follows a normal distribution with expected value $\mu_1-\mu_0$ and standard error $\sqrt{\frac{s_1^2}{N_1} + \frac{s_0^2}{N_0}}$. If we define the null hypothesis as the high-fat diet having no effect, or $\mu_1 - \mu_0 = 0$, the the following summary statistic
+Let $\mu_1$ and $\sigma_1$ represent the average weight and standard deviation, respectively, that we would observe if the entire population of mice were on the high-fat diet. Define $\mu_0$ and $\sigma_0$ similarly for the chow diet. Let $N_1$ and $N_0$ be the sample sizes, $\bar{X}_1$ and $\bar{X}_0$ the sample averages, and $s_1$ and $s_0$ the sample standard deviations for the high-fat and chow diets, respectively. Since this is a random sample, the central limit theorem tells us that the difference in averages $\bar{X}_1 - \bar{X}_0$ follows a normal distribution, with expected value $\mu_1-\mu_0$ and standard error $\sqrt{\frac{s_1^2}{N_1} + \frac{s_0^2}{N_0}}$. If we define the null hypothesis as the high-fat diet having no effect, or $\mu_1 - \mu_0 = 0$, the following summary statistic:

$$
t = \frac{\bar{X}_1 - \bar{X}_0}{\sqrt{\frac{s_1^2}{N_1} + \frac{s_0^2}{N_0}}}
$$

-follows a standard normal distribution when the null hypothesis is true, which implies we can easily compute the probability of observing a value as large as the one we did:
+follows a standard normal distribution when the null hypothesis is true. This implies that we can easily compute the probability of observing a value as large as the one we obtained:

```{r}
stats <- mice_weights |> group_by(diet) |> summarize(xbar = mean(body_weight), s = sd(body_weight), n = n())
@@ -58,58 +59,58 @@ t_stat
Here $t$ is well over 3, so we don't really need to compute the p-value `1-pnorm(t_stat)` as we know it will be very small.

-Note that when $N$ is not large, then the CLT does not apply. However, if the outcome data, in this case weight, follows a normal distribution, then $t$ follows a t-distribution with $N_1+N_2-2$ degrees of freedom. So the calculation of the p-value is the same except we use `1-pt(t_stat, with(stats, n[2]+n[1]-2)` to compute the p-value.
+Note that when $N$ is not large, the CLT does not apply. However, if the outcome data, in this case weight, follows a normal distribution, then $t$ follows a t-distribution with $N_1+N_0-2$ degrees of freedom. So the calculation of the p-value is the same, except that we use `1-pt(t_stat, with(stats, n[2]+n[1]-2))`.

-Because using differences in mean are so common in scientific studies, this _t-statistic_ is one of the most widely reported summaries. When use it in a hypothesis testing setting, it is referred to a performing a _t test_.
+Given the common use of differences in means in scientific studies, this _t-statistic_ is one of the most widely reported summaries. When we use it in a hypothesis testing setting, it is referred to as performing a _t test_.
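As a quick check, the same calculation can be done with base R's `t.test` function. By default it performs the Welch version, which allows different standard deviations in the two groups, and its $t$ matches ours up to a sign change due to the ordering of the factor levels:

```{r}
# Welch two-sample t-test comparing the two diet groups;
# t.test subtracts the group means in the opposite order, so the sign flips
t.test(body_weight ~ diet, data = mice_weights)
```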
:::{.callout-warning}
-In the computation above we computed the probability of `t` being as large as what we observed. However, when we are equally interested in both directions, for example, either an increase or decrease in weight, then we need to compute the probability of `t` being _as extreme_ as what we observe. The formula simply changes to using the absolute value: `1 - pnorm(abs(t-test))` or `1-pt(t_stat, with(stats, n[2]+n[1]-2)`.
+In the computation above, we computed the probability of `t` being as large as what we observed. However, when our interest spans both directions, for example, either an increase or decrease in weight, we need to compute the probability of `t` being _as extreme_ as what we observe. The computation changes to use the absolute value and to account for both tails: `2*(1 - pnorm(abs(t_stat)))` or `2*(1 - pt(abs(t_stat), with(stats, n[2]+n[1]-2)))`.
:::

## One factor design

-Although the t-test is useful for cases in which we only account for two treatments, it is common to have other variables affect our outcomes. Linear models permit hypothesis testing in more general situations. We start the description of the use linear models for estimating treatment effects by demonstrating how they can be used to perform t-tests.
+Although the t-test is useful for cases in which we only account for two treatments, it is common to have other variables affect our outcomes. Linear models permit hypothesis testing in these more general situations. We start the description of the use of linear models for estimating treatment effects by demonstrating how they can be used to perform t-tests.

If we assume that the weight distributions for both chow and high-fat diets are normally distributed, we can write the following linear model to represent the data:

$$
Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
$$

-with $X_i$ 1 if the $i$-th mice was fed the high-fat diet and 0 otherwise and the errors $\varepsilon_i$ independent and normally distributed with expected value 0 and standard deviation $\sigma$. Note that this mathematical formula looks exactly like the model we wrote out for the father-son heights. However, the fact that $x_i$ is now 0 or 1 rather than a continuous variable, permits us to use it in this different context. In particular notice that now $\beta_0$ represents the population average height of the mice on the chow diet and $\beta_0 + \beta_1$ represents the population average for the weight of the mice on the high-fat diet.
+with $x_i$ equal to 1 if the $i$-th mouse was fed the high-fat diet and 0 otherwise, and the errors $\varepsilon_i$ independent and normally distributed with expected value 0 and standard deviation $\sigma$. Note that this mathematical formula looks exactly like the model we wrote out for the father-son heights. However, the fact that $x_i$ is now 0 or 1, rather than a continuous variable, allows us to use it in this different context. In particular, notice that now $\beta_0$ represents the population average weight of the mice on the chow diet and $\beta_0 + \beta_1$ represents the population average weight of the mice on the high-fat diet.

-A nice feature of this model is that $\beta_1$ represents the _treatment effect_ of receiving the high-fat diet. If the null hypothesis that the high-fat diet has no effect can be quantified as $\beta_1 = 0$. We can then estimate $\beta_1$ and answer the question of weather or not the observed difference is real by computing the estimates being as large as it was under the null. So how do we estimate $\beta_1$ and a standard error for the estimate?
+A nice feature of this model is that $\beta_1$ represents the _treatment effect_ of receiving the high-fat diet. The null hypothesis that the high-fat diet has no effect can be quantified as $\beta_1 = 0$. We can then estimate $\beta_1$ and answer the question of whether or not the observed difference is real by computing the probability of an estimate being as large as the one observed under the null hypothesis. So how do we estimate $\beta_1$ and a standard error for the estimate?
-A powerful characteristics of linear models is that we can can estimate the parameters $\beta$s and their standard errors with the same LSE machinery:
+A powerful characteristic of linear models is that we can estimate the parameters $\beta$s and their standard errors with the same LSE machinery:

```{r}
fit <- lm(body_weight ~ diet, data = mice_weights)
```

-Because `diet` is a factor with two entries, the `lm` function knows to fit the model above with a $x_i$ a indicator variable. The `summary` function shows us the resulting estimate, standard error, and p-value:
+Because `diet` is a factor with two levels, the `lm` function knows to fit the model above with $x_i$ as an indicator variable. The `summary` function shows us the resulting estimate, standard error, and p-value:

```{r}
coefficients(summary(fit))
```

-or using `broom` we can write:
+or, using `broom`, we can write:

```{r}
library(broom)
tidy(fit, conf.int = TRUE) |> filter(term == "diethf")
```

-The `statistic` computed here is the estimate divided by its standard error: $\hat{\beta}_1 / \hat{\mbox{SE}}(\hat{\beta}_1)$. In the case for the simple one-factor model, we can show that this statistic is almost equivalent to the t-test. Intuitively it makes since both $\hat{\beta_1}$ and the numerator of the t-test are estimates of the treatment effect. In fact, we can see that we obtain a number similar to the $t$ computed in the previous section.
+The `statistic` computed here is the estimate divided by its standard error: $\hat{\beta}_1 / \hat{\mbox{SE}}(\hat{\beta}_1)$. In the case of the simple one-factor model, we can show that this statistic is almost equivalent to the t-test. Intuitively, this makes sense, as both $\hat{\beta}_1$ and the numerator of the t-test are estimates of the treatment effect. In fact, we can observe that the obtained number is similar to the $t$ computed in the previous section.

```{r}
c(coefficients(summary(fit))[2,3], t_stat)
```

-One minor difference is that the linear model does not assume a different standard deviation for each population. Instead, both populations share $\mbox{SD}(\varepislon)$ as a standard deviation. Note that, although we do not show how to do it with R here, we can redefine the linear model to have different standard errors for each group.
+One minor difference is that the linear model does not assume a different standard deviation for each population. Instead, both populations share $\mbox{SD}(\varepsilon)$ as a standard deviation. Note that we can redefine the linear model to have a different standard deviation for each group, as sketched below.
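Although the details are beyond our scope, here is a sketch of one way to fit such a model, using the `gls` function from the **nlme** package (a package we do not otherwise use) with a separate variance parameter for each diet group:

```{r}
#| eval: false
library(nlme)
# Generalized least squares: the varIdent structure allows each
# diet group to have its own residual standard deviation
fit_gls <- gls(body_weight ~ diet, data = mice_weights,
               weights = varIdent(form = ~ 1 | diet))
summary(fit_gls)
```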
:::{.callout-note}
-In the linear model description provided here we assumed $\varepsilon$ follows a normal distribution. This assumption permits us to show that the statistics formed by dividing estimates by their estimated standard errors follow t-distribution, which in turn permits us to estimate p-values or confidence intervals. However, note that we do not need this assumption to compute the expected value and standard error of the least squared estimates. Furthermore, if the number of observations in large enough, then the central limit theorem applies and we can obtain p-values and confidence intervals even without the normal distribution assumption.
+In the linear model description provided here, we assumed $\varepsilon$ follows a normal distribution. This assumption permits us to show that the statistics formed by dividing estimates by their estimated standard errors follow a t-distribution, which in turn allows us to estimate p-values or confidence intervals. However, note that we do not need this assumption to compute the expected value and standard error of the least squares estimates. Furthermore, if the number of observations is large enough, then the central limit theorem applies and we can obtain p-values and confidence intervals even without the normal distribution assumption.
:::

## Two factor designs

@@ -120,7 +121,7 @@ Note that this experiment included male and female mice, and male mice are known
boxplot(fit$residuals ~ mice_weights$sex)
```

-This misspecification can have real implications since if more male mice received the high-fat diet, then this could explain the increase. Or if less received it, then we might underestimate the diet effect. Sex might be a confounder. Our model can certainly be improved.
+This misspecification can have real implications; for instance, if more male mice received the high-fat diet, then this could explain the increase. Conversely, if fewer received it, we might underestimate the diet effect. Sex could be a confounder, indicating that our model can certainly be improved.

From examining the data:

@@ -128,24 +129,24 @@ mice_weights |> ggplot(aes(diet, log2(body_weight), fill = sex)) + geom_boxplot()
```

-we see that there diet effect is observed for both sexes and that males are heavier than females. Although not nearly as obvious, it also appears the diet effect is stronger in males. A linear models that permits a different expected value four groups, 1) female on chow diet, 2) females on high-fat diet, 3) male on chow diet, and 4)males on high-fat diet,
+we see that the diet effect is observed for both sexes and that males are heavier than females. Although not nearly as obvious, it also appears the diet effect is stronger in males. A linear model that permits a different expected value for each of the four groups, 1) females on the chow diet, 2) females on the high-fat diet, 3) males on the chow diet, and 4) males on the high-fat diet, can be written as:

$$
Y_i = \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,3} + \beta_4 x_{i,4} + \varepsilon_i
$$

-with the $x_i$s indicator variables for each of the four groups. However, with this representation, none of the $\beta$s represent the effect of interest: the diet effect. Furthermore, we now are accounting for the possibility that the diet effect is different for males and females have a different, and can test that hypothesis as well.
+with the $x_i$s indicator variables for each of the four groups. However, with this representation, none of the $\beta$s represent the effect of interest: the diet effect. Furthermore, we are now accounting for the possibility that the diet effect is different for males and females, and we can test that hypothesis as well.

-A powerful feature of linear models is that we can rewrite the model so that we still have a different expected value for each group, but the parameters represent effects we are interested. So, for example, in the representation
+A powerful feature of linear models is that we can rewrite the model so that we still have a different expected value for each group, but the parameters represent the effects we are interested in.
So, for example, in the representation:

$$
Y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \beta_3 x_{i,1} x_{i,2} + \varepsilon_i
$$

-with $x_{i,1}$ and indicator that is one if you have the treatment and $x_{i,2}$ an indicator that is one if you are male, the $\beta_1$ can be interpreted as the treatment effect for females, $\beta_2$ as the difference between males and females, and $\beta_3$ the added treatment effect for males. In Statistics, this last effect is referred to as an _interaction_ effect. The $\beta_0$ is consider the baseline value which is the average weight of females on the chow diet.
+with $x_{i,1}$ an indicator that is one if you received the treatment and $x_{i,2}$ an indicator that is one if you are male, $\beta_1$ can be interpreted as the treatment effect for females, $\beta_2$ as the difference between males and females, and $\beta_3$ as the added treatment effect for males. In statistics, this last effect is referred to as an _interaction_ effect. The $\beta_0$ is considered the baseline value, which is the average weight of females on the chow diet.

Statistical textbooks describe several other ways in which the model can be rewritten to obtain other types of interpretations. For example, we might want $\beta_1$ to represent the average treatment effect across females and males, rather than the treatment effect for females. This is achieved by defining the _contrasts_ we are interested in.

-In R we can specific this model using the following
+In R, we can specify this model using the following:

```{r}
fit <- lm(body_weight ~ diet*sex, data = mice_weights)
@@ -160,35 +161,35 @@ tidy(fit, conf.int = TRUE) |> filter(!str_detect(term, "Intercept"))
```

Note that the male effect is larger than the diet effect, and the diet effect is statistically significant for both sexes, with the males having a higher effect by between 1 and 4.5 grams.

-A common approach applied when more than one factor is thought to affect the measurement is to simply include an additive effect for each factor like this:
+A common approach applied when more than one factor is thought to affect the measurement is to simply include an additive effect for each factor, like this:

$$
Y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \varepsilon_i
$$

-In this model, the $\beta_1$ is a general diet effect that applies regardless of sex. In R we use the following code using a `+` instead of `*`:
+In this model, the $\beta_1$ is a general diet effect that applies regardless of sex. In R, we use the following code, employing a `+` instead of `*`:

```{r}
fit <- lm(body_weight ~ diet + sex, data = mice_weights)
```

-Because their a strong interaction effect, a diagnostic plots shows that the residuals are biased: the average negative for females on the diet and positive for the males on the diet, rather than 0.
+Because there is a strong interaction effect, a diagnostic plot shows that the residuals are biased: the average is negative for females on the diet and positive for males on the diet, rather than 0.

```{r lm-diagnostic-plot}
plot(fit, which = 1)
```
-Scientific studies, particularly within epidemiology and social sciences, frequently omit interaction terms from models due to the high number of variables. Adding interactions necessitates numerous parameters, which, in extreme cases, may prevent the model from fitting. However, this approach assumes that the interaction terms are zero, which, if incorrect, can skew the results' interpretation. Conversely, when this assumption is valid, models excluding interactions are simpler to interpret as parameters are typically viewed as the extent to which the outcome increases with the assigned treatment.
+Scientific studies, particularly within epidemiology and social sciences, frequently omit interaction terms from models due to the high number of variables. Adding interactions necessitates numerous parameters, which in extreme cases may prevent the model from fitting. However, this approach assumes that the interaction terms are zero, and if incorrect, it can skew the interpretation of the results. Conversely, when this assumption is valid, models excluding interactions are simpler to interpret, as parameters are typically viewed as the extent to which the outcome increases with the assigned treatment.

:::{.callout-tip}
-Linear models are very flexible and can be applied in many contexts. For example, we can include many more factors than 2. We have just scratched the surface of how linear models can be used to estimate treatment effects. We highly recommend learning more about this through linear model textbooks and R manuals on using the `lm`, `contrasts`, and `model.matrix` functions.
+Linear models are highly flexible and applicable in many contexts. For example, we can include many more factors than just 2. We have only just scratched the surface of how linear models can be used to estimate treatment effects. We highly recommend learning more about this by exploring linear model textbooks and R manuals that cover the use of functions such as `lm`, `contrasts`, and `model.matrix`.
:::
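As a first taste of `model.matrix`, here is a quick illustration of the design matrix that `lm` builds internally for the interaction model `body_weight ~ diet*sex` fit above; each column corresponds to one of the indicator variables in our equations:

```{r}
# One row per mouse: intercept, diet indicator, sex indicator, and their product
head(model.matrix(~ diet*sex, data = mice_weights))
```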
## Contrasts

-In the examples we have examined, each treatment had only two group: diet and chow and high-fat and sex had female and male. However, variables of interest often have more than one level. For example, we might have tested a third diet on the mice. In Statistics textbooks these variables are referred to as _factor_ and the groups in each factor are called its _levels_.
+In the examples we have examined, each variable had only two levels: diet was either chow or high-fat, and sex was either female or male. However, variables of interest often have more than two levels. For example, we might have tested a third diet on the mice. In statistics textbooks, these variables are referred to as _factors_, and the groups in each factor are called its _levels_.

-When a factor is included in the formula, the default behavior for `lm` is to define the intercept term as the expected value for the first level and the other coefficient are to represent the difference, or _contrast_ between the other levels and first. We can see this if we fit estimate the sex effect with `lm`:
+When a factor is included in the formula, the default behavior for `lm` is to define the intercept term as the expected value for the first level, while the other coefficients represent the difference, or _contrast_, between each of the other levels and the first. We can see this if we estimate the sex effect with `lm`:

```{r}
@@ -196,13 +197,13 @@ fit <- lm(body_weight ~ sex, data = mice_weights)
coefficients(fit)
```

-To recover the expected mean for males we can simply add the two coefficients:
+To recover the expected mean for males, we can simply add the two coefficients:

```{r}
sum(fit$coefficients[1:2])
```

-The package **emmeans** makes this calculation simple and it also calculates standard errors:
+The package **emmeans** simplifies the calculation and also computes standard errors:

```{r}
library(emmeans)
@@ -215,9 +216,9 @@ Now, what if we really didn't want to define a reference level? What if we wante
$$
Y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \varepsilon_i
$$

-with $x_{i,1} = 1$ if observation $i$ is female and 0 otherwise, ...
+with $x_{i,1} = 1$ if observation $i$ is female and 0 otherwise, and $x_{i,2}=1$ if observation $i$ is male and 0 otherwise? Unfortunately, this representation has a problem. Note that the means for females and males are represented by $\beta_0 + \beta_1$ and $\beta_0 + \beta_2$, respectively. This is a problem because the expected value for each group is just one number, say $\mu_f$ and $\mu_m$, and there is an infinite number of ways $\beta_0 + \beta_1 = \mu_f$ and $\beta_0 +\beta_2 = \mu_m$ (three unknowns with two equations). This implies that we can't obtain unique least squares estimates. In statistics, we say the model, or parameters, are _unidentifiable_. The default behavior in R solves this problem by requiring $\beta_1 = 0$, forcing $\beta_0 = \mu_f$, which permits us to solve the system of equations.

-Notice that this is not the only constrain that permits estimation of the parameters. Any linear constrain will do as it adds a third equation to our system. A widely used constrain is to require $\beta_1 + \beta_2 = 0$. To achieve this in R, we can use the argument `contrast` in the following way:
+Keep in mind that this is not the only constraint that permits estimation of the parameters. Any linear constraint will do, as it adds a third equation to our system. A widely used constraint is to require $\beta_1 + \beta_2 = 0$. To achieve this in R, we can use the argument `contrasts` in the following way:

```{r}
@@ -225,15 +226,15 @@ fit <- lm(body_weight ~ sex, data = mice_weights, contrasts = list(sex = contr.s
coefficients(fit)
```

-We see that the intercept is now larger because it now represents the overall mean, rather than the mean for females. The other coefficient represents the contrast between females and the overall mean, $\beta_1$ in our model. The coefficient for men is not shown because it is redundant: $\beta_1= -\beta_2$.
+We see that the intercept is now larger, reflecting the overall mean rather than just the mean for females.
The other coefficient, $\beta_1$, represents the contrast between females and the overall mean in our model. The coefficient for males is not shown because it is redundant: $\beta_1= -\beta_2$.

-If we want to see all the estimates, the package **emmeans** also makes the calculations for us:
+If we want to see all the estimates, the **emmeans** package also makes the calculations for us:

```{r}
contrast(emmeans(fit, ~sex))
```

-The use of this alternative constrain is more practical when a factor has more than one level and choosing an baseline becomes less convenient. Furthermoe, we might be more interested in the variance of the coefficients rather than contrasts between groups and the reference level.
+The use of this alternative constraint is more practical when a factor has more than two levels and choosing a baseline becomes less convenient. Furthermore, we might be more interested in the variance of the coefficients rather than the contrasts between groups and the reference level.

As an example, consider that the mice in our dataset are actually from several generations:

@@ -242,7 +243,7 @@
table(mice_weights$gen)
```

-To estimate the variability due to the different generations, a convinient model is:
+To estimate the variability due to the different generations, a convenient model is:

$$
Y_i = \beta_0 + \sum_{j=1}^J \beta_j x_{i,j} + \varepsilon_i
$$

@@ -254,27 +255,27 @@ $$
\frac{1}{J} \sum_{j=1}^J \beta_j = 0 \implies \sum_{j=1}^J \beta_j = 0
$$

-This constrain makes the model identifiable and also let's us quantify the variability due to generations with
+This constraint makes the model identifiable and also allows us to quantify the variability due to generations with:

$$
\sigma^2_{\text{gen}} = \frac{1}{J}\sum_{j=1}^J \beta_j^2
$$

-We can see the estimated coefficients using:
+We can see the estimated coefficients using the following:

```{r}
fit <- lm(body_weight ~ gen, data = mice_weights, contrasts = list(gen = contr.sum))
contrast(emmeans(fit, ~gen))
```
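To connect this output with the formula above, here is a sketch of how the plug-in estimate of $\sigma^2_{\text{gen}}$ can be computed from the `fit` object; recall that, under the sum constraint, the effect for the last generation is minus the sum of the others:

```{r}
b <- coef(fit)[-1]  # effects for the first J - 1 generations
b <- c(b, -sum(b))  # the sum constraint determines the last effect
mean(b^2)           # plug-in estimate of the variability due to generations
```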
-In the next section we briefly describe a technique useful to study the variability associated with this factors.
+In the next section, we briefly describe a technique useful for studying the variability associated with these factors.

## Analysis of variance (ANOVA) {#sec-anova}

-When a factor has more than one level it is common to want to know if there is important variability across the levels rather than specific difference between any given pair of levels. Analysis of variances (ANOVA) provides tools to do this.
+When a factor has more than one level, it is common to want to determine if there is significant variability across the levels, rather than specific differences between any given pair of levels. Analysis of variance (ANOVA) provides tools to do this.

-ANOVA provides an estimate of $\sigma^2_{\text{gen}}$ and statistical approach to test the null hypothesis that the factor contributes no variability: $\sigma^2_{\text{gen}} =0$.
+ANOVA provides an estimate of $\sigma^2_{\text{gen}}$ and a statistical test for the null hypothesis that the factor contributes no variability: $\sigma^2_{\text{gen}} =0$.

Once a linear model is fit using one or more factors, the R `aov` function can be used to perform ANOVA. Specifically, the estimate of the factor variability is computed along with a statistic that can be used for hypothesis testing:

@@ -282,25 +283,25 @@
summary(aov(fit))
```

-Note that if given a model formula `aov` will fit the model:
+Keep in mind that if given a model formula, `aov` will fit the model:

```{r}
#| eval: false
summary(aov(body_weight ~ gen, data = mice_weights))
```

-We do not need to specify the constrain because ANOVA needs tto constrain the sum to be 0 for the results to be interpretable.
+We do not need to specify the constraint because ANOVA needs to constrain the sum to be 0 for the results to be interpretable.

-This analysis points that generation not being significant.
+This analysis indicates that generation is not statistically significant.

:::{.callout-note}
-We do note include many details, for example on how the summary statistics and p-values shown by `aov` are defined and motivated. There are several books on analysis of variance and textbooks on linear models often include chapters on this topic. Those interested in learning more about these topics can consult these textbooks.
+We do not include many details, for example, on how the summary statistics and p-values shown by `aov` are defined and motivated. There are several books dedicated to the analysis of variance, and textbooks on linear models often include chapters on this topic. Those interested in learning more about these topics can consult these textbooks.
:::

### Multiple factors

-ANOVA was developed to analyze agricultural data, which usually included several factors such as treatment, blocks, and breeds.
+ANOVA was developed to analyze agricultural data, which typically included several factors such as treatment, blocks, and breeds.

We can perform ANOVA with multiple factors:

@@ -309,40 +310,40 @@
summary(aov(body_weight ~ sex + diet + gen, data = mice_weights))
```

-This analysis points at sex being the biggest source of variability, which is consistent with previously made exploratory plots.
+This analysis suggests that sex is the biggest source of variability, which is consistent with the exploratory plots we made previously.

:::{.callout-warning}
-One of the key aspects of ANOVA (Analysis of Variance) is its ability to decompose the total variance in the data, represented by $\sum_{i=1}^n Y_i^2$, into individual contributions attributable to each factor in the study. However, for the mathematical underpinnings of ANOVA to be valid, the experimental design must be balanced. This means that for every level of any given factor, there must be an equal representation of the levels of all other factors. In our study involving mice, the design is unbalanced, which necessitates a cautious approach to the interpretation of the ANOVA results.
+One of the key aspects of ANOVA (Analysis of Variance) is its ability to decompose the total variance in the data, represented by $\sum_{i=1}^n Y_i^2$, into individual contributions attributable to each factor in the study. However, for the mathematical underpinnings of ANOVA to be valid, the experimental design must be balanced. This means that for every level of any given factor, there must be an equal representation of the levels of all other factors. In our study involving mice, the design is unbalanced, requiring a cautious approach in the interpretation of the ANOVA results.
:::

### Array representation

-When the model includes more than one factor, writing down linear models can become cumbersome. For example, in our two factor model we would have to include indicator variables for both factors:
+When the model includes more than one factor, writing down linear models can become cumbersome. For example, in our two factor model, we would have to include indicator variables for both factors:

$$
Y_i = \beta_0 + \sum_{j=1}^J \beta_j x_{i,j} + \sum_{k=1}^K \beta_{J+k} x_{i,J+k} + \varepsilon_i
$$

-with the $x_{i,1},\dots,x_{i,J}$ indicator functions for the $J$ levels in the first factor and $x_{i,J+1},\dots,x_{i,J+K}$ indicator functions for the $K$ levels in the second factor. Note we would apply a constrain to assure identifiability.
+with the $x_{i,1},\dots,x_{i,J}$ indicator functions for the $J$ levels in the first factor and $x_{i,J+1},\dots,x_{i,J+K}$ indicator functions for the $K$ levels in the second factor. Remember that we would apply a constraint to ensure identifiability.

-An alternative approach, widely used in ANOVA, to avoid indicators variables, is to save the data in an array, then using different Greek letters to denote factors and indices to denote levels:
+An alternative approach, widely used in ANOVA to avoid indicator variables, is to save the data in an array, using different Greek letters to denote factors and indices to denote levels:

$$
Y_{i,j,k} = \mu + \alpha_j + \beta_k + \varepsilon_{i,j,k}
$$

-with $\mu$ the overall mean ... The constrain can be written as
+with $\mu$ the overall mean, $\alpha_j$ the effect of level $j$ in the first factor, and $\beta_k$ the effect of level $k$ in the second factor. The constraints can be written as:

$$
\sum_{j=1}^J \alpha_j = 0 \text{ and } \sum_{k=1}^K \beta_k = 0
$$

-This notation lends itself for extimating the effects by computing means across dimensions of the array.
+This notation lends itself to estimating the effects by computing means across the dimensions of the array.
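To see what "computing means across dimensions" looks like in practice, here is a sketch for the sex-by-diet example. In a balanced design these means give the least squares estimates; our mice data are unbalanced, so they should be viewed only as approximations:

```{r}
# Cell means arranged as a 2 x 2 array: rows are sex levels, columns are diet levels
cell_means <- with(mice_weights, tapply(body_weight, list(sex, diet), mean))
mu_hat <- mean(cell_means)                  # overall mean
alpha_hat <- rowMeans(cell_means) - mu_hat  # sex effects
beta_hat <- colMeans(cell_means) - mu_hat   # diet effects
```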
## Exercises

-1\. Once you fit a model, the estimate of the standard error $\sigma$ can be obtained like this:
+1\. Once you fit a model, the estimate of the standard error $\sigma$ can be obtained as follows:

```{r}
#| eval: false
@@ -351,28 +352,28 @@ fit <- lm(body_weight ~ diet, data = mice_weights)
summary(fit)$sigma
```

-Compute the estimate of $\sigma$ using the model that includes just diet and a model that accounts for sex. Are the estimates the same? If not, why not?
+Compute the estimate of $\sigma$ using both the model that includes only diet and a model that accounts for sex. Are the estimates the same? If not, why not?

-2\. One of the assumption of the linear model fit by `lm` is that the standard deviation of the errors $\varepsilon_i$ is equal for all $i$. This implies it does not depend on the expected value. Group the mice by their weight like this:
+2\. One of the assumptions of the linear model fit by `lm` is that the standard deviation of the errors $\varepsilon_i$ is equal for all $i$. This implies that it does not depend on the expected value. Group the mice by their weight like this:

```{r}
breaks <- with(mice_weights, seq(min(body_weight), max(body_weight), 1))
dat <- mutate(mice_weights, group = cut(body_weight, breaks, include.lowest = TRUE))
```

-Compute the average and standard deviation for groups having more than 10 observations and use data exploration to see if this assumption holds?
+Compute the average and standard deviation for groups with more than 10 observations and use data exploration to verify whether this assumption holds.

3\. The dataset also includes a variable indicating which litter the mice came from. Create a boxplot showing weights by litter. Use faceting to make separate plots for each diet and sex combination.

-4\. Use a linear model to test for a litter effect. Account for sex and diet. Use ANOVA to compare the variability explained by litter to other factors.
+4\. Use a linear model to test for a litter effect, taking into account sex and diet. Use ANOVA to compare the variability explained by litter with that of other factors.

-5\. The `mouse_weights` data includes two other outcomes: bone density and percent fat. Make a boxplot showing bone density by sex and diet. Compare what the visualizations shows for the diet effect by sex.
+5\. The `mice_weights` data includes two other outcomes: bone density and percent fat. Create a boxplot illustrating bone density by sex and diet. Compare what the visualizations reveal about the diet effect by sex.

-6. Fit a linear model and test for the diet effect on bone density separately for each sex. Note that the diet effect is statistically significant for females but not for males. Then fit the model to the entire dataset that includes diet, sex and their interaction. Note that the diet effect is significant, yet the interaction effect is not. Explain how this can happen? Hint: To fit a model to the entire dataset that fit a separate effect for males and females you can use the formula ` ~ sex + diet:sex`
+6\. Fit a linear model and conduct a separate test for the diet effect on bone density for each sex. Note that the diet effect is statistically significant for females but not for males. Then fit the model to the entire dataset that includes diet, sex, and their interaction. Notice that the diet effect is significant, yet the interaction effect is not. Explain how this can happen. Hint: To fit a model to the entire dataset with a separate effect for males and females, you can use the formula `~ sex + diet:sex`.

-7\. In Chapter @sec-models, we talked about pollster bias. We used visualization to motivate the presence of such bias. Here we will give it a more rigorous treatment. We will consider two pollsters that conducted daily polls. We will look at national polls for the month before the election.
+7\. In @sec-models, we talked about pollster bias and used visualization to motivate the presence of such bias. Here we will give it a more rigorous treatment. We will consider two pollsters that conducted daily polls. We will look at national polls for the month before the election:

```{r, eval=FALSE}
polls <- polls_us_election_2016 |>
@@ -389,13 +390,13 @@ We want to answer the question: is there a pollster bias? Make a plot showing th
The urn model theory says nothing about pollster effect. Under the urn model, both pollsters have the same expected value: the election day difference, that we call $\mu$.

-To answer the question "is there an urn model?", we will model the observed data $Y_{i,j}$ in the following way:
+To answer the question "is there an urn model?" we will model the observed data $Y_{i,j}$ in the following way:

$$
Y_{i,j} = \mu + b_i + \varepsilon_{i,j}
$$
-with $i=1,2$ indexing the two pollsters, $b_i$ the bias for pollster $i$ and $\varepsilon_ij$ poll to poll chance variability. We assume the $\varepsilon$ are independent from each other, have expected value $0$ and standard deviation $\sigma_i$ regardless of $j$.
+with $i=1,2$ indexing the two pollsters, $b_i$ the bias for pollster $i$, and $\varepsilon_{i,j}$ the poll-to-poll chance variability. We assume the $\varepsilon$ are independent of each other, have expected value $0$, and standard deviation $\sigma_i$ regardless of $j$.

Which of the following best represents our question?

@@ -404,9 +405,9 @@ b. How close are the $Y_{i,j}$ to $\mu$?
c. Is $b_1 \neq b_2$?
d. Are $b_1 = 0$ and $b_2 = 0$ ?

-9\. In the right side of this model only $\varepsilon_{i,j}$ is a random variable. The other two are constants. What is the expected value of $Y_{1,j}$?
+9\. On the right side of this model, only $\varepsilon_{i,j}$ is a random variable; the other two are constants. What is the expected value of $Y_{1,j}$?

-10\. Suppose we define $\bar{Y}_1$ as the average of poll results from the first poll, $Y_{1,1},\dots,Y_{1,N_1}$ with $N_1$ the number of polls conducted by the first pollster:
+10\. Suppose we define $\bar{Y}_1$ as the average of poll results from the first pollster, $Y_{1,1},\dots,Y_{1,N_1}$, where $N_1$ is the number of polls conducted by the first pollster:

```{r, eval=FALSE}
polls |>
@@ -418,7 +419,7 @@ What is the expected values $\bar{Y}_1$?
11\. What is the standard error of $\bar{Y}_1$ ?

-12\. Suppose we define $\bar{Y}_2$ as the average of poll results from the first poll, $Y_{2,1},\dots,Y_{2,N_2}$ with $N_2$ the number of polls conducted by the first pollster. What is the expected value $\bar{Y}_2$?
+12\. Suppose we define $\bar{Y}_2$ as the average of poll results from the second pollster, $Y_{2,1},\dots,Y_{2,N_2}$, where $N_2$ is the number of polls conducted by the second pollster. What is the expected value of $\bar{Y}_2$?

13\. What is the standard error of $\bar{Y}_2$ ?

@@ -435,7 +436,7 @@ b. Because the $Y_{ij}$ are approximately normal, so are the averages.
c. Note that $\bar{Y}_2$ and $\bar{Y}_1$ are sample averages, so if we assume $N_2$ and $N_1$ are large enough, each is approximately normal. The difference of normally distributed variables is also normally distributed.
d. The data are not 0 or 1, so CLT does not apply.

-18\. We have constructed a random variable that has expected value $b_2 - b_1$, the pollster bias difference. If our model holds, then this random variable has an approximately normal distribution and we know its standard error. The standard error depends on $\sigma_1$ and $\sigma_2$, but we can plug the sample standard deviations we computed above. We started off by asking: is $b_2 - b_1$ different from 0? Use all the information we have learned above to construct a 95% confidence interval for the difference $b_2$ and $b_1$.
+18\. We have constructed a random variable that has an expected value of $b_2 - b_1$, representing the difference in pollster bias. If our model holds, then this random variable has an approximately normal distribution, and we know its standard error. The standard error depends on $\sigma_1$ and $\sigma_2$, but we can plug in the sample standard deviations we computed above. We began by asking: is $b_2 - b_1$ different from 0? Using all the information we have gathered above, construct a 95% confidence interval for the difference $b_2 - b_1$.

19\. The confidence interval tells us there is relatively strong pollster effect resulting in a difference of about 5%. Random variability does not seem to explain it.
We can compute a p-value to relay the fact that chance does not explain it. What is the p-value?

@@ -459,5 +460,5 @@ polls <- polls_us_election_2016 |>
ungroup()
```

-Compute the average and standard deviation for each pollster and examine the variability across the averages and how it compares to the variability within the pollsters, summarized by the standard deviation.
+Compute the average and standard deviation for each pollster and examine the variability across the averages. Compare this to the variability within the pollsters, summarized by the standard deviation.