diff --git a/03-tidy.Rmd b/03-tidy.Rmd
index dc0b13ebd..c030ac734 100755
--- a/03-tidy.Rmd
+++ b/03-tidy.Rmd
@@ -213,7 +213,7 @@ As can be seen here when you just enter the name of an object in R, by default i
**_Learning check_**
```
-**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Run the following block of code in R to load and view each of the four data frames in the `nycflights13` package. Switch between the different tabs that have opened to view each of the four data frames. Describe in two sentences for each data frame what stands out to you and what the most important features are of each.
+**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Run the following block of code in RStudio to load and view each of the four data frames in the `nycflights13` package. Switch between the different tabs that have opened to view each of the four data frames. Describe in two sentences for each data frame what stands out to you and what the most important features are of each.
```{r eval=FALSE}
data(weather)
diff --git a/bib/packages.bib b/bib/packages.bib
index 200478393..313ff49d5 100755
--- a/bib/packages.bib
+++ b/bib/packages.bib
@@ -23,9 +23,9 @@ @Manual{R-broom
@Manual{R-devtools,
title = {devtools: Tools to Make Developing R Packages Easier},
author = {Hadley Wickham and Winston Chang},
- note = {R package version 1.12.0.9000},
- url = {https://github.com/hadley/devtools},
year = {2016},
+ note = {R package version 1.12.0},
+ url = {https://CRAN.R-project.org/package=devtools},
}
@Manual{R-dplyr,
title = {dplyr: A Grammar of Data Manipulation},
@@ -37,8 +37,8 @@ @Manual{R-dplyr
@Manual{R-dygraphs,
title = {dygraphs: Interface to 'Dygraphs' Interactive Time Series Charting Library},
author = {Dan Vanderkam and JJ Allaire and Jonathan Owen and Daniel Gromer and Petr Shevtsov and Benoit Thieurmel},
- year = {2017},
- note = {R package version 1.1.1.4},
+ year = {2016},
+ note = {R package version 1.1.1.3},
url = {https://CRAN.R-project.org/package=dygraphs},
}
@Manual{R-fivethirtyeight,
@@ -52,7 +52,7 @@ @Manual{R-ggplot2
title = {ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics},
author = {Hadley Wickham and Winston Chang},
year = {2016},
- note = {R package version 2.2.1},
+ note = {R package version 2.2.0},
url = {https://CRAN.R-project.org/package=ggplot2},
}
@Manual{R-ggplot2movies,
@@ -107,8 +107,8 @@ @Manual{R-mvtnorm
@Manual{R-nycflights13,
title = {nycflights13: Flights that Departed NYC in 2013},
author = {Hadley Wickham},
- year = {2017},
- note = {R package version 0.2.2},
+ year = {2016},
+ note = {R package version 0.2.0},
url = {https://CRAN.R-project.org/package=nycflights13},
}
@Manual{R-okcupiddata,
@@ -130,7 +135,7 @@ @Manual{R-rmarkdown
title = {rmarkdown: Dynamic Documents for R},
author = {JJ Allaire and Joe Cheng and Yihui Xie and Jonathan McPherson and Winston Chang and Jeff Allen and Hadley Wickham and Aron Atkins and Rob Hyndman},
year = {2016},
- note = {R package version 1.3},
+ note = {R package version 1.2},
url = {https://CRAN.R-project.org/package=rmarkdown},
}
@Manual{R-tibble,
@@ -151,6 +156,6 @@ @Manual{R-webshot
title = {webshot: Take Screenshots of Web Pages},
author = {Winston Chang},
year = {2016},
- note = {R package version 0.4.0},
+ note = {R package version 0.3.2},
url = {https://CRAN.R-project.org/package=webshot},
}
diff --git a/docs/10-effective-data-storytelling.html b/docs/10-effective-data-storytelling.html
index a35a78cf9..2e7be337f 100644
--- a/docs/10-effective-data-storytelling.html
+++ b/docs/10-effective-data-storytelling.html
The syntax here is a little different than what we have covered so far. The dygraph function expects the dates to be given as the rownames of the object. We then remove the date variable from the flights_summarized data frame since it is accounted for in the rownames. Lastly, we run the dygraph function on the new data frame that contains only the median arrival delay as a column, and then provide a selector to zoom in on the interactive plot via dyRangeSelector. (Note that this plot will only be interactive in the HTML version of this book.)
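To make these steps concrete, here is a sketch of what that code might look like; flights_summarized and its date column come from the text above, while the exact name of the remaining delay column is an assumption for illustration:

```{r eval=FALSE}
library(dplyr)
library(dygraphs)

# Assumed structure: flights_summarized has a `date` column plus one delay column
flights_day <- as.data.frame(flights_summarized)
rownames(flights_day) <- flights_day$date  # dygraph reads dates from the rownames
flights_day <- flights_day %>% select(-date)

# Plot the remaining column and add a zoomable range selector
dygraph(flights_day) %>% dyRangeSelector()
```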
We will focus only on the dplyr functions in this book, but you are encouraged to also explore tidyr if you are presented with data that is not in the tidy format that we have specified as the preferred option for our purposes. 5.5.2 Script of R code An R script file of all R code used in this chapter is available here. 5.5.3 What’s to come? This concludes the Data Exploration unit of this book. You should be pretty proficient in both plotting variables (or multiple variables together) in various data sets and manipulating data as we’ve done in this chapter. You are encouraged to step back through the code in earlier chapters and make changes as you see fit based on your updated knowledge. In Chapter 6, we’ll begin to build the pieces needed to understand how this unit of Data Exploration can tie into statistical inference in the Inference part of the book. Remember that the focus throughout is on data visualization and we’ll see that next when we discuss sampling, resampling, and bootstrapping. These ideas will lead us into hypothesis testing and confidence intervals. References "],
+["6-sim.html", "6 Simulating Randomness via mosaic 6.1 Random sampling 6.2 Visualizing sampling 6.3 Simulation 6.4 Review of mosaic simulation functions 6.5 Conclusion", " 6 Simulating Randomness via mosaic In this chapter we will introduce new concepts that will serve as the basis for the remainder of the text: sampling and resampling. We will see that the tools that you learned in the Data Exploration part of this book (tidy data, data visualization, and data manipulation) will also play an important role here. As mentioned before, the concepts throughout this text all build into a culmination allowing you to create better stories with data. We begin with some helpful definitions that will help us better understand why statistical inference exists and why it is needed. We will then progress with introducing the second of our main data sets (in addition to the nycflights13 data you’ve been working with) about OKCupid dating profiles to see how one can think of the distribution of a sample being an approximation of the distribution of the population. We will also focus on representative, random samples versus convenience samples in this context. We then shift to a famous example from statistics lore on a lady tasting tea. This section will focus on introducing concepts without a lot of statistical jargon. The chapter will conclude with a summary of the different functions introduced in the mosaic package in this chapter. Needed packages library(dplyr) library(ggplot2) library(okcupiddata) library(mosaic) library(knitr) 6.1 Random sampling Whenever you hear the phrases “random sampling” or just “sampling” (with regards to statistics), you should think about tasting soup. This likely sounds a little bonkers. Let’s dig into why tasting soup is such an excellent analogy to random sampling. 6.1.1 Tasting soup Figure 6.1: A bowl of Indian chicken and vegetable soup Imagine that you have invited a group of friends over to try a new recipe for soup that you’ve never made before. As in the image above downloaded from here, you’d like to make a bowl of Indian chicken soup with lots of different kinds of vegetables included. You’ve carefully followed along with the recipe but you are concerned that you don’t have a lot of experience making foods from India. It is coming near the end of the prescribed time to cook given in the recipe. You begin to wonder: “Did I add too much curry spice?” “Are the carrots cooked enough?” “Does this actually taste good?” How can we answer these questions? Does it matter where we take a bite of soup from? Is there anything we should do to the soup before we taste? Is one taste enough? 6.1.2 Common terms The process of sampling brings with it many common terms that we define now. As you read over these definitions, think about how they each apply to the tasting soup example above. Definition: population The population is the (usually) large pool of observational units that we are interested in. Definition: sample A sample is a smaller collection of observational units that is selected from the population. Definition: sampling Sampling refers to the process of selecting observations from a population. There are both random and non-random ways this can be done. Definition: representative sample A sample is said be a representative sample if the characteristics of observational units selected are a good approximation of the characteristics from the original population. Definition: bias Bias corresponds to a favoring of one group in a population over another group. 
Definition: generalizability Generalizability refers to the largest group about which it makes sense to make inferences from the sample collected. This is directly related to how the sample was selected. Definition: parameter A parameter is a calculation based on one or more variables measured in the population. Parameters are almost always denoted symbolically using Greek letters such as \\(\\mu\\), \\(\\pi\\), \\(\\sigma\\), \\(\\rho\\), and \\(\\beta\\). Definition: statistic A statistic is a calculation based on one or more variables measured in the sample. Statistics are usually denoted by lowercase Arabic letters with other symbols added sometimes. These include \\(\\bar{x}\\), \\(\\hat{p}\\), \\(s\\), \\(p\\), and \\(b\\). Learning check (LC6.1) Explain in your own words how tasting soup relates to the concepts of sampling covered here. (LC6.2) Describe a different scenario (not food or drink related) that is analogous to sampling concepts covered here. Let’s explore these terms for our tasting soup example: Population - the entire container of soup that we have cooked. Sample - any smaller portion of soup collected that isn’t the whole container of soup. We could say that each spoonful of soup represents one sample. Sampling - the process of selecting spoonfuls from the container of soup. Representative sample - A sample we select will only be representative if it tastes like what the soup tastes like in general. If we only select a carrot in our spoonful, we might not have a representative sample. Bias - As we noted with the carrot selection example above, we may select a sample that is not representative. If you watch chefs cook or if you frequently cook, you’ll be sure to stir the soup before you taste it. Generalizability - If we stir our soup before we taste a spoonful (and if we make sure we don’t just pick our favorite item in the soup), results from our sample can be generalized (by and large) to the larger pot of soup. When we say “Yum! This is good!” after a couple spoonfuls, we can be pretty confident that each bowl of soup for our friends will taste good too. Parameter - An example here could be the proportion of curry added to the entire pot of soup. A measurement of how salty the pot of soup is on average is also a parameter. How crunchy, on average, the carrots are in the pot of soup is one more example. Statistic - To convert a parameter to a statistic, you need only think about the same measurement on a spoonful: The proportion of curry to non-curry in a spoonful of soup How salty the spoonful of soup is that we collected as our sample How crunchy the carrots are in our spoonful of soup Learning check (LC6.3) Why isn’t our population all bowls of soup? All bowls of Indian chicken soup? (LC6.4) Describe a way in which we could select a sample of flights from nycflights13 that is not representative. (LC6.5) If we treat all of the flights in nycflights13 as the population, give examples of three parameters we could calculate. (LC6.6) If we treat all of the flights in nycflights13 as the population, give examples of three statistics we could calculate. (LC6.7) What biases might we see if we only select flights to Boston when we are interested in looking at mean flight delays from NYC? 
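To make the parameter versus statistic distinction concrete for LC6.5 and LC6.6, here is a minimal code sketch that treats all of the flights in nycflights13 as the population; the sample size of 1000 is an arbitrary choice for illustration.

```{r eval=FALSE}
library(dplyr)
library(mosaic)
library(nycflights13)
data(flights)

# Parameter: a calculation on the entire population of flights
flights %>%
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE))

# Statistic: the same calculation on a random sample of 1000 flights
flights %>%
  resample(size = 1000, replace = FALSE) %>%
  summarize(mean_dep_delay = mean(dep_delay, na.rm = TRUE))
```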
6.2 Visualizing sampling Let’s explore how sampling and these other terms relate to working with data and data visualization. Here we introduce the okcupiddata R package (Kim and Escobedo-Land 2016). Note that permission to use this data to create the R package was explicitly granted by OkCupid. More information about this package is available here. The profiles data frame in this R data package contains data about 59,946 OkCupid users who were living within 25 miles of San Francisco, had active profiles on June 26, 2012, were online in the previous year, and had at least one picture in their profile. We will be focusing on the height variable, which corresponds to a self-reported height for each individual on their profile. Note that this is measured in inches. library(okcupiddata) data(profiles) Let’s take a look at the distribution of height using a histogram and ggplot2: library(ggplot2) ggplot(data = profiles, mapping = aes(x = height)) + geom_histogram(bins = 20, color = "white") We see here that the self-reported nature of the data has left it a little messy. Learning check (LC6.8) Why does the histogram go all the way back to 0 for height and all the way up to 100? To clean up the data a bit, let’s focus on just looking at heights between 55 inches and 85 inches. Remember that the filter function in dplyr allows us to focus on a subset of rows. The specific subset of rows we are interested in corresponds to the argument to the filter function. We will create a new data frame called profiles_subset that contains all rows with heights between 55 and 85 inclusive. library(dplyr) profiles_subset <- profiles %>% filter(between(height, 55, 85)) Next, let’s produce the same histogram as above but using the profiles_subset data frame instead. ggplot(data = profiles_subset, mapping = aes(x = height)) + geom_histogram(bins = 20, color = "white") We can think of this data as representing the population of interest. Let’s now take a random sample of size 100 from this population and look to see if this sample represents the overall shape of the population. In other words, we are going to use data visualization as our guide to understand the representativeness of the sample selected. library(mosaic) set.seed(2017) profiles_sample1 <- profiles_subset %>% resample(size = 100, replace = FALSE) The set.seed function is used to ensure that all users get the same random sample when they run the code above. It is a way of interfacing with the pseudo-random number generation scheme that R uses to generate “random” numbers. If that command was not run, you’d obtain a different random sample than someone else if you ran the code above for the first time. We have introduced the resample function from the mosaic package here (Pruim, Kaplan, and Horton 2016). This function can be used for both sampling with and without replacement. Here we have chosen to sample without replacement. In other words, after the first row is chosen at random from the profiles_subset data frame, it is excluded from the remaining 99 draws. Let’s now visualize the 100 values of the height variable in the profiles_sample1 data frame. To keep this visualization on the same horizontal scale as our original population presented in profiles_subset we can use the coord_cartesian function along with the c function to specify the limits on the horizontal axis. ggplot(data = profiles_sample1, mapping = aes(x = height)) + geom_histogram(bins = 20, color = "white", fill = "red") + coord_cartesian(xlim = c(55, 85)) Learning check (LC6.9) Does this random sample of height represent the population height variable well? Explain why or why not in a couple of sentences. 
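One way to supplement the visual check in LC6.9 is to compare numerical summaries of the sample and the population directly; a minimal sketch:

```{r eval=FALSE}
library(dplyr)

# Population summaries versus sample summaries for height
profiles_subset %>% summarize(mean = mean(height), sd = sd(height))
profiles_sample1 %>% summarize(mean = mean(height), sd = sd(height))
```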
We now repeat this sampling process to see how another random sample of height compares to the original population distribution. profiles_sample2 <- profiles_subset %>% resample(size = 100, replace = FALSE) ggplot(data = profiles_sample2, mapping = aes(x = height)) + geom_histogram(bins = 20, color = "black", fill = "yellow") + coord_cartesian(xlim = c(55, 85)) Remember that a sample can never truly quantify all of the properties of a population since it contains less data and, thus, less information. We can use the overall shape as a good guess as to the representativeness of the sample in regard to the population, though. We see that the above two random samples of size 100 have roughly the same shape as the original population height data. Let’s next explore what is known as a convenience sample and how its distribution compares to the population distribution. A convenience sample is a sample that is chosen because it is convenient for the person selecting it. While certainly less work, convenience samples are generally not representative of the population since they will exclude some (usually large) portion of the population. Let’s look at values of height in our profiles_subset population that are at least 6 feet (72 inches) and have that be the sample we choose. profiles_sample3 <- profiles_subset %>% filter(height >= 72) ggplot(data = profiles_sample3, mapping = aes(x = height)) + geom_histogram(bins = 20, color = "white", fill = "blue") + coord_cartesian(xlim = c(55, 85)) This is a clear example of a sample that is not representative of the population. The population height variable is roughly symmetric, whereas this distribution is right-skewed. Further, since it only selects large heights it has completely excluded the small and middle heights. We have seen here that data visualization provides an excellent tool in judging the representativeness of a sample. 6.2.1 Sampling distribution The representativeness of a sample plays an even larger role than just looking at the shapes of distributions. Let’s suppose we were interested in estimating the mean height of all profiles in the profiles_subset data frame. To do so, we could look at the mean of the height variable in the profiles_sample1 data frame: profiles_sample1 %>% summarize(mean(height)) ## mean(height) ## 1 68.45 But, we could also use profiles_sample2: profiles_sample2 %>% summarize(mean(height)) ## mean(height) ## 1 68.2 Or maybe even profiles_sample3: profiles_sample3 %>% summarize(mean(height)) ## mean(height) ## 1 73.37917 We see a clear difference here in looking at the mean of height in profiles_sample3 versus profiles_sample1 and profiles_sample2. This comes from the bias introduced by choosing only the largest heights for profiles_sample3. If we had used this sample as our only sample, we would be quite a ways off from the actual mean height in our profiles_subset population. We also see that even random samples produce means that aren’t exactly the same. This sampling variability can be shown via what is called a sampling distribution. This is defined as the behavior of a statistic under repeated sampling. To build this sampling distribution for this example, we’ve created an interactive app using the shiny R package that is available at http://ismay.shinyapps.io/okcupidheights/. You can specify the sample size you’d like to work with (100 is chosen by default) and then generate a random sample. 
You then can see the mean of this generated sample plotted in the bottom visualization. Repeating this process many times, you can start to see the shape of the sampling distribution take form. A screenshot of the app is below, but you are encouraged to go to http://ismay.shinyapps.io/okcupidheights/ and test it out on your own. Figure 6.2: Sampling distribution app at http://ismay.shinyapps.io/okcupidheights/. 6.2.2 Repeated sampling via do We have looked at two random samples above, but using mosaic we can repeat this process over and over again with the do function. Below, we repeat this sampling process 10,000 times. We can then plot the different values of the sample means to get a sense for what a reasonable range of values for the population parameter mean height is in the profiles_subset data frame. sample_means <- do(10000) * (profiles_subset %>% resample(size = 100, replace = FALSE) %>% summarize(mean_height = mean(height))) ggplot(data = sample_means, mapping = aes(x = mean_height)) + geom_histogram(color = "white", bins = 20) Note how the range of sample mean height values is much more narrow than the original range of height in the profiles_subset data frame. We also see a characteristic shape to this distribution of mean_height: the normal curve. This idea is commonly associated with statistics and you hopefully have a good sense of how this distribution comes about. As before, if you aren’t quite sure of this yet, go back and explore the shiny app above a bit more. We see that many values for the sample mean appear near the center of the distribution and a few values appear out in the tails providing the bell-shaped distribution linked with the normal distribution. You’ll see more examples of this in the chapters to come and in Appendix B. Learning check (LC6.10) Why do the sample mean values have a much smaller spread than the original population data? You may want to play with the shiny app above a bit to understand why this is the case. (LC6.11) Why is random sampling so important here to create a distribution of sample means that provide a range of plausible values for the population mean height? 6.3 Simulation We next will introduce the ideas behind hypothesis testing that we will delve into more formally in the chapters to come. What follows is taken from a book entitled The Lady Tasting Tea (Salsburg 2001): It was a summer afternoon in Cambridge, England, in the late 1920s. A group of university dons, their wives, and some guests were sitting around an outdoor table for afternoon tea. One of the women was insisting that tea tasted different depending upon whether the tea was poured into the milk or whether the milk was poured into the tea. The scientific minds among the men scoffed at this as sheer nonsense. What could be the difference? They could not conceive of any difference in the chemistry of the mixtures that could exist. A thin, short man, with thick glasses and a Vandyke beard beginning to turn gray, pounced on the problem. “Let us test the proposition,” he said excitedly. He began to outline an experiment in which the lady who insisted there was a difference would be presented with a sequence of cups of tea, in some of which the milk had been poured into the tea and in others of which the tea had been poured into the milk… So it was that sunny summer afternoon in Cambridge. The lady might or might not have been correct about the tea infusion. 
The fun would be in finding a way to determine if she was right, and, under the direction of the man with the Vandyke beard, they began to discuss how they might make that determination. Enthusiastically, many of them joined with him in setting up the experiment. Within a few minutes, they were pouring different patterns of infusion in a place where the lady could not see which cup was which. Then, with an air of finality, the man with the Vandyke beard presented her with her first cup. She sipped for a minute and declared that it was one where the milk had been poured into the tea. He noted her response without comment and presented her with the second cup… The man with the Vandyke beard was Ronald Aylmer Fisher, who was in his late thirties at the time. He would later be knighted Sir Ronald Fisher. In 1935, he wrote a book entitled The Design of Experiments, and he described the experiment of the lady tasting tea in the second chapter of that book. In his book, Fisher discusses the lady and her belief as a hypothetical problem. He considers the various ways in which an experiment might be designed to determine if she could tell the difference. The problem in designing the experiment is that, if she is given a single cup of tea, she has a 50 percent chance of guessing correctly which infusion was used, even if she cannot tell the difference. If she is given two cups of tea, she still might guess correctly. In fact, if she knew that the two cups of tea were each made with a different infusion, one guess could be completely right (or completely wrong). Similarly, even if she could tell the difference, there is some chance that she might have made a mistake, that one of the cups was not mixed as well or that the infusion was made when the tea was not hot enough. She might be presented with a series of ten cups and correctly identify only nine of them, even if she could tell the difference. In his book, Fisher discusses the various possible outcomes of such an experiment. He describes how to decide how many cups should be presented and in what order and how much to tell the lady about the order of presentations. He works out the probabilities of different outcomes, depending upon whether the lady is or is not correct. Nowhere in this discussion does he indicate that such an experiment was ever run. Nor does he describe the outcome of an actual experiment. It’s amazing that there is no actual evidence that such an event actually took place. This problem is a great introduction into inference though and we can proceed by testing to see how likely it is for a person to guess correctly, say, 9 out of 10 times, assuming that person is just guessing. In other words, is the person just lucky or do we have reason to suspect that they can actually detect whether milk was put in first or not? We need to think about this problem from the standpoint of hypothesis testing. First, we’ll need to identify some important parts of a hypothesis test before we proceed with the analysis. Learning check (LC6.12) What does “by chance” mean in this context? (LC6.13) What is our observed statistic? (LC6.14) What is this statistic trying to estimate? (LC6.15) How could we test to see whether the person is just guessing or if they have some special talent of identifying milk before tea or vice-versa? Let’s begin with an experiment. I will flip a coin 10 times. Your job is to try to predict the sequence of my 10 flips. Write down 10 H’s and T’s corresponding to your predictions. 
We could compare your guesses with my actual flips and then note how many correct guesses you have. You may be asking yourself how this models a way to test whether the person was just guessing or not. All we are trying to do is see how likely it is to have 9 matches out of 10 if the person was truly guessing. When we say “truly guessing” we are assuming that we have a 50/50 chance of guessing correctly. This can be modeled using a coin flip and then seeing whether we guessed correctly for each of the coin flips. If we guessed correctly, we can think of that as a “success.” We often don’t have time to do the physical flipping over and over again and we’d like to be able to do more than just 20 different simulations or so. Luckily, we can use R to simulate this process many times. The mosaic package includes a function called rflip(), which can be used to flip one coin. Well, not exactly. It uses pseudo-random number generation to “flip” a virtual coin. In order for us all to get the same results here, we can set the seed of the pseudo-random number generator. Let’s see an example of this: (Remember to load the mosaic package!) library(mosaic) set.seed(2017) do(1) * rflip(1) ## n heads tails prop ## 1 1 1 0 1 This shows us the proportion of “successes” in one flip of a coin. The do function in the mosaic package will be useful and you can begin to understand what it does with another example. do(13) * rflip(10) ## n heads tails prop ## 1 10 4 6 0.4 ## 2 10 5 5 0.5 ## 3 10 5 5 0.5 ## 4 10 7 3 0.7 ## 5 10 5 5 0.5 ## 6 10 7 3 0.7 ## 7 10 5 5 0.5 ## 8 10 4 6 0.4 ## 9 10 7 3 0.7 ## 10 10 2 8 0.2 ## 11 10 4 6 0.4 ## 12 10 5 5 0.5 ## 13 10 4 6 0.4 We’ve now done a simulation of what actually happened when you flipped a coin ten times. We have 13 different simulations of flipping a coin 10 times. Note here that heads now corresponds to the number of correct guesses and tails corresponds to the number of incorrect guesses. (This can be tricky to understand at first since we’ve switched what “heads” and “tails” mean.) Looking at the output above for our simulation of 13 student guesses, we can begin to get a sense for what an “expected” sample proportion of successes may be. Around five out of 10 seems to be the most likely value. What does this say about what we actually observed with a success rate of 9/10? To better answer this question, we can simulate 10,000 student guesses and then look at the distribution of the simulated sample proportion of successes, also known as the null distribution. library(dplyr) simGuesses <- do(10000) * rflip(10) simGuesses %>% group_by(heads) %>% summarize(count = n()) ## # A tibble: 11 × 2 ## heads count ## <dbl> <int> ## 1 0 9 ## 2 1 98 ## 3 2 431 ## 4 3 1197 ## 5 4 2016 ## 6 5 2459 ## 7 6 2066 ## 8 7 1211 ## 9 8 408 ## 10 9 91 ## 11 10 14 We can see here that we have created a count of how many of each of the 10,000 sets of 10 flips resulted in 0, 1, 2, \\(\\ldots\\), up to 10 heads. Note the use of the group_by and summarize functions from Chapter 5 here. In addition, we can plot the distribution of these simulated heads using the ideas from Chapter 4. heads is a quantitative variable. Think about which type of plot is most appropriate here before reading further. We already have an idea as to an appropriate plot by the data summarization that we did in the chunk above. We’d like to see how many heads occurred in the 10,000 sets of 10 flips. 
In other words, we’d like to see how frequently 9 or more heads occurred in the 10 flips: library(ggplot2) simGuesses %>% ggplot(aes(x = heads)) + geom_histogram(binwidth = 1, color = "white") Figure 6.3: Histogram of number of heads in simulation The horizontal axis labels are a little confusing here. What does 2.5 or 7.5 heads mean? In simGuesses, heads is a numerical variable. Thus, ggplot is expecting the values to be on a continuous scale. We can switch the scale to be discrete by invoking the factor function and using geom_bar instead of geom_histogram: library(ggplot2) simGuesses %>% ggplot(aes(x = factor(heads))) + geom_bar() Figure 6.4: Barplot of number of heads in simulation You’ll frequently need to make this conversion to factor when making a barplot with quantitative variables. Remember from “Getting Used to R, RStudio, and R Markdown” (Ismay 2016) that a factor variable is useful when there is a natural ordering to the variable and it only takes on discrete values and not fractional values like 2.5. Our heads variable has a natural ordering: 0, 1, 2, \\(\\ldots\\), 10. Again, note that the shape of this distribution of the number of heads follows what appears to be a normal distribution. We’ll see in a related example that if appropriate conditions/assumptions are met with the data, we can expect to see a normal distribution result. When these conditions aren’t met, the simulation methodology we’ve presented here still works well whereas the traditional normal-based methods start to fall apart. We will delve further into hypothesis testing in the next few chapters. This null distribution in combination with the sampling distribution concept covered earlier will be of utmost importance going forward. 6.4 Review of mosaic simulation functions In this chapter, we’ve discussed three functions in the mosaic package useful in understanding the stepping stones to statistical inference: do, rflip, and resample. We will also work with the shuffle function in later chapters and we summarize it here for your reference. do: Its main use is in replicating a process many times. It has one argument n which specifies how many times to replicate the process. It then uses *, which can be read as “times”, and whatever follows immediately after it as the process. rflip: This is used to simulate the flipping of a coin many times. By default, it flips a fair coin one time giving an equal chance to heads and tails. It is frequently used with do() * to simulate many coin flips in multiple sets. resample: This is used to sample from a larger data set with or without replacement. When we are thinking about the concept of random sampling, we sample without replacement. We can also sample with replacement, which corresponds to values being returned to the pool after being drawn, with the possibility that they are drawn again in the resample. This will be of great importance when we discuss bootstrapping with confidence intervals. shuffle: Its main purpose is to permute the values of one variable across the values of another variable. This acts in much the same way as shuffling a deck of cards and then presenting the shuffled deck to two (or more) players. Learning check (LC6.16) Recreate rflip using only the resample function and specifying the appropriate arguments. (LC6.17) Recreate shuffle using only the resample function and specifying the appropriate arguments. 
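As a nudge toward LC6.16, one possible sketch recreates a single call to rflip(10) by resampling with replacement from a two-element vector; the "heads"/"tails" labels are our choice for illustration:

```{r eval=FALSE}
library(mosaic)

# Ten "flips" drawn with replacement from a fair two-outcome pool
flips <- resample(c("heads", "tails"), size = 10, replace = TRUE)
flips
# Proportion of heads, mirroring the prop column returned by rflip(10)
mean(flips == "heads")
```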
6.5 Conclusion 6.5.1 Script of R code An R script file of all R code used in this chapter is available here. 6.5.2 What’s to come? This chapter has served as an introduction to the inferential techniques that will be discussed in greater detail in Chapter 7 for hypothesis testing and in Chapter 8 for confidence intervals. In these chapters, we will see how we can use a related concept of resampling when working with the distributions of two groups. All of these concepts will be further reinforced in Chapter 9 as well. References "],
+["7-hypo.html", "7 Hypothesis Testing 7.1 When Inference Is Not Needed 7.2 Basics of Hypothesis Testing 7.3 Criminal trial analogy 7.4 Types of Errors in Hypothesis Testing 7.5 Statistical Significance 7.6 EXAMPLE: Revisiting the Lady Tasting Tea 7.7 EXAMPLE: Comparing two means 7.8 Building theory-based methods using computation 7.9 Conclusion", " 7 Hypothesis Testing We saw some of the main concepts of hypothesis testing introduced in Chapter 6. We will expand further on these ideas here and also provide a framework for understanding hypothesis tests in general. Instead of presenting you with lots of different formulas and scenarios, we hope to build a way to think about all hypothesis tests. You can then adapt to different scenarios as needed down the road when you encounter different statistical situations. The same can be said for confidence intervals. There is one general framework that applies to all confidence intervals and we will elaborate on this further in Chapter 8. The specifics may change slightly for each variation, but the important idea is to understand the general framework so that you can apply it to more specific problems. We believe that this approach is much better in the long-term than teaching you specific tests and confidence intervals rigorously. You can find fully-worked out examples for five common hypothesis tests and their corresponding confidence intervals in Appendix B. We recommend that you carefully review these examples as they also cover how the general frameworks apply to traditional normal-based methodologies like the \\(t\\)-test and normal-theory confidence intervals. You’ll see there that these methods are just approximations for the general computational frameworks, but require conditions to be met for their results to be valid. The general frameworks using randomization, simulation, and bootstrapping do not hold the same sorts of restrictions and further advance computational thinking, which is one big reason for their emphasis throughout this textbook. Needed packages library(dplyr) library(ggplot2) library(mosaic) library(knitr) library(nycflights13) 7.1 When Inference Is Not Needed Before we delve into the two techniques of inference (hypothesis testing and confidence intervals), it’s good to remember that there are cases where you need not perform a rigorous statistical inference. An important and time-saving skill is to ALWAYS do exploratory data analysis using dplyr and ggplot2 before thinking about running a hypothesis test. Let’s look at such an example selecting a sample of flights traveling to Boston and to San Francisco from New York City in the flights data frame in the nycflights13 package. (We will remove flights with missing data first using na.omit and then sample 100 flights going to each of the two airports.) library(nycflights13) data(flights) bos_sfo <- flights %>% na.omit() %>% filter(dest %in% c("BOS", "SFO")) %>% group_by(dest) %>% sample_n(100) Suppose we were interested in seeing if the air_time to SFO in San Francisco was statistically greater than the air_time to BOS in Boston. 
As suggested, let’s begin with some exploratory data analysis to get a sense for how the two variables of air_time and dest relate for these two destination airports: library(dplyr) bos_sfo_summary <- bos_sfo %>% group_by(dest) %>% summarize(mean_time = mean(air_time), sd_time = sd(air_time)) kable(bos_sfo_summary) dest mean_time sd_time BOS 38.35 5.726732 SFO 345.61 15.354988 Looking at these results, we can clearly see that SFO air_time is much larger than BOS air_time. The standard deviation is also extremely informative here. Learning check (LC7.1) Could we make the same type of immediate conclusion that SFO had a statistically greater air_time if, say, its corresponding standard deviation was 200 minutes? What about 100 minutes? Explain. To further understand just how different the air_time variable is for BOS and SFO, let’s look at a boxplot: library(ggplot2) ggplot(data = bos_sfo, mapping = aes(x = dest, y = air_time)) + geom_boxplot() Since there is no overlap at all, we can conclude that the air_time for San Francisco flights is statistically greater (at any level of significance) than the air_time for Boston flights. This is a clear example of not needing to do anything more than some simple descriptive statistics to get an appropriate inferential conclusion. This is one reason why you should ALWAYS investigate the sample data first using dplyr and ggplot2 via exploratory data analysis. As you get more and more practice with hypothesis testing, you’ll be better able to determine in many cases whether or not the results will be statistically significant. There are circumstances where it is difficult to tell, but you should always try to make a guess FIRST about significance after you have completed your data exploration and before you actually begin the inferential techniques. 7.2 Basics of Hypothesis Testing In a hypothesis test, we will use data from a sample to help us decide between two competing hypotheses about a population. We make these hypotheses more concrete by specifying them in terms of at least one population parameter of interest. We refer to the competing claims about the population as the null hypothesis, denoted by \\(H_0\\), and the alternative (or research) hypothesis, denoted by \\(H_a\\). The roles of these two hypotheses are NOT interchangeable. The claim for which we seek significant evidence is assigned to the alternative hypothesis. The alternative is usually what the experimenter or researcher wants to establish or find evidence for. Usually, the null hypothesis is a claim that there really is “no effect” or “no difference.” In many cases, the null hypothesis represents the status quo or that nothing interesting is happening. We assess the strength of evidence by assuming the null hypothesis is true and determining how unlikely it would be to see sample results/statistics as extreme as (or more extreme than) those in the original sample. Hypothesis testing brings about many weird and incorrect notions in the scientific community and society at large. One reason for this is that statistics has traditionally been thought of as a magic box of algorithms and procedures to get to results, as is readily apparent if you do a Google search of “flowchart statistics hypothesis tests”. There are so many different complex ways to determine which test is appropriate. You’ll see that we don’t need to rely on these complicated series of assumptions and procedures to conduct a hypothesis test any longer. 
These methods were introduced in a time when computers weren’t powerful. Your cellphone (in 2016) has more power than the computers that sent NASA astronauts to the moon after all. We’ll see that ALL hypothesis tests can be broken down into the following framework given by Allen Downey here: Figure 7.1: Hypothesis Testing Framework Before we hop into this framework, we will provide another way to think about hypothesis testing that may be useful. 7.3 Criminal trial analogy We can think of hypothesis testing in the same context as a criminal trial in the United States. A criminal trial in the United States is a familiar situation in which a choice between two contradictory claims must be made. The accused must be judged either guilty or not guilty. Under the U.S. system of justice, the individual on trial is initially presumed not guilty. Only STRONG EVIDENCE to the contrary causes the not guilty claim to be rejected in favor of a guilty verdict. The phrase “beyond a reasonable doubt” is often used to set the cutoff value for when enough evidence has been given to convict. Theoretically, we should never say “The person is innocent.” but instead “There is not sufficient evidence to show that the person is guilty.” Now let’s compare that to how we look at a hypothesis test. The decision about the population parameter(s) must be judged to follow one of two hypotheses. We initially assume that \\(H_0\\) is true. The null hypothesis \\(H_0\\) will be rejected (in favor of \\(H_a\\)) only if the sample evidence strongly suggests that \\(H_0\\) is false. If the sample does not provide such evidence, \\(H_0\\) will not be rejected. The analogy to “beyond a reasonable doubt” in hypothesis testing is what is known as the significance level. This will be set before conducting the hypothesis test and is denoted as \\(\\alpha\\). Common values for \\(\\alpha\\) are 0.1, 0.01, and 0.05. 7.3.1 Two possible conclusions Therefore, we have two possible conclusions with hypothesis testing: Reject \\(H_0\\) Fail to reject \\(H_0\\) Gut instinct says that “Fail to reject \\(H_0\\)” should say “Accept \\(H_0\\)” but this technically is not correct. Accepting \\(H_0\\) is the same as saying that a person is innocent. We cannot show that a person is innocent; we can only say that there was not enough substantial evidence to find the person guilty. When you run a hypothesis test, you are the jury of the trial. You decide whether there is enough evidence to convince yourself that \\(H_a\\) is true (“the person is guilty”) or that there was not enough evidence to convince yourself \\(H_a\\) is true (“the person is not guilty”). You must convince yourself (using statistical arguments) which hypothesis is the correct one given the sample information. Important note: Therefore, DO NOT WRITE “Accept \\(H_0\\)” any time you conduct a hypothesis test. Instead write “Fail to reject \\(H_0\\).” 7.4 Types of Errors in Hypothesis Testing Unfortunately, just as a jury or a judge can make an incorrect decision in regard to a criminal trial by reaching the wrong verdict, there is some chance we will reach the wrong conclusion via a hypothesis test about a population parameter. As with criminal trials, this comes from the fact that we don’t have complete information, but rather a sample from which to try to infer about a population. The possible erroneous conclusions in a criminal trial are that an innocent person is convicted (found guilty) or that a guilty person is set free (found not guilty). 
The possible errors in a hypothesis test are rejecting \\(H_0\\) when in fact \\(H_0\\) is true (Type I Error) or failing to reject \\(H_0\\) when in fact \\(H_0\\) is false (Type II Error). The risk of error is the price researchers pay for basing an inference about a population on a sample. With any reasonable sample-based procedure, there is some chance that a Type I error will be made and some chance that a Type II error will occur. To help understand the concepts of Type I error and Type II error, observe the following table: Figure 7.2: Type I and Type II errors If we are using sample data to make inferences about a parameter, we run the risk of making a mistake. Obviously, we want to minimize our chance of error; we want a small probability of drawing an incorrect conclusion. The probability of a Type I Error occurring is denoted by \\(\\alpha\\) and is called the significance level of a hypothesis test. The probability of a Type II Error is denoted by \\(\\beta\\). Formally, we can define \\(\\alpha\\) and \\(\\beta\\) in regard to the table above, but for hypothesis tests instead of a criminal trial. \\(\\alpha\\) corresponds to the probability of rejecting \\(H_0\\) when, in fact, \\(H_0\\) is true. \\(\\beta\\) corresponds to the probability of failing to reject \\(H_0\\) when, in fact, \\(H_0\\) is false. Ideally, we want \\(\\alpha = 0\\) and \\(\\beta = 0\\), meaning that the chance of making an error does not exist. When we have to use incomplete information (sample data), it is not possible to have both \\(\\alpha = 0\\) and \\(\\beta = 0\\). We will always have the possibility of at least one error existing when we use sample data. Usually, what is done is that \\(\\alpha\\) is set before the hypothesis test is conducted and then the evidence is judged against that significance level. Common values for \\(\\alpha\\) are 0.05, 0.01, and 0.10. If \\(\\alpha = 0.05\\), we are using a testing procedure that, used over and over with different samples, rejects a TRUE null hypothesis five percent of the time. So if we can set \\(\\alpha\\) to be whatever we want, why choose 0.05 instead of 0.01 or even better 0.0000000000000001? Well, a small \\(\\alpha\\) means the test procedure requires the evidence against \\(H_0\\) to be very strong before we can reject \\(H_0\\). This means we will almost never reject \\(H_0\\) if \\(\\alpha\\) is very small. If we almost never reject \\(H_0\\), the probability of a Type II Error – failing to reject \\(H_0\\) when we should – will increase! Thus, as \\(\\alpha\\) decreases, \\(\\beta\\) increases and as \\(\\alpha\\) increases, \\(\\beta\\) decreases. We, therefore, need to strike a balance in \\(\\alpha\\) and \\(\\beta\\) and the common values for \\(\\alpha\\) of 0.05, 0.01, and 0.10 usually lead to a nice balance. Learning check (LC7.2) Reproduce the table above about errors, but for a hypothesis test, instead of the one provided for a criminal trial. 7.4.1 Logic of Hypothesis Testing Take a random sample (or samples) from a population (or multiple populations). If the sample data are consistent with the null hypothesis, do not reject the null hypothesis. If the sample data are inconsistent with the null hypothesis (in the direction of the alternative hypothesis), reject the null hypothesis and conclude that there is evidence the alternative hypothesis is true (based on the particular sample collected). 
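To see the significance level in action, here is a small simulation sketch using the mosaic tools from Chapter 6: we repeatedly run an experiment in which the null hypothesis is true (a fair coin) and record how often a fixed decision rule rejects it. The rule of rejecting at 9 or more heads is chosen only for illustration.

```{r eval=FALSE}
library(mosaic)
set.seed(2017)

# 1000 experiments of 10 fair coin flips each, so H0 ("just guessing") is true
null_world <- do(1000) * rflip(10)

# Decision rule: reject H0 whenever 9 or more heads are observed.
# The long-run proportion of (wrong) rejections estimates this rule's Type I error rate.
mean(null_world$heads >= 9)
```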
7.5 Statistical Significance The idea that sample results are more extreme than we would reasonably expect to see by random chance if the null hypothesis were true is the fundamental idea behind statistical hypothesis tests. If data at least as extreme would be very unlikely if the null hypothesis were true, we say the data are statistically significant. Statistically significant data provide convincing evidence against the null hypothesis in favor of the alternative, and allow us to generalize our sample results to the claim about the population. Learning check (LC7.3) What is wrong about saying “The defendant is innocent.” based on the US system of criminal trials? (LC7.4) What is the purpose of hypothesis testing? (LC7.5) What are some flaws with hypothesis testing? How could we alleviate them? 7.6 EXAMPLE: Revisiting the Lady Tasting Tea Recall the “There is Only One Test” diagram from earlier: Figure 7.3: Hypothesis Testing Framework We will now walk through how each of the steps in the diagram applies to determining whether the lady tasting tea was actually better than chance at determining whether or not milk was added first. We will see that the process of creating a null distribution is a statistical way of quantifying surprise. 7.6.1 Data Let’s assume as we did in Chapter 6 that the lady is correct in determining whether milk was added first or not in 9 out of 10 trials. Our data, therefore, may look something like: Correct Correct Correct Incorrect Correct Correct Correct Correct Correct Correct 7.6.2 Test Statistic \\(\\delta\\) We are interested in the number of Correct out of our 10 trials. We can denote this number of successes using the symbol \\(t\\), where \\(t\\) corresponds to total. This is our test statistic \\(\\delta\\) in this case. 7.6.3 Observed effect \\(\\delta^*\\) The actual observed value of the test statistic from our observed sample is \\(\\hat{t}_{obs} = 9\\). Thus, \\(\\delta^* = 9\\). 7.6.4 Model of \\(H_0\\) Our null hypothesis is that the lady is only as good as chance at guessing correctly. Hypotheses always correspond to parameters and are denoted with Greek letters. Thus, symbolically, we have \\(H_0: \\tau = 5\\). Since we are assuming chance and we have 10 flips with 0.5 probability of success on each flip, we have \\(\\tau = 10 \\times 0.5 = 5\\). 7.6.5 Simulated Data We now want to use this null hypothesis to simulate the test statistic assuming that the null hypothesis is true. Therefore, we want to figure out a way to simulate 10 trials, getting either the choice Correct or Incorrect, assuming that the probability of success (getting it Correct) in any given trial is 0.5. Tactile simulation When you are presented with a hypothesis testing problem, frequently the most challenging portion is setting up how to simulate the data assuming the null hypothesis is true. To facilitate this, setting up a tactile, hands-on experiment can help. In this case, flipping a fair coin is a great way to simulate this process. This simulates how the sample could be collected assuming the null hypothesis is true. To simulate 10 trials, we could flip the fair coin and record Heads as Correct and Tails as Incorrect. Some simulated data using this coin flipping procedure may look like the following. Note that this data frame is not tidy, but is a convenient way to look at the results of the simulation in this wide format. The numbers on the far left correspond to the number of the trial. 
Table 7.1: A table of three sets of 10 coin flips sample1 sample2 sample3 1 Correct Correct Correct 2 Correct Incorrect Incorrect 3 Incorrect Incorrect Correct 4 Incorrect Incorrect Correct 5 Correct Incorrect Incorrect 6 Correct Incorrect Correct 7 Incorrect Incorrect Correct 8 Incorrect Correct Incorrect 9 Incorrect Correct Incorrect 10 Incorrect Correct Incorrect We then use the formula for the Test Statistic to determine the simulated test statistic for each of these simulated samples. So in this case we have \\(t_1 = 4\\), \\(t_2 = 4\\), \\(t_3 = 5\\). 7.6.6 Distribution of \\(\\delta\\) under \\(H_0\\) We could continue this process, say, 10,000 times by flipping a coin in sets of 10 for 10,000 repetitions and noting how many heads out of 10 we have for each set. It’s at this point that you surely realize that a computer can do this procedure much faster and more efficiently than the tactile experiment with a coin. Recall that we’ve already created the distribution of 10,000 such coin flips and we’ve stored these values in the heads variable in the simGuesses data frame: library(ggplot2) ggplot(data = simGuesses, aes(x = factor(heads))) + geom_bar() 7.6.7 The p-value Definition: \\(p\\)-value: The p-value is the probability of observing a sample statistic as extreme as or more extreme than what was observed, assuming that the null hypothesis (that only chance is operating) is true. This definition may be a little intimidating the first time you read it, but it’s important to come back to this “The Lady Tasting Tea” problem whenever you encounter \\(p\\)-values as you begin to learn about the concept. Here the \\(p\\)-value corresponds to how often 9 or more heads occurred in our null distribution of heads. We can use another neat feature of R to calculate the \\(p\\)-value for this problem. Note that “more extreme” in this case corresponds to looking at values of 9 or greater since our alternative hypothesis invokes a right-tail test corresponding to a “greater than” hypothesis of \\(H_a: \\tau > 5\\). In other words, we are looking to see how likely it is for the lady to pick 9 or more correct instead of 9 or fewer correct. That is, we look in the direction given by the alternative hypothesis (the right tail). pvalue_tea <- simGuesses %>% filter(heads >= 9) %>% nrow() / nrow(simGuesses) Let’s walk through each step of this calculation: First, pvalue_tea will be the name of our calculated \\(p\\)-value and the assignment operator <- directs us to this naming. We are working with the simGuesses data frame here so that comes immediately before the pipe operator. We would like to only focus on the rows in our simGuesses data frame that have heads values of 9 or 10. This represents simulated statistics “as extreme or more extreme” than what we observed (9 correct guesses out of 10). To get a glimpse of what we have up to this point, run simGuesses %>% filter(heads >= 9) %>% View(). Now that we have changed the focus to only those rows whose number of heads out of 10 flips is 9 or more, we count how many of those there are. The function nrow gives how many entries are in this filtered data frame and lastly we calculate the proportion that are at least as extreme as our observed value of 9 by dividing by the number of total simulations (10,000). We can see that the observed statistic of 9 correct guesses is not a likely outcome assuming the null hypothesis is true. Only around 1% of the outcomes in our 10,000 simulations fall at or above 9 successes. 
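As a sanity check on that simulated value, we can compare it against the exact probability of 9 or more heads in 10 fair coin flips; a quick sketch, assuming the fair-coin binomial model of \\(H_0\\):

```{r eval=FALSE}
# Exact probability of 9 or 10 heads in 10 flips of a fair coin
sum(dbinom(9:10, size = 10, prob = 0.5))
## [1] 0.01074219
```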
We have evidence supporting the conclusion that the person is actually better than just guessing at random at determining whether milk has been added first or not. To better visualize this we can also make use of blue shading on the histogram corresponding to the \\(p\\)-value: library(ggplot2) ggplot(data = simGuesses, aes(x = factor(heads), fill = (heads >= 9))) + geom_bar() + labs(x = "heads") Figure 7.4: Barplot of heads with p-value highlighted This helps us better see just how few of the values of heads are at our observed value or more extreme. This idea of a \\(p\\)-value can be extended to the methods using normal and \\(t\\) distributions in the way that introductory statistics has traditionally been presented. These traditional methods were used because statisticians haven’t always been able to do 10,000 simulations on the computer within seconds. We’ll elaborate on this more in a few sections. Learning check (LC7.6) How could we make Table 7.1 into a tidy data frame? (LC7.7) What is meant by “pseudo-random number generation?” (LC7.8) How can simulation be used to help us address the question of whether or not an observed result is statistically significant? (LC7.9) In Chapter 4, we noted that barplots should be used when creating a plot of categorical variables. Why are we using barplots to make a plot of a numerical variable heads in this chapter? 7.7 EXAMPLE: Comparing two means 7.7.1 Randomization/Permutation We will now focus on building hypotheses looking at the difference between two population means in an example. We will denote population means using the Greek symbol \\(\\mu\\) (pronounced “mu”). Thus, we will be looking to see if one group “out-performs” another group. This is quite possibly the most common type of statistical inference and serves as a basis for many other types of analyses when comparing the relationship between two variables. Our null hypothesis will be of the form \\(H_0: \\mu_1 = \\mu_2\\), which can also be written as \\(H_0: \\mu_1 - \\mu_2 = 0\\). Our alternative hypothesis will be of the form \\(H_a: \\mu_1 \\star \\mu_2\\) (or \\(H_a: \\mu_1 - \\mu_2 \\, \\star \\, 0\\)) where \\(\\star\\) = \\(<\\), \\(\\ne\\), or \\(>\\) depending on the context of the problem. You needn’t focus on these new symbols too much at this point. It will just be a shortcut way for us to describe our hypotheses. As we saw earlier, simulation is a valuable tool when conducting inferences based on one population variable. We will see that the process of randomization (also known as permutation) will be valuable in conducting tests comparing quantitative values from two groups. 7.7.2 Comparing Action and Romance Movies The movies data set in the ggplot2movies package contains information on a large number of movies that have been rated by users of IMDB.com (Wickham 2015). We are interested in the question here of whether Action movies are rated higher on IMDB than Romance movies. 
We will first need to do a little bit of data manipulation, using the ideas from Chapter 5, to get the data in the form that we would like:

library(dplyr)
library(ggplot2movies)
(movies_trimmed <- movies %>%
  select(title, year, rating, Action, Romance))
## # A tibble: 58,788 × 5
##                        title  year rating Action Romance
##                        <chr> <int>  <dbl>  <int>   <int>
## 1                          $  1971    6.4      0       0
## 2          $1000 a Touchdown  1939    6.0      0       0
## 3     $21 a Day Once a Month  1941    8.2      0       0
## 4                    $40,000  1996    8.2      0       0
## 5   $50,000 Climax Show, The  1975    3.4      0       0
## 6                      $pent  2000    4.3      0       0
## 7                    $windle  2002    5.3      1       0
## 8                       '15'  2002    6.7      0       0
## 9                        '38  1987    6.6      0       0
## 10                   '49-'17  1917    6.0      0       0
## # ... with 58,778 more rows

Note that Action and Romance are binary variables here. To remove any overlap (and potential confusion), we will exclude movies that are both Action and Romance from our population:

movies_trimmed <- movies_trimmed %>%
  filter(!(Action == 1 & Romance == 1))

We will now create a new variable called genre that specifies whether a movie in our movies_trimmed data frame is an "Action" movie, a "Romance" movie, or "Neither". We aren't really interested in the "Neither" category here, so we will exclude those rows as well. Lastly, the Action and Romance columns are no longer needed, since they are encoded in the genre column.

movies_trimmed <- movies_trimmed %>%
  mutate(genre = ifelse(Action == 1, "Action",
    ifelse(Romance == 1, "Romance", "Neither"))) %>%
  filter(genre != "Neither") %>%
  select(-Action, -Romance)

We are left with 8878 movies in our population data set, consisting only of "Action" and "Romance" movies.

Learning check
(LC7.10) Why are the different genre variables stored as binary variables (1s and 0s) instead of just listing the genre as a column of values like "Action", "Comedy", etc.?
(LC7.11) What complications could arise from excluding movies that are both action and romance? Should we question the results of our hypothesis test? Explain.

Let's now visualize the distributions of rating across both levels of genre. Think about what type(s) of plot is/are appropriate here before you proceed:

library(ggplot2)
ggplot(data = movies_trimmed, aes(x = genre, y = rating)) +
  geom_boxplot()

Figure 7.5: Rating vs genre in the population

We can see that the middle 50% of ratings for "Action" movies is more spread out than that of "Romance" movies in the population. "Romance" has outliers at both the top and bottom of the scale, though. We are initially interested in comparing the mean rating across these two groups, so a faceted histogram may also be useful:

ggplot(data = movies_trimmed, mapping = aes(x = rating)) +
  geom_histogram(binwidth = 1, color = "white", fill = "dodgerblue") +
  facet_grid(genre ~ .)

Figure 7.6: Faceted histogram of genre vs rating

Important note: Remember that we hardly ever have access to the population values as we do here. This example and the nycflights13 data set were used to create a common flow from chapter to chapter. In nearly all circumstances, we'll need to use only a sample of the population to try to infer conclusions about the unknown population parameter values. These examples do show a nice relationship between statistics (where data is usually small and more focused on experimental settings) and data science (where data is frequently large and collected without experimental conditions).
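As a quick sanity check (a sketch assuming the movies_trimmed object built above), we can confirm the size of our population and see how it splits across the two genres:

movies_trimmed %>% count(genre)

The two counts should sum to the 8878 movies noted above.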
7.7.3 Sampling \\(\\rightarrow\\) Randomization

We can use hypothesis testing to investigate, for example, whether a treatment has an effect over a control, or whether one group performs better than, worse than, or differently from another. We will also use confidence intervals to determine the size of the effect, if one exists; you'll see more on this in Chapter 8. We are interested here in seeing how we can use a random sample of action movies and a random sample of romance movies from movies to determine whether a statistical difference exists in the mean ratings of the two groups.

Learning check
(LC7.12) Define the relevant parameters here in terms of the populations of movies.

7.7.4 Data

Let's select a random sample of 34 action movies and a random sample of 34 romance movies. (The number 34 was chosen somewhat arbitrarily here.)

library(dplyr)
library(mosaic)
set.seed(2016)
movies_genre_sample <- movies_trimmed %>%
  group_by(genre) %>%
  sample_n(34)

We can now observe the distributions of the sample ratings for both groups. Remember that these plots should be rough approximations of the population distributions of movie ratings for "Action" and "Romance" in our population of all movies in the movies data frame.

ggplot(data = movies_genre_sample, aes(x = genre, y = rating)) +
  geom_boxplot()

Figure 7.7: Genre vs rating for our sample

ggplot(data = movies_genre_sample, mapping = aes(x = rating)) +
  geom_histogram(binwidth = 1, color = "white", fill = "dodgerblue") +
  facet_grid(genre ~ .)

Figure 7.8: Genre vs rating for our sample as faceted histogram

Learning check
(LC7.13) What single value could we change to improve the approximation of the population distribution by the sample distribution?

Do we have reason to believe, based on the sample distributions of rating over the two groups of genre, that there is a significant difference between the mean rating for action movies and that for romance movies? It's hard to say just based on the plots. The boxplot does show that the median sample rating is higher for romance movies, but the histogram isn't as clear. The two groups have somewhat differently shaped distributions, but they are both spread over similar values of rating. It's often useful to also calculate the mean and standard deviation conditioned on the two levels:

summary_ratings <- movies_genre_sample %>%
  group_by(genre) %>%
  summarize(mean = mean(rating),
    std_dev = sd(rating),
    n = n())
summary_ratings %>% kable()

genre     mean       std_dev    n
Action    5.197059   1.464837   34
Romance   6.026471   1.202096   34

Learning check
(LC7.14) Why did we not specify na.rm = TRUE here as we did in Chapter 5?

We see that the sample mean rating for romance movies, \\(\\bar{x}_{r}\\), is greater than the corresponding measure for action movies, \\(\\bar{x}_a\\). But is it statistically significantly greater (thus leading us to conclude that the means are statistically different)? The standard deviations can provide some insight here, but with the two standard deviations being so similar, it's still hard to say for sure.

Learning check
(LC7.15) Why might the standard deviation provide some insight about the means being statistically different or not?

7.7.5 Model of \\(H_0\\)

The hypotheses we specified can also be written in another form that better conveys what we will be simulating to create our null distribution:
\\(H_0: \\mu_r - \\mu_a = 0\\)
\\(H_a: \\mu_r - \\mu_a \\ne 0\\)

7.7.6 Test Statistic \\(\\delta\\)

We are, therefore, interested in seeing whether the difference in the sample means, \\(\\bar{x}_r - \\bar{x}_a\\), is statistically different from 0. R has a built-in function that can calculate the difference in these two sample means.

7.7.7 Observed effect \\(\\delta^*\\)

mean_ratings <- movies_genre_sample %>%
  group_by(genre) %>%
  summarize(mean = mean(rating))
obs_diff <- diff(mean_ratings$mean)

We see here that the diff function calculates \\(\\bar{x}_r - \\bar{x}_a = 6.0264706 - 5.1970588 = 0.8294118\\). We will now proceed much as we did for the Lady Tasting Tea hypothesis test, using simulation. Our goal is to figure out a random process that simulates the null hypothesis being true. Earlier in this chapter, flipping a fair coin was the random process we used to simulate the null hypothesis being true (\\(H_0: \\tau = 5\\)).

7.7.8 Simulated Data

Tactile simulation
Here, since we are assuming the two population means are equal (\\(H_0: \\mu_r - \\mu_a = 0\\)), we can look at this from a tactile point of view by using index cards. There are \\(n_r = 34\\) data elements corresponding to romance movies and \\(n_a = 34\\) corresponding to action movies. We write the 34 ratings from our sample of romance movies on one set of 34 index cards and the 34 ratings for action movies on another set of 34 index cards. (Note that the sample sizes need not be the same.) The next step is to put the two stacks of index cards together, creating a new deck of 68 cards. If we assume that the two population means are equal, we are saying that there is no association between ratings and genre (romance vs. action). We can then use the index cards to create two new stacks for romance and action movies: first, we shuffle all 68 cards thoroughly; then, since the sample sizes are equal in this case, we split the deck in half. We calculate the new sample mean rating of the romance deck and the new sample mean rating of the action deck. This creates one simulation of the samples that were collected originally. We next calculate a statistic from these two samples: the difference in their means. Instead of actually doing the calculation with index cards, we can use R, as we have before, to simulate this process.

library(mosaic)
shuffled_ratings <- movies_trimmed %>%
  mutate(rating = shuffle(rating)) %>%
  group_by(genre) %>%
  summarize(mean = mean(rating))
diff(shuffled_ratings$mean)
## [1] -0.02287811

Learning check
(LC7.16) How would the tactile shuffling of index cards change if we had samples of different sizes, say 20 action movies and 60 romance movies? Describe each step that would change.
(LC7.17) Why are we taking the difference in the means of the cards in the new shuffled decks?

7.7.9 Distribution of \\(\\delta\\) under \\(H_0\\)

The only new function here is shuffle from the mosaic package, which randomly permutes the values of rating across the two levels of genre, just as we could have done with index cards. We can now proceed in a similar way to the Lady Tasting Tea example, repeating this process many times to create a null distribution of simulated differences in sample means.

set.seed(2016)
many_shuffles <- do(10000) * (movies_trimmed %>%
  mutate(rating = shuffle(rating)) %>%
  group_by(genre) %>%
  summarize(mean = mean(rating))
)

It is a good idea at this point to inspect the many_shuffles data frame via View(many_shuffles).
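Each of the 10,000 repetitions contributes two rows to many_shuffles, one mean per level of genre, and do automatically adds a .index column identifying the repetition. A quick peek at its structure (assuming the code above has been run):

# each repetition yields one mean per genre; .index labels the repetition
head(many_shuffles)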
We need a way to subtract the first value of mean from the second value of mean for each of the 10,000 simulations. This is a little tricky, but the group_by function comes to our rescue here:

rand_distn <- many_shuffles %>%
  group_by(.index) %>%
  summarize(diffmean = diff(mean))

We can now plot the distribution of these simulated differences in means:

ggplot(data = rand_distn, aes(x = diffmean)) +
  geom_histogram(color = "white", bins = 20)

Figure 7.9: Simulated differences in means histogram

7.7.10 The p-value

Remember that we are interested in seeing where our observed sample mean difference of 0.8294118 falls on this null/randomization distribution. We are interested in a difference in either direction here, so "more extreme" corresponds to values in both tails of the distribution. Let's shade our null distribution to give a visual representation of our \\(p\\)-value:

ggplot(data = rand_distn, aes(x = diffmean, fill = (abs(diffmean) >= obs_diff))) +
  geom_histogram(color = "white", bins = 20)

Figure 7.10: Shaded histogram to show p-value

You may initially think there is an error here, but remember that the observed difference in means was 0.8294118; it falls far outside the range of the simulated differences. We can instead add vertical lines to represent both it and its negative (since this is a two-tailed test):

ggplot(data = rand_distn, aes(x = diffmean)) +
  geom_histogram(color = "white", bins = 100) +
  geom_vline(xintercept = obs_diff, color = "red") +
  geom_vline(xintercept = -obs_diff, color = "red")

Figure 7.11: Histogram with vertical lines corresponding to observed statistic

Based on this plot, no simulated values are as extreme or more extreme than our observed effect in either direction, so we have evidence supporting the conclusion that the mean rating for romance movies is different from that of action movies. (It doesn't really matter what significance level was chosen in this case. Think about why.) The next important idea is to better understand just how much higher a mean rating we can expect romance movies to have compared to action movies. This can be addressed by creating a 95% confidence interval, as we will explore in Chapter 8.

Learning check
(LC7.18) Conduct the same analysis comparing action movies versus romance movies, but using the median rating instead of the mean rating. Make sure to use the %>% as much as possible. What was different and what was the same?
(LC7.19) What conclusions can you make from viewing the faceted histogram of rating versus genre that you couldn't see when looking at the boxplot?
(LC7.20) Describe in a paragraph how we used Allen Downey's diagram to conclude whether a statistical difference existed between the mean movie ratings for action and romance movies.
(LC7.21) Why are we relatively confident that the distributions of the sample ratings will be good approximations of the population distributions of ratings for the two genres?
(LC7.22) Using the definition of "\\(p\\)-value", write in words what the \\(p\\)-value represents for the hypothesis test above comparing the mean rating of romance movies to that of action movies.
(LC7.23) What is the value of the \\(p\\)-value for the hypothesis test comparing the mean rating of romance movies to that of action movies?
(LC7.24) Do the results of the hypothesis test match up with the original plots we made looking at the population of movies? Why or why not?
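Although the plots make it clear that the \\(p\\)-value is essentially 0 (see LC7.23), we can also compute the two-tailed simulated \\(p\\)-value directly, mirroring the pvalue_tea calculation from earlier (a sketch assuming the rand_distn and obs_diff objects from above; pvalue_movies is just a name chosen here):

pvalue_movies <- rand_distn %>%
  filter(abs(diffmean) >= obs_diff) %>%
  nrow() / nrow(rand_distn)
pvalue_movies

Since none of the 10,000 simulated differences is as extreme as the observed difference, this proportion is 0.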
7.7.11 Summary

To review, these are the steps to take whenever you'd like to conduct a hypothesis test comparing values from the distributions of two groups:

1. Simulate many samples using a random process that matches the way the original data were collected and that assumes the null hypothesis is true.
2. Collect the values of a sample statistic for each sample created by this random process, building a randomization distribution.
3. Assess the significance of the original sample by determining where its sample statistic lies in the randomization distribution.
4. If the proportion of values as extreme or more extreme than the observed statistic in the randomization distribution is smaller than the pre-determined significance level \\(\\alpha\\), we reject \\(H_0\\). Otherwise, we fail to reject \\(H_0\\). (If no significance level is given, one can assume \\(\\alpha = 0.05\\).)

7.8 Building theory-based methods using computation

As a point of reference, we will now discuss the traditional theory-based way to conduct a hypothesis test for a statistically significant difference in the sample mean ratings of Action and Romance movies. Methods like this one work very well when their assumptions are met; they are based on probability models and distributions such as the normal and \\(t\\) distributions. These traditional methods have been used for many decades, dating back to a time when researchers didn't have access to computers that could run 10,000 simulations in under a minute, so methods had to be built on probability theory instead. Many fields and researchers continue to use them, and that is the biggest reason for their inclusion here. It's important to remember that a \\(t\\)-test or a \\(z\\)-test is really just an approximation of what you have already seen in this chapter using simulation and randomization. The focus here is on understanding how the shape of the \\(t\\)-curve comes about without digging deeply into the mathematical underpinnings.

7.8.1 EXAMPLE: \\(t\\)-test for two independent samples

A common procedure in statistics is normalization: calculate the mean and standard deviation of a variable, then subtract the mean from each value of the variable and divide by the standard deviation. The most common normalization is known as the \\(z\\)-score. The formula for a \\(z\\)-score is

\\[Z = \\frac{x - \\mu}{\\sigma},\\]

where \\(x\\) represents a value of the variable, \\(\\mu\\) represents the mean of the variable, and \\(\\sigma\\) represents the standard deviation of the variable. Thus, if your variable has 10 elements, each one has a corresponding \\(z\\)-score that gives how many standard deviations that value lies away from its mean. If the variable is normally distributed, its \\(z\\)-scores follow a standard normal distribution, with mean 0 and standard deviation 1 and the common bell-shaped pattern seen below.
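As a small numeric illustration of normalization (the vector x here is made up for demonstration):

x <- c(2, 4, 6, 8, 10)        # hypothetical values
z <- (x - mean(x)) / sd(x)    # each z-score: distance from the mean in standard deviation units
z
mean(z)  # 0 by construction
sd(z)    # 1 by construction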
Recall that we hardly ever know the mean and standard deviation of the population of interest. This is almost always the case when considering the means of two independent groups. To account for not knowing the population parameter values, we can use the sample statistics instead, but this comes at a price in terms of complexity. Another form of normalization occurs when we need to use the sample standard deviations as estimates of the unknown population standard deviations; the resulting normalized statistic is often called the \\(t\\)-score. For the two independent samples case, like our comparison of action movies to romance movies, the formula is

\\[T = \\dfrac{(\\bar{x}_1 - \\bar{x}_2) - (\\mu_1 - \\mu_2)}{\\sqrt{\\dfrac{s_1^2}{n_1} + \\dfrac{s_2^2}{n_2}}}.\\]

There is a lot to unpack here:

- \\(\\bar{x}_1\\) is the sample mean response of the first group
- \\(\\bar{x}_2\\) is the sample mean response of the second group
- \\(\\mu_1\\) is the population mean response of the first group
- \\(\\mu_2\\) is the population mean response of the second group
- \\(s_1\\) is the sample standard deviation of the response of the first group
- \\(s_2\\) is the sample standard deviation of the response of the second group
- \\(n_1\\) is the sample size of the first group
- \\(n_2\\) is the sample size of the second group

Assuming that the null hypothesis is true (\\(H_0: \\mu_1 - \\mu_2 = 0\\)), \\(T\\) is said to follow a \\(t\\) distribution with degrees of freedom equal to the smaller of \\(n_1 - 1\\) and \\(n_2 - 1\\). The "degrees of freedom" can be thought of as measuring how different the \\(t\\) distribution is from a normal distribution. Small sample sizes lead to small degrees of freedom and, thus, \\(t\\) distributions with more probability in their tails; large sample sizes lead to large degrees of freedom and, thus, \\(t\\) distributions that closely align with the standard normal bell-shaped curve. So, assuming \\(H_0\\) is true, our formula simplifies a bit:

\\[T = \\dfrac{\\bar{x}_1 - \\bar{x}_2}{\\sqrt{\\dfrac{s_1^2}{n_1} + \\dfrac{s_2^2}{n_2}}}.\\]

We have already built an approximation of the distribution of \\(\\delta = \\bar{x}_1 - \\bar{x}_2\\) using randomization above. Recall this distribution:

ggplot(data = rand_distn, aes(x = diffmean)) +
  geom_histogram(color = "white", bins = 20)

Figure 7.12: Simulated differences in means histogram

If we'd like a guess as to what the distribution of \\(T\\) might look like instead, we need only divide every value in rand_distn by

\\[\\sqrt{\\dfrac{s_1^2}{n_1} + \\dfrac{s_2^2}{n_2}}.\\]

As we did before, we assign Romance to be group 1 and Action to be group 2. (Romance comes second alphabetically, which is why the group numbers appear mismatched with the table below.) Remember that we've already calculated these values:

kable(summary_ratings)

genre     mean       std_dev    n
Action    5.197059   1.464837   34
Romance   6.026471   1.202096   34

We will create some shortcuts here so you can see the value being calculated for the denominator of \\(T\\):

s1 <- summary_ratings$std_dev[2]
s2 <- summary_ratings$std_dev[1]
n1 <- summary_ratings$n[2]
n2 <- summary_ratings$n[1]

Here, we have \\(s_1 = 1.2020964\\), \\(s_2 = 1.4648374\\), \\(n_1 = 34\\), and \\(n_2 = 34\\). We can calculate the denominator via

(denom_T <- sqrt( (s1^2 / n1) + (s2^2 / n2) ))
## [1] 0.3249789

Now we rescale the values of diffmean in rand_distn by denom_T (together with a factor of 10 in the code below) to obtain a simulated distribution of \\(T\\) test statistics:

rand_distn <- rand_distn %>%
  mutate(t_stat = diffmean / denom_T * 10)
ggplot(data = rand_distn, aes(x = t_stat)) +
  geom_histogram(color = "white", bins = 20)

Figure 7.13: Simulated T statistics histogram

We see that the shape of this distribution is the same as that of diffmean; only the scale has changed, with t_stat being more spread out than diffmean.
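To see this change of scale numerically (a sketch assuming the rand_distn object from above), we can compare the standard deviations of the two columns:

rand_distn %>%
  summarize(sd_diffmean = sd(diffmean), sd_t_stat = sd(t_stat))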
A traditional \\(t\\)-test doesn't look at this simulated distribution; instead it looks at the \\(t\\)-curve with degrees of freedom equal to 33 (the minimum of \\(n_1 - 1 = 34 - 1 = 33\\) and \\(n_2 - 1 = 34 - 1 = 33\\)). This curve is frequently called a density curve, which is why we specify y = ..density.. in the geom_histogram below. We now overlay this \\(t\\)-curve on top of the histogram of the simulated \\(T\\) statistics.

ggplot(data = rand_distn, mapping = aes(x = t_stat)) +
  geom_histogram(aes(y = ..density..), color = "white", binwidth = 0.3) +
  stat_function(fun = dt, args = list(df = min(n1 - 1, n2 - 1)),
    color = "royalblue", size = 2)

We can see that the curve does a good job of approximating the randomization distribution here. (More on when to expect this to be the case when we discuss conditions for the \\(t\\)-test in a bit.) To calculate the \\(p\\)-value in this case, we need to figure out how much of the total area under the \\(t\\)-curve lies at or above our observed \\(T\\)-statistic, plus the area at or below the negative of the observed \\(T\\)-statistic. (Remember this is a two-tailed test, so we are looking for a difference in either direction: values in both tails.) Just as we converted all of the simulated values to \\(T\\)-statistics, we must also do so for our observed effect \\(\\delta^*\\):

(t_obs <- obs_diff / denom_T)
## [1] 2.552202

So graphically we are interested in finding the percentage of values that are at or above 2.5522017 or at or below -2.5522017.

ggplot(data = rand_distn, mapping = aes(x = t_stat)) +
  stat_function(fun = dt, args = list(df = min(n1 - 1, n2 - 1)),
    color = "royalblue", size = 2) +
  geom_vline(xintercept = t_obs, color = "red") +
  geom_vline(xintercept = -t_obs, color = "red")

At this point, you should make a guess as to what a reasonable value for the \\(p\\)-value might be; let's say it is 0.01 or so. To actually perform this calculation by hand, you'd need to do some calculus, so let's have R do it for us instead, using the pt function.

pt(t_obs, df = min(n1 - 1, n2 - 1), lower.tail = FALSE) +
  pt(-t_obs, df = min(n1 - 1, n2 - 1), lower.tail = TRUE)
## [1] 0.01551859

7.8.2 Conditions for t-test

In order for the results of the \\(t\\)-test to be valid, three conditions must be met:

1. Independent observations in both samples
2. Nearly normal populations OR large sample sizes (\\(n \\ge 30\\))
3. Independently selected samples

Condition 1: This is met, since we sampled at random using R from our population.
Condition 2: Recall from Figure 7.6 that we know how the populations are distributed; both are close to normally distributed. If we are a little concerned about this assumption, we also have samples of size larger than 30 (\\(n_1 = n_2 = 34\\)).
Condition 3: This is met, since there is no natural pairing of a movie in the Action group to a movie in the Romance group.

Since all three conditions are met, we can be reasonably certain that the theory-based test will match the results of the randomization-based test using shuffling. Remember that theory-based tests can produce incorrect results if these assumptions are not carefully checked. The only assumption for randomization- and computation-based methods is that the sample is selected at random.
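For comparison, base R's t.test function carries out this entire theory-based procedure in one call (a sketch assuming the movies_genre_sample data frame from above). Note two differences from our by-hand approach: t.test defaults to the Welch degrees-of-freedom approximation rather than the conservative min(n1 - 1, n2 - 1), and it computes the Action minus Romance difference (factor levels are ordered alphabetically), so expect a sign flip and a slightly different \\(p\\)-value:

t.test(rating ~ genre, data = movies_genre_sample)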
Randomization-based methods are our preference, and we strongly believe they should be yours as well, but it's important to also see how theory-based tests are done and how they approximate the computational techniques, at least until more researchers adopt methods that utilize the power of computers.

7.9 Conclusion

7.9.1 Script of R code

An R script file of all R code used in this chapter is available here.

7.9.2 What's to come?

This chapter examined the basics of hypothesis testing, its terminology, and how to apply the "There is Only One Test" diagram to the Lady Tasting Tea example presented in Chapter 6 and to an example comparing the IMDB ratings of action movies and romance movies. We'll see in Chapter 8 how we can provide a range of possible values for an unknown population parameter instead of just making a Yes/No decision from a hypothesis test. We will see in Chapter 9 many of the same ideas we have seen with hypothesis testing and confidence intervals in the last two chapters. Regression is frequently associated, both correctly and incorrectly, with statistics and data analysis, so you'll need to make sure you understand when it is appropriate and when it is not.

References "],
+["8-ci.html", "8 Confidence Intervals 8.1 Bootstrapping 8.2 Relation to hypothesis testing 8.3 Effect size 8.4 Conclusion", " 8 Confidence Intervals Definition: Confidence Interval A confidence interval gives a range of plausible values for a parameter. It depends on a specified confidence level with higher confidence levels corresponding to wider confidence intervals and lower confidence levels corresponding to narrower confidence intervals. Common confidence levels include 90%, 95%, and 99%. Usually we don’t just begin chapters with a definition, but confidence intervals are simple to define and play an important role in the sciences and any field that uses data. You can think of a confidence interval as playing the role of a net when fishing. Instead of just trying to catch a fish with a single spear (estimating an unknown parameter by using a single point estimate/statistic), we can use a net to try to provide a range of possible locations for the fish (use a range of possible values based around our statistic to make a plausible guess as to the location of the parameter). Needed packages library(dplyr) library(ggplot2) library(mosaic) library(knitr) 8.1 Bootstrapping Just as we did in Chapter 7 with the Lady Tasting Tea when making hypotheses about a population total with which we would like to test which one is more plausible, we can also use computation to infer conclusions about a population quantitative statistic such as the mean. In this case, we will focus on constructing confidence intervals to produce plausible values for a population mean. (We can do a similar analysis for a population median or other summary measure as well.) Traditionally, the way to construct confidence intervals for a mean is to assume a normal distribution for the population or to invoke the Central Limit Theorem and get, what often appears to be magic, results. (This is similar to what was done in Section 7.8.) These methods are often not intuitive, especially for those that lack a strong mathematical background. They also come with their fair share of assumptions and often turn Statistics, a field that is full of tons of useful applications to many different fields and disciplines, into a robotic procedural-based topic. It doesn’t have to be that way! In this section, we will introduce the concept of bootstrapping. It will be a useful tool that will allow us to estimate the variability of our statistic from sample to sample. One neat feature of bootstrapping is that it enables us to approximate the sampling distribution and estimate the distribution’s standard deviation using ONLY the information in the one selected (original) sample. It sounds just as plagued with the magical type qualities of traditional theory-based inference on initial glance but we will see that it provides an intuitive and useful way to make inferences, especially when the samples are of medium to large size. To introduce the concept of bootstrapping, we again will use the movies data set in the ggplot2movies data frame. Remember that we load this data frame into R in much the same way as we loaded flights and weather from the nycflights13 package. library(ggplot2movies) data(movies, package = "ggplot2movies") Recall that you can also glance at this data frame using the View function and look at the help documentation for movies using the ? function. We will explore many other features of this data set in the chapters to come, but here we will be focusing on the rating variable corresponding to the average IMDB user rating. 
You may notice that this data set is quite large: 58,788 movies have data collected about them here. This will correspond to our population of ALL movies. Remember from Chapter 6 that our population is rarely known; we use this data set as our population here to show you the power of bootstrapping in estimating population parameters. We'll see how confidence intervals built using the bootstrap distribution perform at including our population parameter of interest. Here we can actually calculate the population values, since our population is known, but remember that in general this isn't the case.

Let's take a look at the distribution of our population ratings. We will later use the distribution of our sample(s) as an estimate of this population histogram.

movies %>% ggplot(aes(x = rating)) +
  geom_histogram(color = "white", bins = 20)

Figure 8.1: Population ratings histogram

Learning check
(LC8.1) Why was a histogram chosen as the plot to make for the rating variable above?
(LC8.2) What does the shape of the rating histogram tell us about how IMDB users rate movies? What stands out about the plot?

It's important to think about what our goal is here: we would like to produce a confidence interval for the population mean rating. We will have to pretend for a moment that we don't have all 58,788 movies. Let's say that we only have a random sample of 50 movies from this data set instead. In order to get a random sample, we can use the resample function in the mosaic package with replace = FALSE. We could also use the sample_n function from dplyr.

set.seed(2017)
library(mosaic)
library(dplyr)
movies_sample <- movies %>%
  resample(size = 50, replace = FALSE)

The resample function has sampled 50 rows at random from the larger movies data frame, and we store the information on these 50 movies in the movies_sample data frame. Let's now explore what the rating variable looks like for these 50 movies:

movies_sample %>% ggplot(aes(x = rating)) +
  geom_histogram(color = "white", bins = 20)

Figure 8.2: Sample ratings histogram

Remember that we can think of this histogram as an estimate of the population distribution histogram we saw above. We are interested in the population mean rating and are trying to find a range of plausible values for it. A good starting guess at the population mean is the mean of our sample ratings from the movies_sample data:

(movies_sample_mean <- movies_sample %>%
  summarize(mean = mean(rating)))
## # A tibble: 1 × 1
##    mean
##   <dbl>
## 1 5.894

Note the use of ( ) around the creation of the movies_sample_mean object: if you'd like to print out a newly created object, you can enclose the assignment in parentheses as we have here.

This value of 5.894 is just one guess at the population mean. The idea behind bootstrapping is to sample with replacement from the original sample to create new resamples of the same size as our original sample. Returning to our example, let's investigate what one such resample of the movies_sample data set accomplishes. We can create one resample/bootstrap sample by using the resample function in the mosaic package:

boot1 <- resample(movies_sample) %>%
  arrange(orig.id)

The important thing to note here is the column of original row numbers from the movies_sample data frame, on the far right, called orig.id. Since we are sampling with replacement, there is a strong likelihood that some of the 50 observational units will be selected more than once.
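We can tally how often each original movie was drawn in this resample (a sketch assuming the boot1 object from above). Rows with n greater than 1 were selected multiple times, while original rows that do not appear were never selected:

boot1 %>% count(orig.id, sort = TRUE)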
You may be asking yourself what this means and how it leads us to creating a distribution for the sample mean. Recall that the original sample mean of our data was calculated using the summarize function above.

Learning check
(LC8.3) What happens if we change the seed for our pseudo-random number generation? Try changing the seed above where we used resample and describe the resulting movies_sample.
(LC8.4) Why is sampling at random from the movies data frame important? Why don't we just pick Action movies and do bootstrapping with this Action movies subset?
(LC8.5) What was the purpose of assuming we didn't have access to the full movies data set here?

Earlier we calculated the mean of our original sample: 5.894. Let's calculate the mean of ratings in our bootstrapped sample:

(movies_boot1_mean <- boot1 %>% summarize(mean = mean(rating)))
## # A tibble: 1 × 1
##    mean
##   <dbl>
## 1 5.686

More than likely, the calculated bootstrap sample mean is different from the original sample mean. This is what was meant earlier by the sample means having some variability. What we are trying to do is replicate many different samples being taken from a larger population. Our best guess at what the population looks like is multiple copies of the sample we collected, and we can sample from that larger "created" population by generating bootstrap samples. Similar to what we did in the previous section, we can repeat this process using the do function followed by an asterisk. Let's look at 10 different bootstrap means for ratings from movies_sample. Note the use of the resample function here:

do(10) * (resample(movies_sample) %>%
  summarize(mean = mean(rating)))
##     mean
## 1  5.942
## 2  5.572
## 3  5.828
## 4  6.292
## 5  6.032
## 6  5.920
## 7  5.996
## 8  5.852
## 9  6.098
## 10 5.608

You should see some variability begin to tease its way out here. Many of the simulated means will be close to our original sample mean, but some will stray pretty far away. This can occur because outliers were selected several times in the resampling, or because smaller values were selected more often than larger ones; there are myriad reasons why this might be the case. So what's the next step? Just as we ran thousands of repetitions in the "Lady Tasting Tea" example, we can do a similar thing here:

trials <- do(10000) * summarize(resample(movies_sample),
  mean = mean(rating))
ggplot(data = trials, mapping = aes(x = mean)) +
  geom_histogram(bins = 30, color = "white")

Figure 8.3: Bootstrapped means histogram

The shape of this resulting distribution may look familiar to you: it resembles the well-known normal (bell-shaped) curve. At this point, we can easily calculate a confidence interval; in fact, we have a couple of different options. We will first use the percentiles of the distribution we just created to isolate the middle 95% of values. This will correspond to our 95% confidence interval for the population mean rating, denoted by \\(\\mu\\).

(ciq_mean_rating <- confint(trials, level = 0.95, method = "quantile"))
##   name lower upper level     method estimate
## 1 mean 5.456 6.296  0.95 percentile    5.894

It's always important at this point to interpret the results of a confidence interval calculation. In this context, we can say something like the following: based on the sample data and bootstrapping techniques, we can be 95% confident that the true mean rating of ALL IMDB ratings is between 5.456 and 6.296. This statement may seem a little confusing to you.
Another way to think about this is that the confidence interval was constructed from the sample data by a procedure that is 95% reliable: about 5% of the time, the procedure produces an interval that misses the parameter. Just as we had a trade-off between \\(\\alpha\\) and \\(\\beta\\) with hypothesis tests, we have a similar trade-off here in setting the confidence level. To further reiterate this point, the graphic below from Diez, Barr, and Çetinkaya-Rundel (2014) shows that if we repeated the confidence interval process with 25 different samples, we would expect about 95% of the resulting intervals to contain the population parameter of interest. This parameter is marked with a dotted vertical line. We can see that only one confidence interval (the one marked in red) does not overlap with this value. Therefore 24 in 25 (96%), which is quite close to our 95% reliability, do include the population parameter.

Figure 8.4: Confidence interval coverage plot from OpenIntro

Remember that we are pretending we don't know the mean IMDB rating for ALL movies. Our population here is all of the movies listed in the movies data frame from ggplot2movies. So does our bootstrapped confidence interval contain the actual mean value?

movies %>% summarize(mean_rating = mean(rating))
## # A tibble: 1 × 1
##   mean_rating
##         <dbl>
## 1     5.93285

We see that the population mean does fall in our range of plausible values generated from the bootstrapped samples.

We can also get an idea of how the theory-based inference techniques would have approximated this confidence interval by using the formula

\\[\\bar{x} \\pm 2 \\cdot SE,\\]

where \\(\\bar{x}\\) is our original sample mean and \\(SE\\) stands for standard error, here the standard deviation of the bootstrap distribution. The value of 2 corresponds to this being a 95% confidence interval, since about 95% of the values in a normal distribution fall within 2 standard deviations of the mean. This formula assumes that the bootstrap distribution is symmetric and bell-shaped, which is often the case, especially when the original distribution of the sample is not highly skewed.

Definition: standard error
The standard error is the standard deviation of the sampling distribution. The sampling distribution may be approximated by the bootstrap distribution or the null distribution, depending on the context. Traditional theory-based methodologies for inference also have formulas for standard errors, assuming some conditions are met.

To compute this type of confidence interval, we only need to make a slight modification to the confint function seen above. (The expression after the \\(\\pm\\) sign is known as the margin of error.)

(cise_mean_rating <- confint(trials, level = 0.95, method = "stderr"))
##   name    lower    upper level method estimate margin.of.error
## 1 mean 5.468465 6.314277  0.95 stderr    5.894       0.4229063

Based on the sample data and bootstrapping techniques, we can be 95% confident that the true mean rating of ALL IMDB ratings is between 5.468 and 6.314.

Learning check
(LC8.6) Reproduce the bootstrapping above using a sample of size 25 instead of 50. What changes do you see?
(LC8.7) Reproduce the bootstrapping above using a sample of size 5 instead of 50. What changes do you see?
(LC8.8) How does the sample size affect the analysis above?
(LC8.9) Why must bootstrap samples be the same size as the original sample?
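For LC8.6 through LC8.8, you will repeat the entire pipeline with different sample sizes. A small helper function can scaffold this (a sketch; boot_mean_dist is a hypothetical name, and this assumes the movies data frame and the mosaic and dplyr packages loaded above):

# hypothetical helper: draw an original sample of size n from movies,
# then build a bootstrap distribution of the sample mean rating
boot_mean_dist <- function(n, reps = 10000) {
  original_sample <- movies %>% resample(size = n, replace = FALSE)
  do(reps) * summarize(resample(original_sample), mean = mean(rating))
}
trials_25 <- boot_mean_dist(25)  # compare to the size-50 analysis above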
8.1.1 Review of Bootstrapping

We can summarize the process of generating a bootstrap distribution in a series of steps that clearly identify the terminology we will use (R. Lock et al. 2012):

1. Generate bootstrap samples by sampling with replacement from the original sample, using the same sample size.
2. Compute the statistic of interest, called a bootstrap statistic, for each of the bootstrap samples.
3. Collect the statistics for many bootstrap samples to create a bootstrap distribution.

Visually, we can represent this process in the following diagram.

Figure 8.5: Bootstrapping diagram from Lock5 textbook

8.2 Relation to hypothesis testing

Recall that we found a statistically significant difference between the sample mean of romance movie ratings and the sample mean of action movie ratings. We concluded Chapter 7 by attempting to understand just how much greater we could expect the population mean romance movie rating to be compared to the population mean action movie rating. To do so, we will calculate a confidence interval for the difference \\(\\mu_r - \\mu_a\\) and then check whether it contains the actual difference in the population parameter values.

We could use bootstrapping in a way similar to what we did above, except now on the difference in sample means, to create a distribution and then use the confint function with the quantile method to determine a confidence interval for the plausible values of the difference in population means. This is an excellent programming activity, and the reader is urged to try it; one possible sketch is given at the end of this section. Recall what the randomization/null distribution looked like for our simulated shuffled sample means:

library(ggplot2)
library(dplyr)
ggplot(data = rand_distn, mapping = aes(x = diffmean)) +
  geom_histogram(color = "white", bins = 20)

Figure 8.6: Simulated shuffled sample means histogram

With this null distribution being quite symmetric and bell-shaped, the standard error method introduced above likely provides a good estimate of a range of plausible values for \\(\\mu_r - \\mu_a\\). Another nice feature here is that we can use the standard deviation of the null/randomization distribution we just found with our hypothesis test:

(std_err <- rand_distn %>% summarize(se = sd(diffmean)))
## # A tibble: 1 × 1
##           se
##        <dbl>
## 1 0.03182225

We can use the general formula \\(\\text{statistic} \\pm 2 \\cdot SE\\) for a confidence interval to obtain the following plausible values for the difference in population means at the 95% level:

(lower <- obs_diff - (2 * std_err))
##          se
## 1 0.7657673
(upper <- obs_diff + (2 * std_err))
##          se
## 1 0.8930563

We can, therefore, say that we are 95% confident that the population mean rating for romance movies is between 0.766 and 0.893 points higher than that for action movies. The important thing to check here is whether 0 is contained in the confidence interval: if it is, then 0 is a plausible value for the difference in the two population means, which would also make the null hypothesis plausible. The results of the hypothesis test and the confidence interval should match, as they do here: we rejected the null hypothesis, and the confidence interval likewise provides evidence that the mean rating for romance movies is higher than that for action movies.
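The programming activity urged above might be sketched as follows (assuming the movies_genre_sample data frame from Chapter 7). We resample with replacement within each genre, compute the Romance minus Action difference in means for each replicate, and then take the middle 95% of those differences:

boot_means <- do(10000) * (movies_genre_sample %>%
  group_by(genre) %>%
  sample_n(34, replace = TRUE) %>%   # bootstrap within each genre
  summarize(mean = mean(rating)))
boot_means %>%
  group_by(.index) %>%
  summarize(diffmean = diff(mean)) %>%   # Romance minus Action per replicate
  summarize(lower = quantile(diffmean, 0.025),
    upper = quantile(diffmean, 0.975))

Note that, unlike the shuffling used for the hypothesis test, this bootstrap keeps each rating attached to its genre.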
8.3 Effect size

The phrase effect size has been thrown around recently as an alternative to \\(p\\)-values. In combination with a confidence interval, it can often be more valuable than the results of a hypothesis test alone. Exactly what is meant by "effect size" depends on the scientific discipline, but, in general, it refers to the magnitude of the difference between group measurements. For our two-sample problem involving movies, it is the observed difference in sample means, obs_diff. It's worth mentioning here that confidence intervals built with the standard error method are centered at the observed statistic; for such an interval, if someone asks you what the "effect size" is, you can simply find the midpoint of the stated confidence interval.

Learning check
(LC8.10) Check to see whether the difference in population mean ratings for the two genres falls in the confidence interval we found here. Are we guaranteed that it will fall in the range of plausible values?
(LC8.11) Why do you think many scientific fields are shifting to preferring the inclusion of confidence intervals in articles over just \\(p\\)-values and hypothesis tests?
(LC8.12) Why is 95% related to a value of 2 in the margin of error? What would the approximate values be for 90% and for 99%?
(LC8.13) Why is a 95% confidence interval wider than a 90% confidence interval? Explain by using a concrete example from everyday life about what is meant by "confidence."
(LC8.14) How would confidence intervals correspond to one-sided hypothesis tests?
(LC8.15) There is a relationship between the significance level and the confidence level. What do you think it is?
(LC8.16) The moment the phrase "standard error" is mentioned, there seems to be someone that says "The standard error is \\(s\\) divided by the square root of \\(n\\)." This standard error formula is used in the theory-based procedure for inference on one mean. But does it always work? For samp1, samp2, and samp3 below, do the following:

1. produce a bootstrap distribution based on the sample,
2. calculate the standard deviation of the bootstrap distribution, and
3. compare this value of the standard error to what you obtain when you calculate the standard deviation of the sample \\(s\\) divided by \\(\\sqrt{n}\\).

df1 <- data_frame(samp1 = rexp(50))
df2 <- data_frame(samp2 = rnorm(100))
df3 <- data_frame(samp3 = rbeta(20, 5, 5))

Describe how well \\(s / \\sqrt{n}\\) does at approximating the standard error for these three samples and their corresponding bootstrap distributions.

8.4 Conclusion

8.4.1 Script of R code

An R script file of all R code used in this chapter is available here.

8.4.2 What's to come?

We will see in Chapter 9 many of the same ideas we have seen with hypothesis testing and confidence intervals in the last two chapters. Regression is frequently associated, both correctly and incorrectly, with statistics and data analysis, so you'll need to make sure you understand when it is appropriate and when it is not.

References "],
["9-regress.html", "9 Regression via broom 9.1 EXAMPLE: Alaskan Airlines delays 9.2 Correlation 9.3 Linear regression 9.4 Inference for regression 9.5 Residual analysis 9.6 Conditions for regression 9.7 Conclusion", " 9 Regression via broom One of the most commonly used statistical procedures is regression. Regression, in its simplest form, focuses on trying to predict values of one numerical variable based on the values of another numerical variable using a straight line fit to data. We saw in Chapters 7 and 8 an example of analyses using a categorical predictor (movie genre–action or romance) and a numerical response (movie rating). In this chapter, we will focus on going back to the flights data frame in the nycflights13 package to look at the relationship between departure delay and arrival delay. We will also discuss the concept of correlation and how it is frequently incorrectly implied to also lead to causation. This chapter also introduces the broom package, which is a useful tool in summarizing the results of model fits in tidy format. You will see examples of the tidy, glance, and augment functions with linear regression. Needed packages library(mosaic) library(dplyr) library(ggplot2) library(knitr) library(broom) library(nycflights13) 9.1 EXAMPLE: Alaskan Airlines delays We’ll next explore the relationship/association of departure delays and arrival delays for a sample of 100 flights departing from New York City in 2013 with Alaskan Airlines. library(nycflights13) data(flights) set.seed(2017) # Load Alaska data, deleting rows that have missing departure delay # or arrival delay data alaska_flights <- flights %>% filter(carrier == "AS") %>% filter(!is.na(dep_delay) & !is.na(arr_delay)) %>% resample(size = 50, replace = FALSE) ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + geom_point() Figure 9.1: Departure and Arrival Flight Delays for a sample of 50 Alaskan flights from NYC Learning check (LC9.1) Does there appear to be a linear relationship with arrival delay and departure delay? In other words, could you fit a line to the data and have it explain well how arr_delay increases as dep_delay increases? (LC9.2) Is there only one possible line that fits the data “well”? How could you decide on which one is best if there are multiple options? 9.2 Correlation One way to measure the linearity between two numerical variables is by using correlation. In fact, the correlation coefficient is defined as just that. Definition: Correlation Coefficient The correlation coefficient measures the strength of linear association between two variables. Properties of the correlation coefficient: It is always between -1 and 1, inclusive, where -1 indicates perfect negative relationship 0 indicates no relationship +1 indicates perfect positive relationship Learning check (LC9.3) Make a guess as to the value of the correlation cofficient between arr_delay and dep_delay in the alaska_flights data frame. (LC9.4) Do you think that the correlation coefficient between arr_delay and dep_delay is the same as the correlation coefficient between dep_delay and arr_delay? Explain. We can look at a variety of different data sets and their corresponding correlation coefficients in the following plot. 
Figure 9.2: Different Correlation Coefficients

We can calculate the correlation coefficient for our example of flight delays via

alaska_flights %>%
  summarize(correl = cor(dep_delay, arr_delay))
## # A tibble: 1 × 1
##   correl
##    <dbl>
## 1 0.7908

The sample correlation coefficient is denoted by \\(r\\). In this case, \\(r = 0.7908\\).

Learning check
(LC9.5) Would you quantify the value of correl calculated above as strongly positively linear, weakly positively linear, not linear, weakly negatively linear, or strongly negatively linear? Discuss your choice.

If you'd like a little more practice in determining the linear relationship between two variables by quantifying a correlation coefficient, you should check out the Guess the Correlation game online.

9.2.1 Correlation does not imply causation

Just because arrival delays are related to departure delays in a somewhat linear fashion, we can't say with certainty that arrival delays are caused entirely by departure delays. Certainly it appears that as one increases, the other tends to increase, but that might not always be the case. Causation is a tricky problem that frequently requires carefully designed experiments. Such experiments remove confounding variables and focus only on the behavior of one variable in the presence of the levels of the other variable(s). Be careful as you read studies to make sure that the writers aren't falling into the fallacy of correlation implying causation. If you spot one, you may want to send them a link to Spurious Correlations.

Learning check
(LC9.6) What are some other confounding variables besides departure delay that could contribute to an increase in arrival delays? Remember that a variable is something that has to vary!

9.3 Linear regression

We see above that there is a strong positive association between these delay variables. Let's say that we are waiting for our flight to leave New York City on Alaskan, and we are told that our flight is going to be delayed 25 minutes. What could we predict for our arrival delay based on the plot in Figure 9.1? It may be hard to pick one particular value here, especially having just covered confidence intervals in Chapter 8. One approach is to find the line that fits the data best and then use that line's predicted arr_delay value at dep_delay = 25 as our prediction. But what is meant by "fits the data best"? The least squares/best-fitting/linear regression line has been fit to the data below.

Figure 9.3: Regression line fit on delays

Here lm corresponds to "linear model", and we'll see its use again in a bit when we find the values that define this line.

9.3.1 Understanding linear regression basics

Let's choose an arbitrary point on the graph and color it blue. Now consider this point's deviation from the regression line. Do this for another point, and for another. We could repeat this process for each of the points in our sample. The pattern that emerges is that the regression line minimizes the sum of the squared arrow lengths (i.e., the least squares) over all of the points. As you look at these points, you might think that a different line could fit the data better based on this criterion. That isn't the case, though: it can be shown via calculus (omitted here) that this line minimizes the sum of the squared residuals for these 50 points.

9.3.2 The equation of the line

We can use R and the lm function to retrieve the equation of the line of best fit, shown here in red.
A simple linear regression such as this one produces two coefficients: one for the \\(y\\)-intercept and one for the slope. We can use the tidy function in the broom package to extract these coefficients from the model fit.

delay_fit <- lm(formula = arr_delay ~ dep_delay, data = alaska_flights)
tidy(delay_fit) %>% kable()

term          estimate   std.error   statistic   p.value
(Intercept)   -14.155    2.809       -5.038      0
dep_delay     1.218      0.136       8.951       0

In general, the equation of the line of best fit for a sample is \\[\\hat{y} = b_0 + b_1 x.\\] Thus, our equation is \\(\\hat{y} = -14.155 + 1.2177 \\, x.\\) It is usually preferred to write the names of the variables instead of \\(x\\) and \\(y\\): \\[\\widehat{arr\\_delay} = -14.155 + 1.2177 \\, dep\\_delay.\\] We can also extract the coefficients by using the coef function:

coef(delay_fit)
## (Intercept)   dep_delay
##     -14.155       1.218

9.3.3 Interpreting the slope

After you have determined your line of best fit, it is good practice to interpret the results to see if they make sense. The slope is the rise over the run, i.e., the change in \\(y\\) for every one-unit increase in \\(x\\). For our specific example, we can say that for every one-minute increase in the departure delay of Alaskan Airlines flights from NYC, we expect the corresponding arrival delay to be 1.22 minutes more. This estimate does make practical sense: it would be strange if arrival delays went down as departure delays increased, and we expect that the longer a flight is delayed on departure, the longer it will tend to be delayed on arrival. Remember that we are also using this sample to make a guess as to how the population of all Alaskan flights behaves with regard to departure and arrival delays, so just as with other sampling procedures, there is variability in the sample estimates of the regression line.

9.3.4 Predicting values

Getting back to our hypothetical flight that has been delayed 25 minutes, we can use the augment function in the broom package to get the fitted arrival delay value:

delay_fit %>% augment(newdata = data_frame(dep_delay = 25))
##   dep_delay .fitted .se.fit
## 1        25   16.29   3.967

Note the use of the data_frame function here, which must be used since newdata expects a data frame as its argument. We must also specify that we are plugging in 25 for the value of dep_delay. We can see that the line predicts an arrival delay of \\(-14.155 + 1.2177 \\cdot 25 \\approx 16.29\\) minutes based on our 25-minute departure delay. A predicted arrival delay smaller than the departure delay makes some sense, since delayed flights tend to make up some time in the air to compensate.

Important note: The correlation coefficient and the slope of the regression line are not the same thing. They always share the same sign (positive correlation coefficients correspond to positive slope coefficients, and the same holds true for negative values), but you can't conclude much more about one from the other. For example, say we have 3 groups of points: their regression lines can have different slopes, yet \\(r = 1\\) for all 3. In other words, all three groups of points would have a perfect positive linear relationship.

9.4 Inference for regression

The population least squares line is defined by the formula \\(y = \\beta_0 + \\beta_1 x + \\epsilon\\). Here \\(\\epsilon\\) represents the error term; it corresponds to the part of the response variable \\(y\\) that remains unexplained after considering the predictor variable \\(x\\).
Often it is standard practice to assume that this error term follows a normal distribution. We will focus on checking whether that assumption is valid in Section 9.5. In the population least squares line \\(y = \\beta_0 + \\beta_1 x + \\epsilon\\), we can see that if \\(\\beta_1 = 0\\), there is no relationship between \\(x\\) and \\(y\\): the equation reduces to \\(y = \\beta_0 + \\epsilon\\), so \\(y\\) does not depend on \\(x\\) at all. A hypothesis test is therefore frequently conducted to check whether a relationship exists between two numerical variables \\(x\\) and \\(y\\). We can also use the concept of shuffling to build the null distribution and conduct a hypothesis test for a population slope.

Let's go back to our example of Alaskan flights, which represents a sample of all Alaskan flights departing NYC in 2013, and test whether we have evidence that a positive relationship exists between the departure delay and arrival delay for Alaskan flights. We will set up this hypothesis testing process as we have done before, via the "There is Only One Test" diagram in Figure 7.1.

9.4.1 Data

Our data is stored in alaska_flights, and we are focused on the 50 measurements of dep_delay and arr_delay there.

9.4.2 Test Statistic \\(\\delta\\)

Our test statistic here is the sample slope coefficient, which we denote by \\(b_1\\).

9.4.3 Observed effect \\(\\delta^*\\)

(b1_obs <- tidy(delay_fit)$estimate[2])
## [1] 1.218

The calculated slope value from our observed sample is \\(b_1 = 1.2177\\).

9.4.4 Model of \\(H_0\\)

We are looking to see if a positive relationship exists, so \\(H_a: \\beta_1 > 0\\). Our null hypothesis is always in terms of equality, so we have \\(H_0: \\beta_1 = 0\\).

9.4.5 Simulated Data

To simulate the null hypothesis being true and recreate how our sample was generated, we need to think about what it means for \\(\\beta_1\\) to be zero. If \\(\\beta_1 = 0\\), we said above that there is no relationship between departure delay and arrival delay. If there is no relationship, then any one of the arrival delay values is just as likely to have occurred with any of the other departure delay values as with the one it actually fell with. We therefore have another example of shuffling in our simulation of data.

Tactile simulation
We could use a deck of 100 note cards to create a tactile simulation of this shuffling process. We would write the 50 different values of departure delay on 50 of the cards, one per card, and do the same for the 50 arrival delays on the other 50 cards. Next, we would lay out the 50 departure delay cards and thoroughly shuffle the arrival delay deck. We would then deal the shuffled arrival delay cards, one onto each departure delay card, enter these new pairings of values, and compute a sample slope based on this shuffling. We could repeat this process many times, keeping track of the sample slope after each shuffle.

9.4.6 Distribution of \\(\\delta\\) under \\(H_0\\)

We can build our randomization distribution in much the same way we did before, using the do and shuffle functions. Here we will take advantage of the coef function we saw earlier to extract the slope and intercept coefficients. (Our focus will be on the slope here, though.)
```{r}
rand_distn <- mosaic::do(10000) *
  (lm(formula = shuffle(arr_delay) ~ dep_delay, data = alaska_flights) %>%
     coef())
names(rand_distn)
## [1] "Intercept" "dep_delay"
```

We see that the names of our columns are `Intercept` and `dep_delay`. We want to look at `dep_delay`, since that column holds the slope coefficients.

```{r}
ggplot(data = rand_distn, mapping = aes(x = dep_delay)) +
  geom_histogram(color = "white", bins = 20)
```

### 9.4.7 The p-value

Recall that we want to see where our observed sample slope \(\delta^* = 1.2177\) falls on this distribution and then count all of the values to the right of it, corresponding to \(H_a: \beta_1 > 0\). To get a sense for where our value falls, we can shade all values at least as big as \(\delta^*\).

```{r}
ggplot(data = rand_distn, aes(x = dep_delay, fill = (dep_delay >= b1_obs))) +
  geom_histogram(color = "white", bins = 20)
```

Figure 9.4: Shaded histogram to show p-value

Since 1.2177 falls far to the right of all the values in this plot, we can say that we have a \(p\)-value of 0. We thus have evidence to reject the null hypothesis in support of there being a positive association between the departure delay and arrival delay of all Alaska flights from NYC in 2013.

**_Learning check_**

**(LC9.7)** Repeat the inference above, but this time for the correlation coefficient instead of the slope.

**(LC9.8)** Use bootstrapping (of points) to determine a range of possible values for the population slope comparing departure delays to arrival delays for Alaska flights in 2013 from NYC.

## 9.5 Residual analysis

The following diagram will help you to keep track of what is meant by a residual. Here, \(y_i\) is an observed value of the `arr_delay` variable, with \(i\) ranging from 1 to 50; in the diagram, it is the vertical position of the blue dot. \(\hat{y}_i\) is the fitted value: the `arr_delay` value that is being pointed to on the red line. The residual is \[\hat{\epsilon}_i = y_i - \hat{y}_i.\] Note the order here! You start at the non-pointy end of the arrow (\(y_i\)) and then subtract away what comes at the point (\(\hat{y}_i\)).

## 9.6 Conditions for regression

In order for regression to be valid, we have three conditions to check:

- Equal variances across the explanatory variable (check the residual plot for fan-shaped patterns)
- Independent observations, errors, and predictor variables (check the residual plot for time series-like patterns)
- Nearly normal residuals (check a quantile-quantile plot of the standardized residuals)

As you can see from the checks listed after each condition, residuals play a large role in determining whether the conditions are met. Residuals are estimates of the error term \(\epsilon\) we discussed earlier, and this is a big reason why they play such an important role in validating regression assumptions.

**Residual plot**

To construct a residual plot, we will analyze output from the `augment` function in `broom`. Specifically, we are interested in the `.fitted` and `.resid` variables there:

```{r}
fits <- augment(delay_fit)
ggplot(data = fits, mapping = aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 0, color = "blue")
```

**Quantile-quantile plot**

```{r}
ggplot(data = fits, mapping = aes(sample = .resid)) +
  stat_qq()
```

**Checking conditions:**

We are looking to see if the points are scattered about the blue line at 0 relatively evenly as we look from left to right. We have some reason for concern here, as the large lump of values on the left is much more dispersed than those on the right; a rough numeric check of this is sketched below.
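This snippet is not in the original text; as a rough numeric check of the equal-variance condition (splitting at the median of `.fitted` is an arbitrary choice for illustration), we can compare the spread of the residuals on the left and right halves of the residual plot:

```{r eval=FALSE}
library(dplyr)
fits %>%
  mutate(half = if_else(.fitted <= median(.fitted), "left", "right")) %>%
  group_by(half) %>%
  summarize(resid_sd = sd(.resid))
# A much larger resid_sd for one half than the other echoes the
# fan-shaped pattern we are worried about in the plot
```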
The second condition is invalidated if there is a clear pattern of the residuals cycling up and down across the residual plot, as in a time series. That is not the case here. We look at the quantile-quantile plot (Q-Q plot for short) for the third condition. We are looking to see whether the residuals fall along a straight line, matching what we would expect if they were normally distributed. We see some curvature here as well. With both condition 1 and condition 3 in question, we have reason to doubt whether a linear regression is valid here. Unfortunately, all too frequently regressions are run without checking these conditions carefully. While small deviations from the conditions can be OK, larger violations can completely invalidate the results and make any inferences unreliable and questionable.

## 9.7 Conclusion

### 9.7.1 Script of R code

An R script file of all R code used in this chapter is available here.

### 9.7.2 What's to come?

In the last chapter of the textbook, we'll summarize the purpose of this book and present an excellent example of what goes into making an effective story via data.
["10-effective-data-storytelling.html", "10 Effective Data Storytelling Concluding Remarks", " 10 Effective Data Storytelling As we’ve progressed throughout this book, you’ve seen how to work with data in a variety of ways. You’ve learned effective strategies for plotting data by understanding which types of plots work best for which combinations of variable types. You’ve summarized data in table form and calculated summary statistics for a variety of different variables. Further, you’ve seen the value of inference as a process to come to conclusions about a population by using a random sample. Lastly, you’ve explored how to use linear regression and the importance of checking the conditions required to make it a valid procedure. All throughout, you’ve learned many computational techniques and focused on reproducible research in writing R code and keeping track of your work in R Markdown. All of these steps go into making a great story using data. As the textbook comes to a close, we thought it best that you explore what stellar work is being produced by data journalists throughout the world that specialize in effective data storytelling. We recommend you read and analyze this article by Walt Hickey entitled The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women. As you read over it, think carefully about how Walt is using his data, his graphics, and his analyses to paint the picture for the reader of what the story is he wants to tell. In the spirit of reproducibility, the members of FiveThirtyEight have also shared the data that they used to create this story and some R code here. A vignette showing how to reproduce one of the plots at the end of the article using dplyr, ggplot2, and other packages in Hadley’s tidyverse is available here. Great data stories don’t mislead the reader, but rather engulf them in understanding the importance that data plays in our lives through the captivation of storytelling. Concluding Remarks If you’ve come to this point in the book, I’d suspect that you know a thing or two about how to work with data in R. You’ve also gained a lot of knowledge about how to use simulation techniques to determine statistical significance and how these techniques build an intuition about traditional inferential methods like the \\(t\\)-test. The hope is that you’ve come to appreciate data manipulation, tidy data sets, and the power of data visualization. Actually, the data visualization part may be the most important thing here. If you can create truly beautiful graphics that display information in ways that the reader can clearly decipher, you’ve picked up a great skill. Let’s hope that that skill keeps you creating great stories with data into the near and far distant future. Thanks for coming along for the ride as we dove into modern data analysis using R! "],
["A-appendixA.html", "A Statistical Background A.1 Basic statistical terms", " A Statistical Background A.1 Basic statistical terms A.1.1 Mean The mean is the most commonly reported measure of center. It is commonly called the “average” though this term can be a little ambiguous. The mean is the sum of all of the data elements divided by how many elements there are. If we have \\(n\\) data points, the mean is given by: \\[Mean = \\frac{x_1 + x_2 + \\cdots + x_n}{n}\\] A.1.2 Median The median is calculated by first sorting a variable’s data from smallest to largest. After sorting the data, the middle element in the list is the median. If the middle falls between two values, then the median is the mean of those two values. A.1.3 Standard deviation We will next discuss the standard deviation of a sample data set pertaining to one variable. The formula can be a little intimidating at first but it is important to remember that it is essentially a measure of how far to expect a given data value is from its mean: \\[Standard \\, deviation = \\sqrt{\\frac{(x_1 - Mean)^2 + (x_2 - Mean)^2 + \\cdots + (x_n - Mean)^2}{n - 1}}\\] A.1.4 Five-number summary The five-number summary consists of five values: minimum, first quantile (25th percentile), median (50th percentile), third quantile (75th) quantile, and maximum. The quantiles are calculated as first quantile (\\(Q_1\\)): the median of the first half of the sorted data third quantile (\\(Q_3\\)): the median of the second half of the sorted data The interquartile range is defined as \\(Q_3 - Q_1\\) and is a measure of how spread out the middle 50% of values is. The five-number summary is not influenced by the presence of outliers in the ways that the mean and standard deviation are. It is, thus, recommended for skewed data sets. A.1.5 Distribution The distribution of a variable/data set corresponds to generalizing patterns in the data set. It often shows how frequently elements in the data set appear. It shows how the data varies and gives some information about where a typical element in the data might fall. Distributions are most easily seen through data visualization. A.1.6 Outliers Outliers correspond to values in the data set that fall far outside the range of “ordinary” values. In regards to a boxplot (by default), they correspond to values below \\(Q_1 - (1.5 * IQR)\\) or above \\(Q_3 + (1.5 * IQR)\\). Note that these terms (aside from Distribution) only apply to quantitative variables. "],
diff --git a/index.Rmd b/index.Rmd
index 5634829ac..7b745467a 100755
--- a/index.Rmd
+++ b/index.Rmd
@@ -55,7 +55,7 @@ These are some principles we keep in mind. If you agree with them, this might be
* This book is in beta testing and is currently at Version `r version`. If you
would like to receive periodic updates on this book and other similar projects,
-please fill out this [Google Form](https://goo.gl/forms/IxiwBeEnk72NxMMx2).
+please sign up [here](http://eepurl.com/cBkItf).
* The source code for this book is available for download/forking on [GitHub](https://github.com/ismayc/moderndiver-book). If you click on the **release** link near the top of the page there, you can download all of the source code for whichever release version you'd like to work with and use. If you find typos or other errors or have suggestions on how to better word something in the book, please create a pull request too! We also welcome issue creation. Let's all work together to make this book as great as possible for as many students and instructors as possible.
* Please feel free to modify the book as you wish for your own needs! All we ask is that you
list the authors field above as "Chester Ismay, Albert Y. Kim, and YOU!"
@@ -121,4 +121,4 @@ knitr::kable(devtools::session_info(needed_pkgs)$packages,
longtable = TRUE)
```
-Book was last updated `r paste("by", Sys.info()[["user"]], "on", format(Sys.time(), "%A, %B %d, %Y %X %Z"))`.
\ No newline at end of file
+Book was last updated `r paste("by", Sys.info()[["user"]], "on", format(Sys.time(), "%A, %B %d, %Y %X %Z"))`.