diff --git a/03-tidy.Rmd b/03-tidy.Rmd index 2e634679c..4cbf4600d 100755 --- a/03-tidy.Rmd +++ b/03-tidy.Rmd @@ -146,7 +146,7 @@ Note that if you look in the leftmost column of the `View(flights)` output, you - specify the variables, and - give the types of variables you are presented with. -The `glimpse()` command in the `dplyr` package provides us with much of the above information and more: +The `glimpse()` command in the `tibble` package provides us with much of the above information and more: ```{r} glimpse(flights) @@ -170,10 +170,10 @@ glimpse(flights) We see that `glimpse` will give you the first few entries of each variable in a row after the variable. In addition, the type of the variable is given immediately after each variable's name inside `< >`. Here, `int` and `num` refer to quantitative variables. In contrast, `chr` refers to categorical variables. One more type of variable is given here with the `time_hour` variable: **dttm**. As you may suspect, this variable corresponds to a specific date and time of day. -Another nice feature of R is the help system. You can get help in R by simply entering a question mark before the name of a function or an object and you will be presented with a page showing the documentation. Note that this output help file is omitted here but can be accessed [here](https://cran.r-project.org/web/packages/nycflights13/nycflights13.pdf) on page 3 of the PDF document. +Another nice feature of R is the help system. You can get help in R by simply entering a question mark before the name of a function or an object and you will be presented with a page showing the documentation. Since `glimpse` is a function defined in the `tibble` package, you can further emphasize that you'd like to look at the help for that specific `glimpse` function by adding the two columns between the package name and the function. 
Note that these output help files is omitted here but the `flights` help can be accessed [here](https://cran.r-project.org/web/packages/nycflights13/nycflights13.pdf) on page 3 of the PDF document. ```{r eval=FALSE} -?glimpse +?tibble::glimpse ?flights ``` diff --git a/04-viz.Rmd b/04-viz.Rmd index d7b60a179..5402be759 100755 --- a/04-viz.Rmd +++ b/04-viz.Rmd @@ -1,6 +1,6 @@ # Data Visualization via ggplot2 {#viz} -```{r setup_viz, include=FALSE} +```{r setup-viz, include=FALSE, purl=FALSE} chap <- 4 lc <- 0 rq <- 0 @@ -52,7 +52,7 @@ Specifically, we can break a graphic into the following three essential componen In 1812, Napoleon led a French invasion of Russia, marching on Moscow. It was one of the biggest military disasters due in large part to the Russian winter. In 1869, a French civil engineer named Charles Joseph Minard published arguably one of the greatest statistical visualizations of all-time, which summarized this march: -```{r minard, echo=FALSE, fig.cap="Minard's Visualization of Napolean's March"} +```{r minard, echo=FALSE, fig.cap="Minard's Visualization of Napolean's March", purl=FALSE} knitr::include_graphics("images/Minard.png") ``` @@ -163,7 +163,7 @@ This code snippet makes use of functions in the `dplyr` package for data manipul *** -```{block lc-all_alaska_flights, type='learncheck'} +```{block lc-all_alaska_flights, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -198,7 +198,7 @@ In Figure \@ref(fig:noalpha) we see that a positive relationship exists between *** -```{block lc-scatter-plots, type='learncheck'} +```{block lc-scatter-plots, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -242,7 +242,7 @@ Note how this function call is identical to the one in Section \@ref(geompoint), *** -```{block lc-overplotting, type='learncheck'} +```{block lc-overplotting, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -287,7 +287,7 @@ This is similar to the previous use of the `filter` command in Section \@ref(sca *** 
-```{block lc-early_january_weather, type='learncheck'} +```{block lc-early_january_weather, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -318,7 +318,7 @@ Much as with the `ggplot()` call in Section \@ref(geompoint), we specify the com *** -```{block lc-line-graph, type='learncheck'} +```{block lc-line-graph, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -394,7 +394,7 @@ ggplot(data = weather, mapping = aes(x = temp)) + *** -```{block lc-histogram, type='learncheck'} +```{block lc-histogram, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -431,7 +431,7 @@ As we might expect, the temperature tends to increase as summer approaches and t *** -```{block lc-facet, type='learncheck'} +```{block lc-facet, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -472,7 +472,7 @@ We have introduced a new function called `factor()` here. One of the things thi *** -```{block lc-boxplot, type='learncheck'} +```{block lc-boxplot, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -525,7 +525,7 @@ knitr::kable(flights_table) *** -```{block lc-barplot, type='learncheck'} +```{block lc-barplot, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -539,8 +539,6 @@ knitr::kable(flights_table) *** - - ### Must avoid pie charts! Unfortunately, one of the most common plots seen today for categorical data is the pie chart. While they may see harmless enough, they actually present a problem in that humans are unable to judge angles well. As Naomi Robbins describes in her book "Creating More Effective Graphs" [@robbins2013], we overestimate angles greater than 90 degrees and we underestimate angles less than 90 degrees. In other words, it is difficult for us to determine relative size of one piece of the pie compared to another. 
@@ -570,13 +568,13 @@ While it is quite easy to look back at the barplot to get the answer to these qu [fd]: https://flowingdata.com/2008/09/19/pie-i-have-eaten-and-pie-i-have-not-eaten/ "Pie I Have Eaten and Pie I Have Not Eaten" -```{r echo=FALSE, fig.align='center', fig.cap="The only good pie chart", out.height=if(knitr:::is_latex_output()) '2.5in'} +```{r echo=FALSE, fig.align='center', fig.cap="The only good pie chart", out.height=if(knitr:::is_latex_output()) '2.5in', purl=FALSE} knitr::include_graphics("images/Pie-I-have-Eaten.jpg") ``` *** -```{block lc-pie-charts, type='learncheck'} +```{block lc-pie-charts, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -610,7 +608,7 @@ This plot is what is known as a **stacked barplot**. While simple to make, it o *** -```{block lc-barplot-two-var, type='learncheck'} +```{block lc-barplot-two-var, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -629,7 +627,7 @@ ggplot(data = flights_namedports, mapping = aes(x = carrier, fill = name)) + *** -```{block lc-barplot-stacked, type='learncheck'} +```{block lc-barplot-stacked, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -653,7 +651,7 @@ Note how the `facet_grid` function arguments are written here. 
We are wanting t *** -```{block lc-barplot-facet, type='learncheck'} +```{block lc-barplot-facet, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -699,7 +697,8 @@ In addition, we've created a mind map to help you remember which types of plots ### Script of R code -```{r include=FALSE, eval=FALSE} +```{r include=FALSE, eval=FALSE, purl=FALSE} +dir.create("docs/scripts") knitr::purl("04-viz.Rmd", "docs/scripts/04-viz.R") ``` diff --git a/05-manip.Rmd b/05-manip.Rmd index 52fba75a7..eb314fd24 100755 --- a/05-manip.Rmd +++ b/05-manip.Rmd @@ -5,7 +5,7 @@ material here relates to answering those questions --> -```{r setup_manip, include=FALSE} +```{r setup_manip, include=FALSE, purl=FALSE} chap <- 5 lc <- 0 rq <- 0 @@ -40,10 +40,6 @@ library(nycflights13) library(knitr) ``` - - - - ## The pipe `%>%` @@ -56,10 +52,6 @@ Before we introduce the five main verbs, we first introduce the the pipe operato The piping syntax will be our major focus throughout the rest of this book and you'll find that you'll quickly be addicted to the chaining with some practice. If you'd like to see more examples on using `dplyr`, the 5MV (in addition to some other `dplyr` verbs), and `%>%` with the `nycflights13` data set, you can check out Chapter 5 of Hadley and Garrett's book [@rds2016]. 
- - - - ## Five Main Verbs - The 5MV @@ -78,7 +70,7 @@ Just as we had the 5NG (The Five Named Graphs in Chapter \@ref(viz) using `ggplo ### 5MV#1: Filter observations using filter {#filter} -```{r filter, echo=FALSE, fig.cap="Filter diagram from Data Wrangling with dplyr and tidyr cheatsheet"} +```{r filter, echo=FALSE, fig.cap="Filter diagram from Data Wrangling with dplyr and tidyr cheatsheet", purl=FALSE} knitr::include_graphics("images/filter.png") ``` @@ -143,7 +135,7 @@ As a final note we point out that `filter()` should often be the first verb you' *** -```{block lc-filter, type='learncheck'} +```{block lc-filter, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -155,11 +147,11 @@ As a final note we point out that `filter()` should often be the first verb you' ### 5MV#2: Summarize variables using summarize -```{r sum1, echo=FALSE, fig.cap="Summarize diagram from Data Wrangling with dplyr and tidyr cheatsheet"} +```{r sum1, echo=FALSE, fig.cap="Summarize diagram from Data Wrangling with dplyr and tidyr cheatsheet", purl=FALSE} knitr::include_graphics("images/summarize1.png") ``` -```{r sum2, echo=FALSE, fig.cap="Another summarize diagram from Data Wrangling with dplyr and tidyr cheatsheet"} +```{r sum2, echo=FALSE, fig.cap="Another summarize diagram from Data Wrangling with dplyr and tidyr cheatsheet", purl=FALSE} knitr::include_graphics("images/summary.png") ``` @@ -187,10 +179,6 @@ summary_temp$mean You'll often encounter issues with missing values `NA`. In fact, an entire branch of the field of statistics deals with missing data. However, it is not good practice to include a `na.rm = TRUE` in your summary commands by default; you should attempt to run them without this argument. The idea being you should at the very least be alerted to the presence of missing values and consider what the impact on the analysis might be if you ignore these values. In other words, `na.rm = TRUE` should only be used when necessary. 
- - What other summary functions can we use inside the `summarize()` verb? Any function in R that takes a vector of values and returns just one. Here are just a few: * `min()` and `max()`: the minimum and maximum values respectively @@ -201,7 +189,7 @@ What other summary functions can we use inside the `summarize()` verb? Any funct *** -```{block lc-summarize, type='learncheck'} +```{block lc-summarize, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -223,7 +211,7 @@ summary_temp <- weather %>% ### 5MV#3: Group rows using group_by -```{r groupsummarize, echo=FALSE, fig.cap="Group by and summarize diagram from Data Wrangling with dplyr and tidyr cheatsheet"} +```{r groupsummarize, echo=FALSE, fig.cap="Group by and summarize diagram from Data Wrangling with dplyr and tidyr cheatsheet", purl=FALSE} knitr::include_graphics("images/group_summary.png") ``` @@ -239,7 +227,8 @@ We believe that you will be amazed at just how simple this is. Run the following ```{r} summary_monthly_temp <- weather %>% group_by(month) %>% - summarize(mean = mean(temp, na.rm = TRUE), std_dev = sd(temp, na.rm = TRUE)) + summarize(mean = mean(temp, na.rm = TRUE), + std_dev = sd(temp, na.rm = TRUE)) summary_monthly_temp ``` @@ -289,7 +278,7 @@ View(by_monthly_origin) ### 5MV#4: Create new variables/change old variables using mutate -```{r select, echo=FALSE, fig.cap="Mutate diagram from Data Wrangling with dplyr and tidyr cheatsheet"} +```{r select, echo=FALSE, fig.cap="Mutate diagram from Data Wrangling with dplyr and tidyr cheatsheet", purl=FALSE} knitr::include_graphics("images/mutate.png") ``` @@ -339,7 +328,7 @@ flights <- flights %>% *** -```{block lc-mutate, type='learncheck'} +```{block lc-mutate, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -369,21 +358,15 @@ freq_dest You'll see that by default the values of `dest` are displayed in alphabetical order here. 
We are interested in finding those airports that appear most: ```{r} -freq_dest %>% - arrange(num_flights) +freq_dest %>% arrange(num_flights) ``` This is actually giving us the opposite of what we are looking for. It tells us the least frequent destination airports first. To switch the ordering to be descending instead of ascending we use the `desc` function: ```{r} -freq_dest %>% - arrange(desc(num_flights)) +freq_dest %>% arrange(desc(num_flights)) ``` - - - - ## Joining data frames @@ -398,7 +381,7 @@ We see that in `airports`, `carrier` is the carrier code while `name` is the ful Note that the values in the variable `carrier` in `flights` match the values in the variable `carrier` in `airlines`. In this case, we can use the variable `carrier` as a *key variable* to join/merge/match the two data frames by. Hadley and Garrett [@rds2016] created the following diagram to help us understand how the different data sets are linked: -```{r reldiagram, echo=FALSE, fig.cap="Data relationships in nycflights13 from R for Data Science"} +```{r reldiagram, echo=FALSE, fig.cap="Data relationships in nycflights13 from R for Data Science", purl=FALSE} knitr::include_graphics("images/relational-nycflights.png") ``` @@ -418,7 +401,7 @@ We observed that the `flights` and `flights_joined` are identical except that `f A visual representation of the `inner_join` is given below [@rds2016]: -```{r ijdiagram, echo=FALSE, fig.cap="Diagram of inner join from R for Data Science"} +```{r ijdiagram, echo=FALSE, fig.cap="Diagram of inner join from R for Data Science", purl=FALSE} knitr::include_graphics("images/join-inner.png") ``` @@ -470,7 +453,7 @@ In case you didn't know, `"ORD"` is the airport code of Chicago O'Hare airport a *** -```{block lc-join, type='learncheck'} +```{block lc-join, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -490,7 +473,7 @@ In case you didn't know, `"ORD"` is the airport code of Chicago O'Hare airport a ### Select variables using select {#select} 
-```{r selectfig, echo=FALSE, fig.cap="Select diagram from Data Wrangling with dplyr and tidyr cheatsheet"} +```{r selectfig, echo=FALSE, fig.cap="Select diagram from Data Wrangling with dplyr and tidyr cheatsheet", purl=FALSE} knitr::include_graphics("images/select.png") ``` @@ -606,7 +589,7 @@ View(ten_freq_dests) *** -```{block lc-other-verbs, type='learncheck'} +```{block lc-other-verbs, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -644,7 +627,7 @@ We will focus only on the `dplyr` functions in this book, but you are encouraged ### Script of R code -```{r include=FALSE, eval=FALSE} +```{r include=FALSE, eval=FALSE, purl=FALSE} knitr::purl("05-manip.Rmd", "docs/scripts/05-manip.R") ``` diff --git a/06-sim.Rmd b/06-sim.Rmd index cecfbc1b5..7661349aa 100755 --- a/06-sim.Rmd +++ b/06-sim.Rmd @@ -2,7 +2,7 @@ # Simulating Randomness via mosaic {#sim} -```{r setup_infer, include=FALSE} +```{r setup_infer, include=FALSE, purl=FALSE} chap <- 6 lc <- 0 rq <- 0 @@ -34,7 +34,7 @@ Whenever you hear the phrases "random sampling" or just "sampling" (with regards ### Tasting soup -```{r soupimg, echo=FALSE, fig.cap="A bowl of Indian chicken and vegetable soup"} +```{r soupimg, echo=FALSE, fig.cap="A bowl of Indian chicken and vegetable soup", purl=FALSE} knitr::include_graphics("images/soup.jpg") ``` @@ -91,7 +91,7 @@ A *statistic* is a calculated based on one or more variables measured in the sam *** -```{block lc6-0a, type='learncheck'} +```{block lc6-0a, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -124,7 +124,7 @@ Let's explore these terms for our tasting soup example: - How crunchy the carrots are in our spoonful of soup *** -```{block lc6-0b, type='learncheck'} +```{block lc6-0b, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -161,7 +161,7 @@ We see here that this being self-reported data has led to the data being a littl *** -```{block lc6-0c, type='learncheck'} +```{block lc6-0c, type='learncheck', purl=FALSE} **_Learning 
check_** ``` @@ -188,7 +188,8 @@ We can think of this data as representing the *population* of interest. Let's n ```{r sample-profiles} library(mosaic) set.seed(2017) -profiles_sample1 <- profiles_subset %>% resample(size = 100, replace = FALSE) +profiles_sample1 <- profiles_subset %>% + resample(size = 100, replace = FALSE) ``` The `set.seed` function is used to ensure that all users get the same random sample when they run the code above. It is a way of interfacing with the pseudo-random number generation scheme that R uses to generate "random" numbers. If that command was not run, you'd obtain a different random sample than someone else if you ran the code above for the first time. @@ -203,7 +204,7 @@ ggplot(data = profiles_sample1, mapping = aes(x = height)) + *** -```{block lc6-0d, type='learncheck'} +```{block lc6-0d, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -282,7 +283,7 @@ Note how the range of sample mean height values is much more narrow than the ori *** -```{block lc6-0e, type='learncheck'} +```{block lc6-0e, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -366,7 +367,7 @@ It's amazing that there is no actual evidence that such an event actually took p We need to think about this problem from the standpoint of hypothesis testing. First, we'll need to identify some important parts of a hypothesis test before we proceed with the analysis. *** -```{block lc6-1, type='learncheck'} +```{block lc6-1, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -449,7 +450,7 @@ In this chapter, we've discussed three functions in the `mosaic` package useful - `shuffle`: Its main purpose is to permute the values of one variable across the values of another variable. This acts in much the same way as shuffling a deck of cards and then presenting the shuffled deck to two (or more) players. 
*** -```{block lc-mosaic, type='learncheck'} +```{block lc-mosaic, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -463,7 +464,7 @@ In this chapter, we've discussed three functions in the `mosaic` package useful ### Script of R code -```{r include=FALSE, eval=FALSE} +```{r include=FALSE, eval=FALSE, purl=FALSE} knitr::purl("06-sim.Rmd", "docs/scripts/06-sim.R") ``` diff --git a/07-hypo.Rmd b/07-hypo.Rmd index e76708b83..1e9bcff79 100755 --- a/07-hypo.Rmd +++ b/07-hypo.Rmd @@ -1,6 +1,6 @@ # Hypothesis Testing {#hypo} -```{r setup_hypo, include=FALSE} +```{r setup_hypo, include=FALSE, purl=FALSE} chap <- 7 lc <- 0 rq <- 0 @@ -49,7 +49,7 @@ kable(bos_sfo_summary) Looking at these results, we can clearly see that SFO `air_time` is much larger than BOS `air_time`. The standard deviation is also extremely informative here. *** -```{block lc6-2b, type='learncheck'} +```{block lc6-2b, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -81,7 +81,7 @@ Hypothesis testing brings about many weird and incorrect notions in the scientif You'll see that we don't need to rely on these complicated series of assumptions and procedures to conduct a hypothesis test any longer. These methods were introduced in a time when computers weren't powerful. Your cellphone (in 2016) has more power than the computers that sent NASA astronauts to the moon after all. 
We'll see that ALL hypothesis tests can be broken down into the following framework given by Allen Downey [here](http://allendowney.blogspot.com/2016/06/there-is-still-only-one-test.html): -```{r htdowney, echo=FALSE, fig.cap="Hypothesis Testing Framework"} +```{r htdowney, echo=FALSE, fig.cap="Hypothesis Testing Framework", purl=FALSE} knitr::include_graphics("images/ht.png") ``` @@ -142,7 +142,7 @@ The risk of error is the price researchers pay for basing an inference about a p To help understand the concepts of Type I error and Type II error, observe the following table: -```{r, echo=FALSE, fig.cap="Type I and Type II errors"} +```{r, echo=FALSE, fig.cap="Type I and Type II errors", purl=FALSE} knitr::include_graphics("images/errors.png") ``` @@ -164,7 +164,7 @@ So if we can set $\alpha$ to be whatever we want, why choose 0.05 instead of 0.0 *** -```{block lc7-0, type='learncheck'} +```{block lc7-0, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -182,7 +182,7 @@ So if we can set $\alpha$ to be whatever we want, why choose 0.05 instead of 0.0 The idea that sample results are more extreme than we would reasonably expect to see by random chance if the null hypothesis were true is the fundamental idea behind statistical hypothesis tests. If data at least as extreme would be very unlikely if the null hypothesis were true, we say the data are **statistically significant**. Statistically significant data provide convincing evidence against the null hypothesis in favor of the alternative, and allow us to generalize our sample results to the claim about the population. 
-```{block lc7-1, type='learncheck'} +```{block lc7-1, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -199,7 +199,7 @@ The idea that sample results are more extreme than we would reasonably expect to Recall the "There is Only One Test" diagram from earlier: -```{r htdowney2, echo=FALSE, fig.cap="Hypothesis Testing Framework"} +```{r htdowney2, echo=FALSE, fig.cap="Hypothesis Testing Framework", purl=FALSE} knitr::include_graphics("images/ht.png") ``` @@ -312,7 +312,7 @@ library(ggplot2) This helps us better see just how few of the values of `heads` are at our observed value or more extreme. This idea of a $p$-value can be extended to the more traditional methods using normal and $t$ distributions in the traditional way that introductory statistics has been presented. These traditional methods were used because statisticians haven't always been able to do 10,000 simulations on the computer within seconds. We'll elaborate on this more in a few sections. *** -```{block lc6-2, type='learncheck'} +```{block lc6-2, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -367,7 +367,7 @@ movies_trimmed <- movies_trimmed %>% We are left with `r nrow(movies_trimmed)` movies in our _population_ data set that focuses on only `"Action"` and `"Romance"` movies. *** -```{block lc7-2, type='learncheck'} +```{block lc7-2, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -402,7 +402,7 @@ We can use hypothesis testing to investigate ways to determine, for example, whe We are interested here in seeing how we can use a random sample of action movies and a random sample of romance movies from `movies` to determine if a statistical difference exists in the mean ratings of each group. 
*** -```{block lc7-3a, type='learncheck'} +```{block lc7-3a, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -439,7 +439,7 @@ ggplot(data = movies_genre_sample, mapping = aes(x = rating)) + ``` *** -```{block lc7-3b1, type='learncheck'} +```{block lc7-3b1, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -459,7 +459,7 @@ summary_ratings %>% kable() ``` *** -```{block lc7-3b2, type='learncheck'} +```{block lc7-3b2, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -470,7 +470,7 @@ summary_ratings %>% kable() We see that the sample mean rating for romance movies, $\bar{x}_{r}$, is greater than the similar measure for action movies, $\bar{x}_a$. But is it statistically significantly greater (thus, leading us to conclude that the means are statistically different)? The standard deviation can provide some insight here but with these standard deviations being so similar it's still hard to say for sure. *** -```{block lc7-3b3, type='learncheck'} +```{block lc7-3b3, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -519,7 +519,7 @@ diff(shuffled_ratings$mean) ``` *** -```{block lc7-3b4, type='learncheck'} +```{block lc7-3b4, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -580,7 +580,7 @@ Based on this plot, we have no values as extreme or more extreme than our observ *** -```{block lc7-3b, type='learncheck'} +```{block lc7-3b, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -752,7 +752,7 @@ Since all three conditions are met, we can be reasonably certain that the theory ### Script of R code -```{r include=FALSE, eval=FALSE} +```{r include=FALSE, eval=FALSE, purl=FALSE} knitr::purl("07-hypo.Rmd", "docs/scripts/07-hypo.R") ``` diff --git a/08-ci.Rmd b/08-ci.Rmd index 3ec3547b9..f50c78e22 100755 --- a/08-ci.Rmd +++ b/08-ci.Rmd @@ -1,7 +1,7 @@ # Confidence Intervals {#ci} -```{r setup_ci, include=FALSE} +```{r setup_ci, include=FALSE, purl=FALSE} chap <- 8 lc <- 0 rq <- 0 @@ -94,7 +94,8 @@ movies_sample %>% 
ggplot(aes(x = rating)) + Remember that we can think of this histogram as an estimate of our population distribution histogram that we saw above. We are interested in the population mean rating and trying to find a range of plausible values for that value. A good start in guessing the population mean is to use the mean of our sample `rating` from the `movies_sample` data: ```{r} -(movies_sample_mean <- movies_sample %>% summarize(mean = mean(rating))) +(movies_sample_mean <- movies_sample %>% + summarize(mean = mean(rating))) ``` Note the use of the `( )` at the beginning and the end of this creation of the `movies_sample_mean` object. If you'd like to print out your newly created object, you can enclose it in the parentheses as we have here. @@ -114,7 +115,7 @@ You may be asking yourself what does this mean and how does this lead us to crea *** -```{block lc6-3, type='learncheck'} +```{block lc6-3, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -168,7 +169,7 @@ This statement may seem a little confusing to you. Another way to think about t To further reiterate this point, the graphic below from @isrs2014 shows us that if we repeated a confidence interval process 25 times with 25 different samples, we would expect about 95% of them to actually contain the population parameter of interest. This parameter is marked with a dotted vertical line. We can see that only one confidence interval does not overlap with this value. (The one marked in red.) Therefore 24 in 25 (96%), which is quite close to our 95% reliability, do include the population parameter. 
-```{r ci-coverage, echo=FALSE, fig.cap="Confidence interval coverage plot from OpenIntro"} +```{r ci-coverage, echo=FALSE, fig.cap="Confidence interval coverage plot from OpenIntro", purl=FALSE} knitr::include_graphics("images/cis.png") ``` @@ -201,7 +202,7 @@ To compute this type of confidence interval, we only need to make a slight modif *** -```{block lc7-4, type='learncheck'} +```{block lc7-4, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -225,7 +226,7 @@ We can summarize the process to generate a bootstrap distribution here in a seri Visually, we can represent this process in the following diagram. -```{r bootstrapimg, echo=FALSE, fig.cap="Bootstrapping diagram from Lock5 textbook"} +```{r bootstrapimg, echo=FALSE, fig.cap="Bootstrapping diagram from Lock5 textbook", purl=FALSE} knitr::include_graphics("images/bootstrap.png") ``` @@ -269,7 +270,7 @@ It's worthy of mention here that confidence intervals are always centered at the *** -```{block lc8-1, type='learncheck'} +```{block lc8-1, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -309,7 +310,7 @@ Describe how $s / \sqrt{n}$ does in approximating the standard error for these t ### Script of R code -```{r include=FALSE, eval=FALSE} +```{r include=FALSE, eval=FALSE, purl=FALSE} knitr::purl("08-ci.Rmd", "docs/scripts/08-ci.R") ``` diff --git a/09-regress.Rmd b/09-regress.Rmd index 7dd37ae02..9ead2b6f9 100755 --- a/09-regress.Rmd +++ b/09-regress.Rmd @@ -2,7 +2,7 @@ One of the most commonly used statistical procedures is *regression*. Regression, in its simplest form, focuses on trying to predict values of one numerical variable based on the values of another numerical variable using a straight line fit to data. We saw in Chapters \@ref(hypo) and \@ref(ci) an example of analyses using a categorical predictor (movie genre--action or romance) and a numerical response (movie rating). 
In this chapter, we will focus on going back to the `flights` data frame in the `nycflights13` package to look at the relationship between departure delay and arrival delay. We will also discuss the concept of *correlation* and how it is frequently incorrectly implied to also lead to *causation*. This chapter also introduces the `broom` package, which is a useful tool in summarizing the results of model fits in tidy format. You will see examples of the `tidy`, `glance`, and `augment` functions with linear regression. -```{r setup_reg, include=FALSE} +```{r setup_reg, include=FALSE, purl=FALSE} chap <- 9 lc <- 0 rq <- 0 @@ -54,7 +54,7 @@ ggplot(data = alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) + *** -```{block lc9-1, type='learncheck'} +```{block lc9-1, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -86,7 +86,7 @@ It is always between -1 and 1, inclusive, where *** -```{block lc9-2, type='learncheck'} +```{block lc9-2, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -140,7 +140,7 @@ The sample correlation coefficient is denoted by $r$. In this case, $r = `r cor( *** -```{block lc9-3, type='learncheck'} +```{block lc9-3, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -168,7 +168,7 @@ Be careful as you read studies to make sure that the writers aren't falling into *** -```{block lc9-4, type='learncheck'} +```{block lc9-4, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -389,7 +389,7 @@ Since `r b1_obs` falls far to the right of this plot, we can say that we have a *** -```{block lc9-5, type='learncheck'} +```{block lc9-5, type='learncheck', purl=FALSE} **_Learning check_** ``` @@ -459,7 +459,7 @@ We have reason to doubt whether a linear regression is valid here. 
Unfortunatel ### Script of R code -```{r include=FALSE, eval=FALSE} +```{r include=FALSE, eval=FALSE, purl=FALSE} knitr::purl("09-regress.Rmd", "docs/scripts/09-regress.R") ``` diff --git a/NEWS.md b/NEWS.md index 86cd55ec4..b40b5c53b 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,8 +1,11 @@ -# ModernDive 0.1.1.9000 +# ModernDive 0.1.2.9000 + +# ModernDive 0.1.2 * Converted last updated in index.Rmd to inline instead of R chunk * Fixed edit link to point to moderndive-book GitHub repo instead of moderndive-source repo * Fixed broken links to script files at the end of Chapters 4-9 +* Added `purl=FALSE` to chunks that do not contain useful code to the reader * Attempting to fix Shiny app in Figure 6.2 appearing as white box in published site noted [here](https://github.com/ismayc/moderndiver-book/issues/2) # ModernDive 0.1.1 diff --git a/_bookdown.yml b/_bookdown.yml index b0dfe28ab..cb72ecc77 100755 --- a/_bookdown.yml +++ b/_bookdown.yml @@ -1,3 +1,3 @@ book_filename: "ismaykim" -output_dir: "docs-devel" +output_dir: "docs" #chapter_name: "Chapter " diff --git a/docs/10-effective-data-storytelling.html b/docs/10-effective-data-storytelling.html index 36a2f1fc0..b7b51e0cd 100644 --- a/docs/10-effective-data-storytelling.html +++ b/docs/10-effective-data-storytelling.html @@ -26,7 +26,7 @@ - + @@ -371,7 +371,7 @@
nycflig
- specify the variables, and
- give the types of variables you are presented with.
The glimpse()
command in the dplyr
package provides us with much of the above information and more:
The glimpse()
command in the tibble
package provides us with much of the above information and more:
glimpse(flights)
## Observations: 336,776
## Variables: 19
@@ -568,8 +568,8 @@ 3.2 Datasets in the nycflig
(LC3.7) How many different rows are in this dataset?
We see that glimpse
will give you the first few entries of each variable in a row after the variable. In addition, the type of the variable is given immediately after each variable’s name inside < >
. Here, int
and num
refer to quantitative variables. In contrast, chr
refers to categorical variables. One more type of variable is given here with the time_hour
variable: dttm. As you may suspect, this variable corresponds to a specific date and time of day.
-Another nice feature of R is the help system. You can get help in R by simply entering a question mark before the name of a function or an object and you will be presented with a page showing the documentation. Note that this output help file is omitted here but can be accessed here on page 3 of the PDF document.
-?glimpse
+Another nice feature of R is the help system. You can get help in R by simply entering a question mark before the name of a function or an object and you will be presented with a page showing the documentation. Since glimpse is a function defined in the tibble package, you can further emphasize that you’d like to look at the help for that specific glimpse function by adding two colons between the package name and the function. Note that these output help files are omitted here but the flights help can be accessed here on page 3 of the PDF document.
+?tibble::glimpse
?flights
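The package-qualified lookup described above can be sketched with base R alone. This is a minimal illustration, not part of the book's code: base::mean stands in for tibble::glimpse (an assumption made so the snippet runs without extra packages), and the variable name mean_help is hypothetical. Entering ?pkg::fun at the console is shorthand for calling help() with a package argument.

```r
# `?base::mean` at the console is equivalent to this explicit call.
# base::mean is only a stand-in here for tibble::glimpse.
mean_help <- help("mean", package = "base")

# The lookup returns an object; printing it is what opens the help page.
class(mean_help)
```

Storing the result instead of printing it keeps the example non-interactive; typing mean_help at the console would display the documentation as usual.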
Another aspect of tidy data is a description of what each variable in the dataset represents. This helps others to understand what your variable names mean and what they correspond to. If we look at the output of ?flights, we can see that a description of each variable by name is given.
An important feature to ALWAYS include with your data is the appropriate units of measurement. We’ll see this further when we work with the dep_delay variable in Chapter 4. (It’s in minutes, but you’d get some really strange interpretations if you thought it was in hours or seconds. UNITS MATTER!)
@@ -845,7 +845,7 @@ References
"size": 2
},
"edit": {
-"link": "https://github.com/ismayc/moderndiver-source/edit/master/03-tidy.Rmd",
+"link": "https://github.com/ismayc/moderndiver-book/edit/master/03-tidy.Rmd",
"text": "Edit"
},
"download": ["ismaykim.pdf"],
diff --git a/docs/4-viz.html b/docs/4-viz.html
index a5328b2e3..d63ba927c 100644
--- a/docs/4-viz.html
+++ b/docs/4-viz.html
@@ -26,7 +26,7 @@
-
+
@@ -371,7 +371,7 @@
- B.6.5 Comparing results
-- C Reach for the Starts
+- C Reach for the Stars
- Needed packages
- C.1 Sorted barplots
- C.2 Interactive graphics
@@ -1173,7 +1173,7 @@ 4.9.1 Resources
4.9.2 Script of R code
-An R script file of all R code used in this chapter is available here.
+An R script file of all R code used in this chapter is available here.
4.9.3 What’s to come?
@@ -1228,7 +1228,7 @@ References
"size": 2
},
"edit": {
-"link": "https://github.com/ismayc/moderndiver-source/edit/master/04-viz.Rmd",
+"link": "https://github.com/ismayc/moderndiver-book/edit/master/04-viz.Rmd",
"text": "Edit"
},
"download": ["ismaykim.pdf"],
diff --git a/docs/5-manip.html b/docs/5-manip.html
index 0d3e2a542..e2479a0cf 100644
--- a/docs/5-manip.html
+++ b/docs/5-manip.html
@@ -26,7 +26,7 @@
-
+
@@ -371,7 +371,7 @@
- B.6.5 Comparing results
-- C Reach for the Starts
+- C Reach for the Stars
- Needed packages
- C.1 Sorted barplots
- C.2 Interactive graphics
@@ -531,9 +531,6 @@ 5.2.2 5MV#2: Summarize variables
summary_temp$mean
## [1] 55.2
You’ll often encounter issues with missing values NA. In fact, an entire branch of the field of statistics deals with missing data. However, it is not good practice to include na.rm = TRUE in your summary commands by default; you should first attempt to run them without this argument. The idea is that you should at the very least be alerted to the presence of missing values and consider what the impact on the analysis might be if you ignore them. In other words, na.rm = TRUE should only be used when necessary.
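The advice above can be illustrated with a small base R sketch. The temps vector is hypothetical data invented for this example; it is not taken from the weather data frame.

```r
# Hypothetical temperature readings with one missing value.
temps <- c(55.2, NA, 61.0)

mean(temps)                # NA: the missing value propagates to the result
anyNA(temps)               # TRUE: confirms that a missing value is present
sum(is.na(temps))          # 1: how many values are missing
mean(temps, na.rm = TRUE)  # mean of the two observed values only
```

Running the summary without na.rm first is what surfaces the NA; only after checking how many values are missing does it make sense to decide whether dropping them is acceptable.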
-
What other summary functions can we use inside the summarize() verb? Any function in R that takes a vector of values and returns just one. Here are just a few:
- min() and max(): the minimum and maximum values respectively
@@ -573,7 +570,8 @@ 5.2.3 5MV#3: Group rows using gro
We believe that you will be amazed at just how simple this is. Run the following code:
summary_monthly_temp <- weather %>%
group_by(month) %>%
- summarize(mean = mean(temp, na.rm = TRUE), std_dev = sd(temp, na.rm = TRUE))
+ summarize(mean = mean(temp, na.rm = TRUE),
+ std_dev = sd(temp, na.rm = TRUE))
summary_monthly_temp
## # A tibble: 12 × 3
## month mean std_dev
@@ -701,8 +699,7 @@ 5.2.5 5MV#5: Reorder the data fra
## 10 BHM 297
## # ... with 95 more rows
You’ll see that by default the values of dest are displayed in alphabetical order here. We are interested in finding those airports that appear most:
-freq_dest %>%
- arrange(num_flights)
+freq_dest %>% arrange(num_flights)
## # A tibble: 105 × 2
## dest num_flights
## <chr> <int>
@@ -718,8 +715,7 @@ 5.2.5 5MV#5: Reorder the data fra
## 10 BZN 36
## # ... with 95 more rows
This is actually giving us the opposite of what we are looking for. It tells us the least frequent destination airports first. To switch the ordering to be descending instead of ascending we use the desc function:
-freq_dest %>%
- arrange(desc(num_flights))
+freq_dest %>% arrange(desc(num_flights))
## # A tibble: 105 × 2
## dest num_flights
## <chr> <int>
@@ -898,7 +894,7 @@ 5.5.1 Resources
5.5.2 Script of R code
-An R script file of all R code used in this chapter is available here.
+An R script file of all R code used in this chapter is available here.
5.5.3 What’s to come?
@@ -954,7 +950,7 @@ References
"size": 2
},
"edit": {
-"link": "https://github.com/ismayc/moderndiver-source/edit/master/05-manip.Rmd",
+"link": "https://github.com/ismayc/moderndiver-book/edit/master/05-manip.Rmd",
"text": "Edit"
},
"download": ["ismaykim.pdf"],
diff --git a/docs/6-sim.html b/docs/6-sim.html
index 92bbaec80..1d1b30b99 100644
--- a/docs/6-sim.html
+++ b/docs/6-sim.html
@@ -26,7 +26,7 @@
-
+
@@ -371,7 +371,7 @@
- B.6.5 Comparing results
-- C Reach for the Starts
+- C Reach for the Stars
- Needed packages
- C.1 Sorted barplots
- C.2 Interactive graphics
@@ -517,7 +517,8 @@ 6.2 Visualizing sampling
We can think of this data as representing the population of interest. Let’s now take a random sample of size 100 from this population and look to see if this sample represents the overall shape of the population. In other words, we are going to use data visualization as our guide to understand the representativeness of the sample selected.
library(mosaic)
set.seed(2017)
-profiles_sample1 <- profiles_subset %>% resample(size = 100, replace = FALSE)
+profiles_sample1 <- profiles_subset %>%
+ resample(size = 100, replace = FALSE)
The set.seed function is used to ensure that all users get the same random sample when they run the code above. It is a way of interfacing with the pseudo-random number generation scheme that R uses to generate “random” numbers. Without that command, you would most likely obtain a different random sample than someone else running the same code.
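A minimal base R sketch of this reproducibility follows. It assumes base sample() as a stand-in for the resample function from mosaic, and the population vector is hypothetical rather than the profiles_subset data.

```r
# Hypothetical population of 1,000 row indices.
population <- 1:1000

set.seed(2017)
first_draw <- sample(population, size = 100, replace = FALSE)

set.seed(2017)  # resetting the seed replays the same pseudo-random stream
second_draw <- sample(population, size = 100, replace = FALSE)

identical(first_draw, second_draw)  # TRUE: same seed, same sample
anyDuplicated(first_draw)           # 0: no index is drawn twice without replacement
```

Omitting the second set.seed call would let the generator continue from where it left off, so the two draws would almost certainly differ.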
We have introduced the resample function from the mosaic package here (Pruim, Kaplan, and Horton 2016). This function can be used for sampling both with and without replacement. Here we have chosen to sample without replacement. In other words, after the first row is chosen at random from the profiles_subset data frame, it is excluded from the remaining 99 draws. Let’s now visualize the 100 values of the height variable in the profiles_sample1 data frame. To keep this visualization on the same horizontal scale as our original population presented in profiles_subset we can use the coord_cartesian function along with the c function to specify the limits on the horizontal axis.
ggplot(data = profiles_sample1, mapping = aes(x = height)) +
@@ -566,7 +567,7 @@ 6.2.1 Sampling distribution
@@ -719,7 +720,7 @@ 6.4 Review of mosaic
6.5 Conclusion
6.5.1 Script of R code
-An R script file of all R code used in this chapter is available here.
+An R script file of all R code used in this chapter is available here.
6.5.2 What’s to come?
@@ -777,7 +778,7 @@ References
"size": 2
},
"edit": {
-"link": "https://github.com/ismayc/moderndiver-source/edit/master/06-sim.Rmd",
+"link": "https://github.com/ismayc/moderndiver-book/edit/master/06-sim.Rmd",
"text": "Edit"
},
"download": ["ismaykim.pdf"],
diff --git a/docs/7-hypo.html b/docs/7-hypo.html
index 3d1fa502c..9d33070bb 100644
--- a/docs/7-hypo.html
+++ b/docs/7-hypo.html
@@ -26,7 +26,7 @@
-
+
@@ -371,7 +371,7 @@
- B.6.5 Comparing results
-- C Reach for the Starts
+- C Reach for the Stars
- Needed packages
- C.1 Sorted barplots
- C.2 Interactive graphics
@@ -1173,7 +1173,7 @@ 7.8.2 Conditions for t-test
7.9 Conclusion
7.9.1 Script of R code
-An R script file of all R code used in this chapter is available here.
+An R script file of all R code used in this chapter is available here.
7.9.2 What’s to come?
@@ -1223,7 +1223,7 @@ References
"size": 2
},
"edit": {
-"link": "https://github.com/ismayc/moderndiver-source/edit/master/07-hypo.Rmd",
+"link": "https://github.com/ismayc/moderndiver-book/edit/master/07-hypo.Rmd",
"text": "Edit"
},
"download": ["ismaykim.pdf"],
diff --git a/docs/8-ci.html b/docs/8-ci.html
index 91a1c8c2e..0eb4472a3 100644
--- a/docs/8-ci.html
+++ b/docs/8-ci.html
@@ -26,7 +26,7 @@
-
+
@@ -371,7 +371,7 @@
- B.6.5 Comparing results
-- C Reach for the Starts
+- C Reach for the Stars
- Needed packages
- C.1 Sorted barplots
- C.2 Interactive graphics
@@ -458,7 +458,8 @@ 8.1 Bootstrapping
Remember that we can think of this histogram as an estimate of our population distribution histogram that we saw above. We are interested in the population mean rating and trying to find a range of plausible values for that value. A good start in guessing the population mean is to use the mean of our sample rating from the movies_sample data:
-(movies_sample_mean <- movies_sample %>% summarize(mean = mean(rating)))
+(movies_sample_mean <- movies_sample %>%
+ summarize(mean = mean(rating)))
## # A tibble: 1 × 1
## mean
## <dbl>
@@ -640,7 +641,7 @@ 8.3 Effect size
8.4 Conclusion
8.4.1 Script of R code
-An R script file of all R code used in this chapter is available here.
+An R script file of all R code used in this chapter is available here.
8.4.2 What’s to come?
@@ -692,7 +693,7 @@ References
"size": 2
},
"edit": {
-"link": "https://github.com/ismayc/moderndiver-source/edit/master/08-ci.Rmd",
+"link": "https://github.com/ismayc/moderndiver-book/edit/master/08-ci.Rmd",
"text": "Edit"
},
"download": ["ismaykim.pdf"],
diff --git a/docs/9-regress.html b/docs/9-regress.html
index 364c4b34e..635723200 100644
--- a/docs/9-regress.html
+++ b/docs/9-regress.html
@@ -26,7 +26,7 @@
-
+
@@ -371,7 +371,7 @@
- B.6.5 Comparing results
-- C Reach for the Starts
+- C Reach for the Stars
- Needed packages
- C.1 Sorted barplots
- C.2 Interactive graphics
@@ -703,7 +703,7 @@ 9.6 Conditions for regression
9.7 Conclusion
9.7.1 Script of R code
-An R script file of all R code used in this chapter is available here.
+An R script file of all R code used in this chapter is available here.
9.7.2 What’s to come?
@@ -749,7 +749,7 @@ 9.7.2 What’s to come?
"size": 2
},
"edit": {
-"link": "https://github.com/ismayc/moderndiver-source/edit/master/09-regress.Rmd",
+"link": "https://github.com/ismayc/moderndiver-book/edit/master/09-regress.Rmd",
"text": "Edit"
},
"download": ["ismaykim.pdf"],
diff --git a/docs/A-appendixA.html b/docs/A-appendixA.html
index 58b85eaf0..1ff4bb795 100644
--- a/docs/A-appendixA.html
+++ b/docs/A-appendixA.html
@@ -26,7 +26,7 @@
-
+
@@ -371,7 +371,7 @@
- B.6.5 Comparing results
-- C Reach for the Starts
+- C Reach for the Stars
- Needed packages
- C.1 Sorted barplots
- C.2 Interactive graphics
@@ -468,7 +468,7 @@ A.1.6 Outliers
"size": 2
},
"edit": {
-"link": "https://github.com/ismayc/moderndiver-source/edit/master/91-appendixA.Rmd",
+"link": "https://github.com/ismayc/moderndiver-book/edit/master/91-appendixA.Rmd",
"text": "Edit"
},
"download": ["ismaykim.pdf"],
diff --git a/docs/B-appendixB.html b/docs/B-appendixB.html
index f96c30831..6cdcbbc14 100644
--- a/docs/B-appendixB.html
+++ b/docs/B-appendixB.html
@@ -26,7 +26,7 @@
-
+
@@ -371,7 +371,7 @@
- B.6.5 Comparing results
-- C Reach for the Starts
+- C Reach for the Stars
- Needed packages
- C.1 Sorted barplots
- C.2 Interactive graphics
@@ -1469,7 +1469,7 @@ References
"size": 2
},
"edit": {
-"link": "https://github.com/ismayc/moderndiver-source/edit/master/92-appendixB.Rmd",
+"link": "https://github.com/ismayc/moderndiver-book/edit/master/92-appendixB.Rmd",
"text": "Edit"
},
"download": ["ismaykim.pdf"],
diff --git a/docs/C-appendixC.html b/docs/C-appendixC.html
index df2011ca3..9bae9b5d7 100644
--- a/docs/C-appendixC.html
+++ b/docs/C-appendixC.html
@@ -26,7 +26,7 @@
-
+
@@ -371,7 +371,7 @@
- B.6.5 Comparing results
-- C Reach for the Starts
+- C Reach for the Stars
- Needed packages
- C.1 Sorted barplots
- C.2 Interactive graphics
@@ -397,7 +397,7 @@
-C Reach for the Starts
+C Reach for the Stars
Needed packages
library(dplyr)
@@ -448,7 +448,7 @@ C.2.1 Interactive line-graphs
select(flights_summarized, -date)
dyRangeSelector(dygraph(flights_summarized))
-
+
The syntax here is a little different than what we have covered so far. The dygraph function expects the dates to be given as the rownames of the object. We then remove the date variable from the flights_summarized data frame since it is accounted for in the rownames. Lastly, we run the dygraph function on the new data frame that only contains the median arrival delay as a column and then provide the ability to have a selector to zoom in on the interactive plot via dyRangeSelector. (Note that this plot will only be interactive in the HTML version of this book.)
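The reshaping step described above (dates moved into the rownames, then the date column dropped) can be sketched in base R. This is an illustration under assumptions: the three rows of data and the column name median_arr_delay are hypothetical, and base subsetting is used in place of the select call shown in the chapter. The dygraph call itself is left out so the snippet runs without the dygraphs package.

```r
# Hypothetical daily summary; the values and the column name
# median_arr_delay are made up for illustration.
flights_summarized <- data.frame(
  date = as.Date(c("2013-01-01", "2013-01-02", "2013-01-03")),
  median_arr_delay = c(3, 4, 1)
)

# Move the dates into the row names...
rownames(flights_summarized) <- as.character(flights_summarized$date)

# ...then drop the now-redundant date column, keeping a data frame.
flights_summarized <- flights_summarized[, "median_arr_delay", drop = FALSE]

rownames(flights_summarized)
names(flights_summarized)
```

The drop = FALSE argument matters here: without it, selecting a single column would return a plain numeric vector and the row names carrying the dates would be lost.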