links
ismayc committed Nov 10, 2024
1 parent f1c6300 commit e025016
Showing 6 changed files with 28 additions and 28 deletions.
28 changes: 14 additions & 14 deletions 05-regression.Rmd
@@ -207,7 +207,7 @@ UN_data_ch5 |>

However, what if we want other summary statistics as well, such as the standard deviation (a measure of spread), the minimum and maximum values, and various percentiles?

Typing out all these summary statistic functions in `summarize()` would be long and tedious. Instead, we use the convenient `tidy_summary()` function from the `moderndive` \index{R packages!moderndive!tidy\_summary()} package.
Typing out all these summary statistic functions in `summarize()` would be long and tedious. Instead, we use the convenient [`tidy_summary()`](https://moderndive.github.io/moderndive/reference/tidy_summary.html) function from the `moderndive` \index{R packages!moderndive!tidy\_summary()} package.
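To make the "long and tedious" route concrete, here is a minimal base R sketch of typing out each summary statistic by hand. It uses the built-in `mtcars` data frame as a stand-in (an assumption for illustration; it is not the chapter's `UN_data_ch5` data), and the name `long_summary` is hypothetical.

```r
# Illustrative only: the built-in `mtcars` data stands in for the chapter's data.
x <- mtcars$mpg

# Every statistic has to be requested one at a time.
long_summary <- data.frame(
  n      = sum(!is.na(x)),
  mean   = mean(x, na.rm = TRUE),
  sd     = sd(x, na.rm = TRUE),
  min    = min(x, na.rm = TRUE),
  Q1     = unname(quantile(x, 0.25, na.rm = TRUE)),
  median = median(x, na.rm = TRUE),
  Q3     = unname(quantile(x, 0.75, na.rm = TRUE)),
  max    = max(x, na.rm = TRUE)
)
long_summary
```

Repeating this for every variable of interest quickly becomes unwieldy, which is the motivation for a single summarizing function.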

This function takes in a data frame, summarizes it, and returns commonly used summary statistics in tidy format. We take our `UN_data_ch5` data frame, `select()` only the outcome and explanatory variables `fert_rate` and `life_exp`, and pipe them into the `tidy_summary()` function:

@@ -228,7 +228,7 @@ UN_data_ch5 |>
)
```

We can also do this more directly by providing which `columns` we'd like a summary of inside the `tidy_summary()` function:
We can also do this more directly by providing which `columns` we'd like a summary of inside the [`tidy_summary()`](https://moderndive.github.io/moderndive/reference/tidy_summary.html) function:

```{r eval=FALSE}
UN_data_ch5 |>
@@ -274,7 +274,7 @@ life_summary_df <- summary_df |>

Looking at this output, we can see how the values of both variables are distributed. For example, the median fertility rate was `r fert_summary_df$median[1]`, whereas the median life expectancy was `r life_summary_df$median[1]` years. The middle 50% of fertility rates was between `r fert_summary_df$Q1[[1]]` and `r fert_summary_df$Q3[[1]]` (the first and third quartiles), and the middle 50% of life expectancies was from `r life_summary_df$Q1[[1]]` to `r life_summary_df$Q3[[1]]`.

The `tidy_summary()` function only returns what are known as *univariate* \index{univariate} summary statistics: functions that take a single variable and return some numerical summary of that variable. However, there also exist *bivariate* \index{bivariate} summary statistics: functions that take in two variables and return some summary of those two variables.
The [`tidy_summary()`](https://moderndive.github.io/moderndive/reference/tidy_summary.html) function only returns what are known as *univariate* \index{univariate} summary statistics: functions that take a single variable and return some numerical summary of that variable. However, there also exist *bivariate* \index{bivariate} summary statistics: functions that take in two variables and return some summary of those two variables.

In particular, when the two variables are numerical, we can compute the \index{correlation (coefficient)} *correlation coefficient*. Generally speaking, *coefficients* are quantitative expressions of a specific phenomenon. A *correlation coefficient* is a quantitative expression of the *strength of the linear relationship between two numerical variables*. Its value ranges from -1 to 1, where:

@@ -329,7 +329,7 @@ if (is_latex_output()) {

For example, observe in the top right plot that for a correlation coefficient of -0.75 there is a negative linear relationship between $x$ and $y$, but it is not as strong as the negative linear relationship between $x$ and $y$ when the correlation coefficient is -0.9 or -1.

The correlation coefficient can be computed using the `get_correlation()` \index{R packages!moderndive!get\_correlation()} function in the `moderndive` package. In this case, the inputs to the function are the two numerical variables for which we want to calculate the correlation coefficient.
The correlation coefficient can be computed using the [`get_correlation()`](https://moderndive.github.io/moderndive/reference/get_correlation.html) \index{R packages!moderndive!get\_correlation()} function in the `moderndive` package. In this case, the inputs to the function are the two numerical variables for which we want to calculate the correlation coefficient.

We put the name of the outcome variable on the left-hand side of the `~` "tilde" sign, while putting the name of the explanatory variable on the right-hand side. This is known as R's \index{R!formula notation} *formula notation*. We will use this same "formula" syntax with regression later in this chapter.
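As a rough illustration of the formula notation, the sketch below computes a correlation coefficient with base R's `cor()` rather than `get_correlation()`, so that it runs without any extra packages. The built-in `mtcars` data and the object names `f`, `vars`, `r_direct`, and `r_formula` are assumptions made for this example only.

```r
# Base R stand-in; `mtcars` is illustrative, not the chapter's data.
r_direct <- cor(mtcars$mpg, mtcars$wt)

# The same "outcome ~ explanatory" formula notation described in the text.
f <- mpg ~ wt
vars <- all.vars(f)  # variable names in order: outcome first, then explanatory

# Pull the two columns named in the formula and correlate them.
r_formula <- cor(mtcars[[vars[1]]], mtcars[[vars[2]]])

r_direct
r_formula
```

Both routes give the same value. The correlation coefficient itself is symmetric in the two variables, so the left-hand/right-hand ordering matters for readability and for regression later, not for the number returned here.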

@@ -478,7 +478,7 @@ First, we "fit" the linear regression model to the `data` using the `lm()` \inde

* `y` is the outcome variable, followed by a tilde `~`. In our case, `y` is set to `fert_rate`.
* `x` is the explanatory variable. In our case, `x` is set to `life_exp`.
* The combination of `y ~ x` is called a *model formula*. (Note the order of `y` and `x`.) In our case, the model formula is `fert_rate ~ life_exp`. We saw such model formulas earlier when we computed the correlation coefficient using the `get_correlation()` function in Subsection \@ref(model1EDA).
* The combination of `y ~ x` is called a *model formula*. (Note the order of `y` and `x`.) In our case, the model formula is `fert_rate ~ life_exp`. We saw such model formulas earlier when we computed the correlation coefficient using the [`get_correlation()`](https://moderndive.github.io/moderndive/reference/get_correlation.html) function in Subsection \@ref(model1EDA).
* `data_frame_name` is the name of the data frame that contains the variables `y` and `x`. In our case, `data_frame_name` is set to `UN_data_ch5`.

Second, we take the saved model in `demographics_model` and apply the `coef()` function to it to obtain the regression coefficients. This gives us the components of the regression equation line: the intercept $b_0$ and the slope $b_1$.
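The two steps can be sketched on a data set that ships with R. Here `mtcars`, with `mpg` as the outcome and `wt` as the explanatory variable, is an assumption chosen purely for illustration, as are the object names `model`, `b`, `b0`, and `b1`.

```r
# Step 1: fit the model with a formula of the form y ~ x.
# `mtcars` is a stand-in for the chapter's data frame.
model <- lm(mpg ~ wt, data = mtcars)

# Step 2: extract the regression coefficients from the saved model.
b <- coef(model)
b0 <- b[["(Intercept)"]]  # intercept of the fitted line
b1 <- b[["wt"]]           # slope for the explanatory variable

b
```

The named vector returned by `coef()` has one entry for the intercept and one for each explanatory variable, so the slope can be pulled out by the variable's name.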
@@ -572,7 +572,7 @@ best_fit_plot

Now say we want to compute both the fitted value $\widehat{y} = b_0 + b_1 \cdot x$ and the residual $y - \widehat{y}$ for *all* `r n_demo_ch5` UN member states with complete data as of 2024. Recall that each country corresponds to one of the `r n_demo_ch5` rows in the `UN_data_ch5` data frame and also one of the `r n_demo_ch5` points in the regression plot in Figure \@ref(fig:numxplot4).

We could repeat the previous calculations we performed by hand `r n_demo_ch5` times, but that would be tedious and time-consuming. Instead, we do this using a computer with the `get_regression_points()` function. We apply the `get_regression_points()` function to `demographics_model`, which is where we saved our `lm()` model in the previous section. In Table \@ref(tab:regression-points-1) we present the results of only the 21st through 24th countries for brevity's sake.
We could repeat the previous calculations we performed by hand `r n_demo_ch5` times, but that would be tedious and time-consuming. Instead, we do this using a computer with the [`get_regression_points()`](https://moderndive.com/v2/moderndive.github.io/moderndive/reference/get_regression_points.html) function. We apply the [`get_regression_points()`](https://moderndive.com/v2/moderndive.github.io/moderndive/reference/get_regression_points.html) function to `demographics_model`, which is where we saved our `lm()` model in the previous section. In Table \@ref(tab:regression-points-1) we present the results of only the 21st through 24th countries for brevity's sake.

```{r, eval=FALSE}
regression_points <- get_regression_points(demographics_model)
@@ -605,7 +605,7 @@ This function is an example of what is known in computer programming as a *wrapp
include_graphics("images/shutterstock/wrapper_function.png")
```

So all you need to worry about is what the inputs look like and what the outputs look like; you leave all the other details "under the hood of the car." In our regression modeling example, the `get_regression_points()` function takes a saved `lm()` linear regression model as input and returns a data frame of the regression predictions as output. If you are interested in learning more about the `get_regression_points()` function's inner workings, check out Subsection \@ref(underthehood).
So all you need to worry about is what the inputs look like and what the outputs look like; you leave all the other details "under the hood of the car." In our regression modeling example, the [`get_regression_points()`](https://moderndive.com/v2/moderndive.github.io/moderndive/reference/get_regression_points.html) function takes a saved `lm()` linear regression model as input and returns a data frame of the regression predictions as output. If you are interested in learning more about the [`get_regression_points()`](https://moderndive.com/v2/moderndive.github.io/moderndive/reference/get_regression_points.html) function's inner workings, check out Subsection \@ref(underthehood).
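To illustrate the wrapper idea in general terms, here is a small hypothetical wrapper written in base R. It is not the `moderndive` implementation: the function name `regression_points_sketch()`, its output column names, and the use of the built-in `mtcars` data are all assumptions made for this sketch.

```r
# Hypothetical wrapper for illustration; NOT the moderndive source code.
# Input: a saved lm() model. Output: one row per observation.
regression_points_sketch <- function(model) {
  data.frame(
    observed = unname(model.response(model.frame(model))),  # y
    fitted   = unname(fitted(model)),                       # y-hat
    residual = unname(residuals(model))                     # y - y-hat
  )
}

# `mtcars` is a stand-in for the chapter's data.
model <- lm(mpg ~ wt, data = mtcars)
pts <- regression_points_sketch(model)
head(pts, 3)
```

The caller only sees a model going in and a data frame coming out; the calls to `model.frame()`, `fitted()`, and `residuals()` stay under the hood.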

We inspect the individual columns and match them with the elements of Figure \@ref(fig:numxplot4):

@@ -701,7 +701,7 @@ glimpse(gapminder2022)

Observe that ``Rows: `r gapminder2022_rows` `` indicates that there are `r gapminder2022_rows` rows/observations in `gapminder2022`, where each row corresponds to one country. In other words, the *observational unit* is an individual country. Furthermore, observe that the variable `continent` is of type `<fct>`, which stands for _factor_, which is R's way of encoding categorical variables.
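For readers new to factors, the following minimal sketch shows how R stores a categorical variable. The vector `continent` here is a small made-up example (an assumption for illustration), not the `continent` column of `gapminder2022`.

```r
# A made-up categorical variable, for illustration only.
continent <- factor(c("Asia", "Europe", "Africa", "Asia", "Europe", "Asia"))

class(continent)       # the variable's type is "factor"
levels(continent)      # the distinct categories, sorted alphabetically by default
table(continent)       # how many observations fall in each level
as.integer(continent)  # the integer codes R stores behind the labels
```

Each value is stored as an integer code that points to one of the levels, which is why modeling functions such as `lm()` can treat the variable as a set of groups rather than as free text.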

A full description of all the variables included in `un_member_states_2024` can be found by reading the associated help file (run `?un_member_states_2024` in the console). However, we fully describe only the `r ncol(gapminder2022)` variables we selected in `gapminder2022`:
A full description of all the variables included in `un_member_states_2024` can be found by reading the associated help file (run [`?un_member_states_2024`](https://moderndive.github.io/moderndive/reference/un_member_states_2024.html) in the console). However, we fully describe only the `r ncol(gapminder2022)` variables we selected in `gapminder2022`:

1. `country`: An identification variable of type character/text used to distinguish the 142 countries in the dataset.
1. `life_exp`: A numerical variable of that country's life expectancy at birth. This is the outcome variable $y$ of interest.
@@ -728,7 +728,7 @@ gapminder2022 |>
)
```

Random sampling will likely produce a different subset of 3 rows for you than what's shown. Now that we have looked at the raw values in our `gapminder2022` data frame and got a sense of the data, we compute summary statistics. We again apply `tidy_summary()` from the `moderndive` package. Recall that this function takes in a data frame, summarizes it, and returns commonly used summary statistics. We take our `gapminder2022` data frame, `select()` only the outcome and explanatory variables `life_exp` and `continent`, and pipe them into `tidy_summary()`:
Random sampling will likely produce a different subset of 3 rows for you than what's shown. Now that we have looked at the raw values in our `gapminder2022` data frame and got a sense of the data, we compute summary statistics. We again apply [`tidy_summary()`](https://moderndive.github.io/moderndive/reference/tidy_summary.html) from the `moderndive` package. Recall that this function takes in a data frame, summarizes it, and returns commonly used summary statistics. We take our `gapminder2022` data frame, `select()` only the outcome and explanatory variables `life_exp` and `continent`, and pipe them into [`tidy_summary()`](https://moderndive.github.io/moderndive/reference/tidy_summary.html):

```{r eval=FALSE}
gapminder2022 |> select(life_exp, continent) |> tidy_summary()
@@ -754,7 +754,7 @@ gapminder2022 |>
gapminder2022 |> count(continent)
```

The `tidy_summary()` output now reports summaries for categorical variables and for the numerical variables we reviewed before. Let's focus just on discussing the results for the categorical `factor` variable `continent`:
The [`tidy_summary()`](https://moderndive.github.io/moderndive/reference/tidy_summary.html) output now reports summaries for categorical variables and for the numerical variables we reviewed before. Let's focus just on discussing the results for the categorical `factor` variable `continent`:

- `n`: The number of non-missing entries for each group
- `group`: Breaks down a categorical variable into its unique levels. For this variable, the levels correspond to Africa, Asia, North and South America, Europe, and Oceania.
@@ -1097,7 +1097,7 @@ Recall in Subsection \@ref(model1points), we defined the following three concept
1. Fitted values $\widehat{y}$, or the value on the regression line for a given $x$ value
1. Residuals $y - \widehat{y}$, or the error between the observed value and the fitted value

We obtained these values and other values using the `get_regression_points()` function from the `moderndive` package. This time, however, we add an argument setting `ID = "country"`: this is telling the function to use the variable `country` in `gapminder2022` as an *identification variable* in the output. This will help contextualize our analysis by matching values to countries.
We obtained these values and other values using the [`get_regression_points()`](https://moderndive.com/v2/moderndive.github.io/moderndive/reference/get_regression_points.html) function from the `moderndive` package. This time, however, we add an argument setting `ID = "country"`: this is telling the function to use the variable `country` in `gapminder2022` as an *identification variable* in the output. This will help contextualize our analysis by matching values to countries.

```{r, eval=FALSE}
regression_points <- get_regression_points(life_exp_model, ID = "country")
@@ -1427,11 +1427,11 @@ Compute the sum of squared residuals by hand for each line and show that of thes

Recall in this chapter we introduced a wrapper function from the `moderndive` package:

- `get_regression_points()` that returns point-by-point information from a regression model in Subsection \@ref(model1points).
- [`get_regression_points()`](https://moderndive.com/v2/moderndive.github.io/moderndive/reference/get_regression_points.html) that returns point-by-point information from a regression model in Subsection \@ref(model1points).

What is going on behind the scenes with the <!-- `get_regression_table()` and --> `get_regression_points()` function? We mentioned in Subsection \@ref(model1table) that this was an example of a *wrapper function*. Such functions take other pre-existing functions and "wrap" them into single functions that hide their inner workings from the user. This way all the user needs to worry about is what the inputs look like and what the outputs look like. In this subsection, we'll "get under the hood" of these functions and see how the "engine" of these wrapper functions works.
What is going on behind the scenes with the <!-- `get_regression_table()` and --> [`get_regression_points()`](https://moderndive.com/v2/moderndive.github.io/moderndive/reference/get_regression_points.html) function? We mentioned in Subsection \@ref(model1table) that this was an example of a *wrapper function*. Such functions take other pre-existing functions and "wrap" them into single functions that hide their inner workings from the user. This way all the user needs to worry about is what the inputs look like and what the outputs look like. In this subsection, we'll "get under the hood" of these functions and see how the "engine" of these wrapper functions works.

The `get_regression_points()` function is a wrapper function, returning information about the individual points involved in a regression model like the fitted values, observed values, and the residuals. `get_regression_points()` \index{R packages!moderndive!get\_regression\_points()} uses the `augment()` \index{R packages!broom!augment()} function in the [`broom` package](https://broom.tidyverse.org/) <!-- instead of the `tidy()` function as with `get_regression_table()` --> to produce the data shown in Table \@ref(tab:regpoints-augment). Additionally, it uses `clean_names()` \index{R packages!janitor!clean\_names()} from the [`janitor` package](https://github.com/sfirke/janitor) [@R-janitor] to clean up the variable names.
The [`get_regression_points()`](https://moderndive.com/v2/moderndive.github.io/moderndive/reference/get_regression_points.html) function is a wrapper function, returning information about the individual points involved in a regression model like the fitted values, observed values, and the residuals. [`get_regression_points()`](https://moderndive.com/v2/moderndive.github.io/moderndive/reference/get_regression_points.html) \index{R packages!moderndive!get\_regression\_points()} uses the `augment()` \index{R packages!broom!augment()} function in the [`broom` package](https://broom.tidyverse.org/) <!-- instead of the `tidy()` function as with `get_regression_table()` --> to produce the data shown in Table \@ref(tab:regpoints-augment). Additionally, it uses `clean_names()` \index{R packages!janitor!clean\_names()} from the [`janitor` package](https://github.com/sfirke/janitor) [@R-janitor] to clean up the variable names.


```{r, eval=FALSE}
2 changes: 1 addition & 1 deletion 06-multiple-regression.Rmd
@@ -576,7 +576,7 @@ the same slope: they are parallel as shown
in Figure \@ref(fig:numxcatx-parallel).

To plot parallel slopes we use the function
`geom_parallel_slopes()`\index{R packages!moderndive!geom\_parallel\_slopes()}
[`geom_parallel_slopes()`](https://moderndive.github.io/moderndive/reference/geom_parallel_slopes.html) \index{R packages!moderndive!geom\_parallel\_slopes()}
that is included in the `moderndive` package. To use this function you need
to load both the `ggplot2` and `moderndive` packages. Observe how the
code is identical to the one used for the model with interactions in
2 changes: 1 addition & 1 deletion 08-confidence-intervals.Rmd
@@ -204,7 +204,7 @@ The total number of almonds in the bowl is 5,000. The population mean is

$$\mu = \sum_{i=1}^{5000}\frac{x_i}{5000}=`r mu`,$$

and the population standard deviation, `pop_sd()`, from `moderndive`, is defined as
and the population standard deviation, [`pop_sd()`](https://moderndive.github.io/moderndive/reference/pop_sd.html), from `moderndive`, is defined as

$$\sigma = \sqrt{\sum_{i=1}^{5000} \frac{(x_i - \mu)^2}{5000}}=`r sigma`.$$
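As a small worked check of the definition, the sketch below computes a population standard deviation by hand: the squared deviations from the mean are averaged over all $N$ values (not $N - 1$) and the square root is taken. The helper name `pop_sd_sketch()` and the eight-value vector `x` are hypothetical and used only for illustration; this is not the source code of `pop_sd()`.

```r
# Hypothetical helper for illustration; not the moderndive `pop_sd()` source.
pop_sd_sketch <- function(x) {
  mu <- mean(x)
  sqrt(sum((x - mu)^2) / length(x))  # divide by N, then take the square root
}

# A tiny made-up "population" of eight values with mean 5.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

pop_sd_sketch(x)  # 2: squared deviations sum to 32, 32 / 8 = 4, sqrt(4) = 2
sd(x)             # base R's sd() divides by N - 1, so it is slightly larger
```

The two are related by a factor of $\sqrt{(N-1)/N}$, which is negligible for a population as large as the 5,000 almonds but noticeable for small data sets like this one.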

2 changes: 1 addition & 1 deletion 09-hypothesis-testing.Rmd
@@ -1794,7 +1794,7 @@ On top of the many common misunderstandings about hypothesis testing and $p$-val

1. [Misuse of $p$-values](https://en.wikipedia.org/wiki/Misuse_of_p-values)
2. [What a nerdy debate about $p$-values shows about science - and how to fix it](https://www.vox.com/science-and-health/2017/7/31/16021654/p-values-statistical-significance-redefine-0005)
3. [Statisticians issue warning over misuse of $P$ values](https://www.nature.com/news/statisticians-issue-warning-over-misuse-of-p-values-1.19503)
3. [Statisticians issue warning over misuse of $P$ values](https://www.nature.com/articles/nature.2016.19503)
4. [You Cannot Trust What You Read About Nutrition](https://fivethirtyeight.com/features/you-cant-trust-what-you-read-about-nutrition/)
5. [A Litany of Problems with p-values](http://www.fharrell.com/post/pval-litany/)
