Commit

Updating book
rafalab committed May 13, 2024
1 parent 8811d69 commit 5aa6d7f
Showing 86 changed files with 1,191 additions and 2,837 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -16,4 +16,5 @@ fixsh.R
.DS_Store
*.tex
*.pdf
ziptex
ziptex
crc-stuff/
10 changes: 5 additions & 5 deletions R/R-basics.qmd
@@ -240,7 +240,7 @@ or
?">"
```
### Other prebuilt objects
### Prebuilt objects
There are several datasets that are included for users to practice and test out functions. You can see all the available datasets by typing:
@@ -288,7 +288,7 @@ Values remain in the workspace until you end your session or erase them with the
We actually recommend against saving the workspace this way because, as you start working on different projects, it will become harder to keep track of what is saved. Instead, we recommend you assign the workspace a specific name. You can do this by using the function `save` or `save.image`. To load, use the function `load`. When saving a workspace, we recommend the suffix `rda` or `RData`. In RStudio, you can also do this by navigating to the *Session* tab and choosing *Save Workspace as*. You can later load it using the *Load Workspace* options in the same tab. You can read the help pages on `save`, `save.image`, and `load` to learn more.
### Motivating scripts
### Why use scripts?
To solve another equation such as $3x^2 + 2x -1$, we can copy and paste the code above and then redefine the variables and recompute the solution:
@@ -352,7 +352,7 @@ To see that this is in fact a data frame, we type:
class(murders)
```
### Examining an object
### Examining objects
The function `str` is useful for finding out more about the structure of an object:
@@ -400,7 +400,7 @@ It is important to know that the order of the entries in `murders$population` pr
R comes with a very nice auto-complete functionality that saves us the trouble of typing out all the names. Try typing `murders$p` then hitting the *tab* key on your keyboard. This functionality and many other useful auto-complete features are available when working in RStudio.
:::
### Vectors: numerics, characters, and logical
### Vectors
The object `murders$population` is not one number but several. We call these types of objects *vectors*. A single number is technically a vector of length 1, but in general we use the term vectors to refer to objects with several entries. The function `length` tells you how many entries are in the vector:
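As a small illustration that does not depend on the `murders` data, the sketch below builds a made-up numeric vector and a single number, then checks their lengths. The names `pop` and `x` and all values are assumptions chosen only for this example.

```{r}
# Made-up values for illustration; not taken from the murders dataset
pop <- c(100, 250, 75)
x <- 5

n_pop <- length(pop)  # number of entries in the vector
n_x <- length(x)      # a single number is a vector of length 1

n_pop
n_x
```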
@@ -467,7 +467,7 @@ In the background, R stores these *levels* as integers and keeps a map to keep t
Note that the levels have an order that is different from the order of appearance in the factor object. The default in R is for the levels to follow alphabetical order. However, often we want the levels to follow a different order. You can specify an order through the `levels` argument when creating the factor with the `factor` function. For example, in the murders dataset regions are ordered from east to west. The function `reorder` lets us change the order of the levels of a factor variable based on a summary computed on a numeric vector. We will demonstrate this with a simple example, and will see more advanced ones in the Data Visualization part of the book.
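A minimal sketch of the `levels` argument, using a toy character vector rather than the murders data. The object names `region_chr`, `f_default`, and `f_custom` are hypothetical and appear only here.

```{r}
# Toy data (hypothetical): regions in order of appearance
region_chr <- c("West", "South", "Northeast", "South", "West")

# Default behavior: levels follow alphabetical order
f_default <- factor(region_chr)
levels(f_default)

# Custom order supplied through the levels argument
f_custom <- factor(region_chr, levels = c("West", "South", "Northeast"))
levels(f_custom)
```

The entries of the factor are the same in both cases; only the order of the levels changes, which is what matters for tables and plots.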
Suppose we want the levels of region ordered by the total number of murders rather than alphabetically. If there are values associated with each level, we can use the `reorder` function and specify a data summary to determine the order. The following code takes the sum of the total murders in each region, and reorders the factor following these sums.
Suppose we want the levels of region ordered by the total number of murders rather than alphabetically. If there are values associated with each level, we can use the `reorder` function and specify a data summary to determine the order. The following code takes the sum of the total murders in each region, and reorders the factor following these sums.
```{r}
region <- murders$region
2 changes: 0 additions & 2 deletions R/data-table.qmd
@@ -229,8 +229,6 @@ To sort the table in descending order, we can order by the negative of `populati
murders_dt[order(population, decreasing = TRUE)]
```

### Nested sorting

Similarly, we can perform nested ordering by including more than one variable in order:

```{r, eval=FALSE}
2 changes: 1 addition & 1 deletion R/importing-data.qmd
@@ -7,7 +7,7 @@ We have been using datasets already stored as R objects. In data analysis work w

In this chapter, we outline how to load data from a file into R. First, it's crucial to identify the file's location; thus, we touch on file paths and working directories (detailed in @sec-unix). Next, we delve into file types (text or binary) and encodings (like ASCII and Unicode), both essential for data import. We then introduce popular functions for data importing, referred to as _parsers_. Lastly, we offer tips on how to store data in spreadsheets. Advanced topics like extracting data from websites or PDFs will be discussed in the book's Data Wrangling section.

## Paths and the working directory
## Navigating and managing the filesystem

The first step when importing data from a spreadsheet is to locate the file containing the data. Although we do not recommend it, you can use an approach similar to what you do to open files in Microsoft Excel by clicking on the RStudio "File" menu, clicking "Import Dataset", then clicking through folders until you find the file. However, we write code rather than use the point-and-click approach. The key concepts we need to learn to do this are described in detail in @sec-unix. Here we provide an overview of the very basics.

9 changes: 6 additions & 3 deletions R/tidyverse.qmd
@@ -185,7 +185,7 @@ Remember that the pipe sends values to the first argument, so we can define othe
Therefore, when using the pipe with data frames and __dplyr__, we no longer need to specify the required first argument since the __dplyr__ functions we have described all take the data as the first argument. In the code we wrote:

```{r, eval=FALSE}
murders |> select(state, region, rate) |> filter(rate <= 0.71)
murders |> select(state, region, rate) |> filter(rate <= 0.71)
```
`murders` is the first argument of the `select` function, and the new data frame (formerly `new_dataframe`) is the first argument of the `filter` function.

@@ -200,7 +200,7 @@ An important part of exploratory data analysis is summarizing data. The average
library(tidyverse)
```

### `summarize` {#sec-summarize}
### The `summarize` function {#sec-summarize}

The `summarize` function in __dplyr__ provides a way to compute summary statistics with intuitive and readable code. We start with a simple example based on heights. The `heights` dataset includes heights and sex reported by students in an in-class survey.

@@ -312,7 +312,7 @@ murders |>
summarize(median_min_max(rate))
```

### `pull`
### Extracting variables with `pull`

The `us_murder_rate` object defined in @sec-summarize represents just one number. Yet we are storing it in a data frame:

@@ -396,6 +396,9 @@ murders |> group_by(region) |> class()

The `tbl`, pronounced "tibble", is a special kind of data frame. The functions `group_by` and `summarize` always return this type of data frame. The `group_by` function returns a special kind of `tbl`, the `grouped_df`. We will say more about these later. For consistency, the __dplyr__ manipulation verbs (`select`, `filter`, `mutate`, and `arrange`) preserve the class of the input: if they receive a regular data frame they return a regular data frame, while if they receive a tibble they return a tibble. But tibbles are the preferred format in the tidyverse and as a result tidyverse functions that produce a data frame from scratch return a tibble. For example, in @sec-importing-data we will see that tidyverse functions used to import data create tibbles.


## Tibbles versus data frames

Tibbles are very similar to data frames. In fact, you can think of them as a modern version of data frames. Nonetheless there are some important differences which we describe next.


8 changes: 5 additions & 3 deletions _quarto.yml
@@ -50,19 +50,21 @@ book:
- wrangling/joining-tables.qmd
- wrangling/dates-and-times.qmd
- wrangling/locales.qmd
- wrangling/data-table-wrangling.qmd
- wrangling/web-scraping.qmd
- wrangling/string-processing.qmd
- wrangling/text-analysis.qmd

- part: productivity/intro-productivity.qmd
chapters:
#- productivity/installing-r-and-rstudio.qmd
- productivity/installing-git.qmd
- productivity/unix.qmd
- productivity/git.qmd
- productivity/reproducible-projects.qmd

# - part: productivity/installations.qmd
# chapters:
# - productivity/installing-r-and-rstudio.qmd
# - productivity/installing-git.qmd

format:
html:
theme:
10 changes: 4 additions & 6 deletions dataviz/dataviz-in-practice.qmd
@@ -29,8 +29,6 @@ library(dslabs)
gapminder |> as_tibble()
```

### Hans Rosling's quiz

As done in the *New Insights on Poverty* video, we start by testing our knowledge regarding differences in child mortality across different countries. For each of the six pairs of countries below, which country do you think had the highest child mortality rates in 2015? Which pairs do you think are most similar?

1. Sri Lanka or Turkey
@@ -213,8 +211,6 @@ gapminder |> filter(country %in% countries & !is.na(fertility)) |>

The plot clearly shows how South Korea's fertility rate dropped drastically during the 1960s and 1970s, and by 1990 had a similar rate to that of Germany.

### Labels instead of legends

For trend plots we recommend labeling the lines rather than using legends since the viewer can quickly see which line is which country. This suggestion actually applies to most plots: labeling is usually preferred over legends.

We demonstrate how we can do this using the `geomtextpath` package. We define a data table with the label locations and then use a second mapping just for these labels:
@@ -686,7 +682,7 @@ The 1998 paper has since been retracted and Andrew Wakefield was eventually "str
Effective communication of data is a strong antidote to misinformation and fear-mongering. In the introduction to this part of the book we showed an example, provided by a Wall Street Journal article^[<http://graphics.wsj.com/infectious-diseases-and-vaccines/>], showing data related to the impact of vaccines on battling infectious diseases. Here we reconstruct that example.


### Data
### Vaccine data

The data used for these plots were collected, organized, and distributed by the Tycho Project^[<http://www.tycho.pitt.edu/>]. They include weekly reported counts for seven diseases from 1928 to 2011, from all fifty states. We include the yearly totals in the __dslabs__ package:

@@ -709,7 +705,7 @@ dat <- us_contagious_diseases |>
```


### Trend plots and heatmaps
### Trend plots

We can now easily plot disease rates per year. Here are the measles data from California:

@@ -724,6 +720,8 @@ dat |> filter(state == "California" & !is.na(rate)) |>
We add a vertical line at 1963 since this is when the vaccine was introduced^[Centers for Disease Control and Prevention (2014). CDC health information for international travel 2014 (the yellow book). p. 250. ISBN 9780199948505].


## Heatmaps

Now can we show data for all states in one plot? We have three variables to show: year, state, and rate. In the WSJ figure, they use the x-axis for year, the y-axis for state, and color hue to represent rates. However, the color scale they use, which goes from yellow to blue to green to orange to red, can be improved.

In our example, we want to use a sequential palette since there is no meaningful center, just low and high rates.
4 changes: 2 additions & 2 deletions dataviz/dataviz-principles.qmd
@@ -324,7 +324,7 @@ heights |>
facet_grid(.~sex)
```

### Align plots vertically to see horizontal changes and horizontally to see vertical changes
### Aligning plots for comparisons

In these histograms, the visual cues related to decreases or increases in height are shifts to the left or right, respectively: horizontal changes. Aligning the plots vertically helps us see this change when the axes are fixed:

@@ -358,7 +358,7 @@ grid.arrange(p1, p2, p3, ncol = 3)

Notice how much more we learn from the two plots on the right. Barplots are useful for showing one number, but not very useful when we want to describe distributions.

## Consider transformations
## Transformations

We have motivated the use of the log transformation in cases where the changes are multiplicative. Population size was an example in which we found a log transformation to yield a more informative plot.
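To make the point about multiplicative changes concrete, here is a short sketch with made-up population sizes that double at each step. The vector `population` and its values are assumptions for illustration, not data from the book.

```{r}
# Hypothetical population sizes that double at each step (multiplicative change)
population <- c(1e5, 2e5, 4e5, 8e5)

# On the original scale the step sizes keep growing
raw_steps <- diff(population)

# On the log scale every doubling is the same additive step, log10(2)
log_steps <- diff(log10(population))

raw_steps
log_steps
```

Equal spacing on the log scale is what lets a plot show small and large populations with comparable resolution.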

2 changes: 1 addition & 1 deletion dataviz/distributions.qmd
@@ -199,7 +199,7 @@ In data analysis we often divide observations into groups based on the values of

Stratification is common in data visualization because we are often interested in how the distributions of variables differ across different subgroups. We will see several examples throughout this part of the book, starting with the next section.

## Case study: describing student heights (continued) {#sec-student-height-cont}
## Case study continued {#sec-student-height-cont}

If we are convinced that the male height data is well approximated with a normal distribution we can report back to ET a very succinct summary: male heights follow a normal distribution with an average of `r round(m, 1)` inches and a SD of `r round(s,1)` inches. With this information, ET will have a good idea of what to expect when he meets our male students. However, to provide a complete picture we need to also provide a summary of the female heights.
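The idea of a two-number summary can be sketched as follows. This example uses simulated heights rather than the survey data, so the sample size, mean, and SD passed to `rnorm` are assumptions. The names `m` and `s` mirror those used in the text.

```{r}
# Simulated heights in inches (hypothetical; the text uses the male heights data)
set.seed(1)
x <- rnorm(500, mean = 69, sd = 3)

# The two-number summary: average and standard deviation
m <- mean(x)
s <- sd(x)

# Proportion of values within one SD of the average, computed from the data
observed <- mean(abs(x - m) <= s)

# The same proportion predicted by the normal approximation
predicted <- pnorm(1) - pnorm(-1)

round(c(average = m, sd = s, observed = observed, predicted = predicted), 3)
```

When the normal approximation holds, the observed and predicted proportions should be close, which is why the average and SD are enough to describe the distribution.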
