Fixed a graph and the acknowledgement section.
rafalab committed Dec 17, 2023
1 parent 54866e5 commit 2d3aa56
Showing 39 changed files with 2,504 additions and 154 deletions.
23 changes: 9 additions & 14 deletions R/R-basics.qmd
@@ -40,10 +40,6 @@ dat |>
plot.margin = unit(c(1,1,1,1), "cm"))
```

```{=html}
<!--(Source:
[Ma’ayan Rosenzweigh/ABC News](https://abcnews.go.com/blogs/headlines/2012/12/us-gun-ownership-homicide-rate-higher-than-other-developed-countries/), Data from UNODC Homicide Statistics) -->
```
Or even worse, this version from [everytown.org](https://everytownresearch.org)[^r-basics-2]:

[^r-basics-2]: <https://everytownresearch.org>
@@ -67,9 +63,9 @@ dat |>
axis.text.x = element_blank(),
axis.ticks.length = unit(-0.4, "cm")) +
coord_flip()
rm(dat)
```

<!--(Source [everytown.org](https://everytownresearch.org))-->

But then you remember that the US is a large and diverse country with 50 very different states as well as the District of Columbia (DC).

@@ -115,10 +111,6 @@ coef_c <- -1
which stores the values for later use. We use `<-` to assign values to the variables. We can also assign values using `=` instead of `<-`, but we recommend against using `=` to avoid confusion.
:::{.callout-note}
When writing code in R, it's important to choose variable names that are both meaningful and avoid conflicts with existing functions or reserved words in the language. The primary reason we used `coef_a`, `coef_b` and `coef_c` is to avoid a conflict with the `c()` function in R, described in @sec-creating-vectors. If you were to name a variable `c`, you would not receive an error or warning, but the conflict can lead to unexpected behavior and bugs that are hard to diagnose.
:::
Copy and paste the code above into your console to define the three variables. Note that R does not print anything when we make this assignment. This means the objects were defined successfully. Had you made a mistake, you would have received an error message.
To see the value stored in a variable, we simply ask R to evaluate `coef_a` and it shows the stored value:
@@ -140,6 +132,7 @@ We use the term *object* to describe stuff that is stored in R. Variables are ex
As we define objects in the console, we are actually changing the *workspace*. You can see all the variables saved in your workspace by typing:
```{r}
#| eval: false
ls()
```
@@ -160,7 +153,7 @@ Now since these values are saved in variables, to obtain a solution to our equat
Once you define variables, the data analysis process can usually be described as a series of *functions* applied to the data. R includes several predefined functions and most of the analysis pipelines we construct make extensive use of these.
We already used or discussed the `install.packages`, `library`, `c`, and `ls` functions. We also used the function `sqrt` to solve the quadratic equation above. There are many more prebuilt functions and even more can be added through packages. These functions do not appear in the workspace because you did not define them, but they are available for immediate use.
We already used or discussed the `install.packages`, `library`, and `ls` functions. We also used the function `sqrt` to solve the quadratic equation above. There are many more prebuilt functions and even more can be added through packages. These functions do not appear in the workspace because you did not define them, but they are available for immediate use.
In general, we need to use parentheses to evaluate a function. If you type `ls`, the function is not evaluated and instead R shows you the code that defines the function. If you type `ls()` the function is evaluated and, as seen above, we see objects in the workspace.
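
The difference between naming a function and calling it can be sketched as follows. This is a minimal illustration, not part of the commit; the value assigned to `coef_a` and the helper names `ls_is_function` and `objs` are assumptions made only so that the workspace is not empty.

```r
# Define one object so the workspace is not empty (value chosen for illustration).
coef_a <- 1

# Without parentheses, `ls` refers to the function object itself.
ls_is_function <- is.function(ls)

# With parentheses, the function is evaluated and returns the names
# of the objects in the current workspace as a character vector.
objs <- ls()
print(objs)
```

Typing `ls` alone at the console prints the function's source code instead of the object names.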
@@ -274,13 +267,15 @@ Inf + 1
### Variable names
We have used the letters `a`, `b`, and `c` as variable names, but variable names can be almost anything. Some basic rules in R are that variable names have to start with a letter, can't contain spaces, and should not be variables that are predefined in R. For example, don't name one of your variables `install.packages` by typing something like `install.packages <- 2`.
We used `coef_a`, `coef_b`, and `coef_c` as variable names, but variable names can be almost anything. When writing code in R, it's important to choose variable names that are both meaningful and avoid conflicts with existing functions or reserved words in the language. For example, we did not use `a`, `b` and `c` to avoid a conflict with the `c()` function in R, described in @sec-creating-vectors. If you were to name a variable `c`, you would not receive an error or warning, but the conflict can lead to unexpected behavior and bugs that are hard to diagnose.
Some basic rules in R are that variable names have to start with a letter, can't contain spaces, and should not be variables that are predefined in R, such as `c`.
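
To see why a variable named `c` is allowed but risky, consider the following sketch. It is a hypothetical illustration, not code from the book; the names `name_is_function_while_masked` and `name_is_function_after_rm` are introduced here only to record what the name `c` refers to at each step.

```r
# Hypothetical example: define a variable named `c`, which R allows silently.
c <- -1

# Calling c(...) still works, because R searches for a function
# when a name is used in a call.
x <- c(1, 2, 3)

# But the bare name now refers to the number rather than the function,
# which matters whenever `c` is used as a value.
name_is_function_while_masked <- is.function(c)

# Remove the variable to restore the usual meaning of the name.
rm(c)
name_is_function_after_rm <- is.function(c)
```

Because no error or warning is raised at any point, the double meaning is easy to miss, which is the reason for preferring names such as `coef_c`.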
A nice convention to follow is to use meaningful words that describe what is stored, use only lower case, and use underscores as a substitute for spaces. For the quadratic equations, we could use something like this:
A nice convention to follow is to use meaningful words that describe what is stored, use only lower case, and use underscores as a substitute for spaces. For the quadratic equations, we could use something like this for the two roots:
```{r}
solution_1 <- (-coef_b + sqrt(coef_b^2 - 4*coef_a*coef_c))/(2*coef_a)
solution_2 <- (-coef_b - sqrt(coef_b^2 - 4*coef_a*coef_c))/(2*coef_a)
r_1 <- (-coef_b + sqrt(coef_b^2 - 4*coef_a*coef_c))/(2*coef_a)
r_2 <- (-coef_b - sqrt(coef_b^2 - 4*coef_a*coef_c))/(2*coef_a)
```
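
One way to check the two roots is to substitute them back into the quadratic. The sketch below assumes `coef_a <- 1` and `coef_b <- 1`, which are not shown in this diff; only `coef_c <- -1` appears above. The names `residual_1` and `residual_2` are introduced here for illustration.

```r
# coef_c = -1 appears in the text; coef_a and coef_b are assumed values.
coef_a <- 1
coef_b <- 1
coef_c <- -1

r_1 <- (-coef_b + sqrt(coef_b^2 - 4*coef_a*coef_c))/(2*coef_a)
r_2 <- (-coef_b - sqrt(coef_b^2 - 4*coef_a*coef_c))/(2*coef_a)

# Substitute each root back into a*x^2 + b*x + c;
# both residuals should be zero up to rounding error.
residual_1 <- coef_a*r_1^2 + coef_b*r_1 + coef_c
residual_2 <- coef_a*r_2^2 + coef_b*r_2 + coef_c
print(c(residual_1, residual_2))
```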
For more advice, we highly recommend studying Hadley Wickham's style guide[^r-basics-3].
25 changes: 4 additions & 21 deletions R/getting-started.qmd
@@ -56,11 +56,7 @@ RStudio will be our launching pad for data science projects. It not only provide

### The panes

When you start RStudio for the first time, you will see three panes. The left pane shows the R console. On the right, the top pane includes tabs such as *Environment* and *History*, while the bottom pane shows five tabs: *File*, *Plots*, *Packages*, *Help*, and *Viewer* (these tabs may change in new versions). You can click on each tab to move across the different features.

![](../productivity/img/windows-screenshots/VirtualBox_Windows-7-Enterprise_22_03_2018_16_21_16.png){width="70%" fig-align="center"}

To start a new script, you can click on File, then New File, then R Script.
When you start RStudio for the first time, you will see three panes. The left pane shows the R console. On the right, the top pane includes tabs such as *Environment* and *History*, while the bottom pane shows five tabs: *File*, *Plots*, *Packages*, *Help*, and *Viewer* (these tabs may change in new versions). You can click on each tab to move across the different features. For example, to start a new script, you can click on File, then New File, then R Script.

![](../productivity/img/windows-screenshots/VirtualBox_Windows-7-Enterprise_22_03_2018_16_21_42.png){width="70%" fig-align="center"}

@@ -86,7 +82,7 @@ There are many editors specifically made for coding. These are useful because co

Let's start by opening a new script as we did before. A next step is to give the script a name. We can do this through the editor by saving the current new unnamed script. To do this, click on the save icon or use the key binding Ctrl+S on Windows and command+S on the Mac.\

When you ask for the document to be saved for the first time, RStudio will prompt you for a name. A good convention is to use a descriptive name, with lower case letters, no spaces, only hyphens to separate words, and then followed by the suffix *.R*. We will call this script *my-first-script.R*.
When you ask for the document to be saved for the first time, RStudio will prompt you for a name. A good convention is to use a descriptive name, with lower case letters, no spaces, only hyphens to separate words, and then followed by the suffix `.R`. We will call this script `my-first-script.R`.

![](../productivity/img/windows-screenshots/VirtualBox_Windows-7-Enterprise_22_03_2018_16_27_44.png){width="70%" fig-align="center"}

@@ -113,7 +109,7 @@ To change the global options you click on *Tools* then *Global Options...*.

As an example we show how to make a change that we **highly recommend**. This is to change the *Save workspace to .RData on exit* to *Never* and uncheck the *Restore .RData into workspace at start*. By default, when you exit R saves all the objects you have created into a file called .RData. This is done so that when you restart the session in the same folder, it will load these objects. We find that this causes confusion especially when we share code with colleagues and assume they have this .RData file. To change these options, make your *General* settings look like this:

![](../productivity/img/windows-screenshots/VirtualBox_Windows-7-Enterprise_22_03_2018_16_56_08.png){width="70%" fig-align="center"}
![](../productivity/img/windows-screenshots/VirtualBox_Windows-7-Enterprise_22_03_2018_16_56_08.png){width="40%" fig-align="center"}

## Installing R packages {#sec-installing-r-packages}

@@ -137,20 +133,7 @@ We can install more than one package at once by feeding a character vector to th
install.packages(c("tidyverse", "dslabs"))
```

One advantage of using RStudio is that it auto-completes package names once you start typing, which is helpful when you do not remember the exact spelling of the package:

![](../productivity/img/windows-screenshots/VirtualBox_Windows-7-Enterprise_22_03_2018_16_24_18.png){width="70%" fig-align="center"}


Once you select your package, we recommend selecting all the defaults:

:::{layout-ncol=2}

![](../productivity/img/windows-screenshots/VirtualBox_Windows-7-Enterprise_22_03_2018_16_24_36.png){width=50%}

![](../productivity/img/windows-screenshots/VirtualBox_Windows-7-Enterprise_22_03_2018_16_26_24.png){width=50%}

:::
One advantage of using RStudio is that it auto-completes package names once you start typing, which is helpful when you do not remember the exact spelling of the package.

Note that installing **tidyverse** actually installs several packages. This commonly occurs when a package has *dependencies*, or uses functions from other packages. When you load a package using `library`, you also load its dependencies.
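
Since installing a package also pulls in its dependencies, it can be useful to check what is already present before installing. The following is a minimal sketch; `missing_packages` is a hypothetical helper written for this illustration and is not part of R or of the book.

```r
# Hypothetical helper: return the subset of `pkgs` that is not yet installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Which of the two packages used above still need to be installed?
to_install <- missing_packages(c("tidyverse", "dslabs"))
print(to_install)
```

If `to_install` is not empty, passing it to `install.packages(to_install)` would install only the missing packages, together with their dependencies.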

4 changes: 2 additions & 2 deletions R/intro-to-R.qmd
@@ -1,7 +1,7 @@
# R {.unnumbered}

In this book, we will be using the R software environment for all our analysis. You will learn R and data analysis techniques simultaneously. To follow along you will therefore need access to R. We also recommend the use of an *integrated development environment* (IDE), such as RStudio, to save your work. Note that it is common for a course or workshop to offer access to an R environment and an IDE through your web browser, as done by RStudio cloud[^r-basics-1]. If you have access to such a resource, you don't need to install R and RStudio. However, if you intend on becoming an advanced data analyst, we highly recommend installing these tools on your computer. Both R and RStudio are free and available online.
In this book, we will be using the R software environment for all our analysis. You will learn R and data analysis techniques simultaneously. To follow along you will therefore need access to R. We also recommend the use of an *integrated development environment* (IDE), such as RStudio, to save your work. Note that it is common for a course or workshop to offer access to an R environment and an IDE through your web browser, as done by RStudio cloud[^r-basics-0]. If you have access to such a resource, you don't need to install R and RStudio. However, if you intend on becoming an advanced data analyst, we highly recommend installing these tools on your computer. Both R and RStudio are free and available online.

[^r-basics-1]: <https://rstudio.cloud>
[^r-basics-0]: <https://rstudio.cloud>


4 changes: 0 additions & 4 deletions R/tidyverse.qmd
@@ -580,10 +580,6 @@ However, this can become cumbersome, especially within the tidyverse approach. T
between(x, a, b)
```
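
As a rough illustration of what `between` computes, the sketch below compares it with the explicit comparison written in base R. The values of `x`, `a`, and `b` are made up for this example, and the comparison with **dplyr** only runs if that package is installed.

```r
# Illustrative values, not taken from the book.
x <- c(1, 5, 10, 15)
a <- 5
b <- 10

# Base R version of the range check: both ends are inclusive.
in_range <- x >= a & x <= b
print(in_range)

# If dplyr is installed, between() should give the same answer.
if (requireNamespace("dplyr", quietly = TRUE)) {
  print(identical(dplyr::between(x, a, b), in_range))
}
```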

:::{.callout-note}
You are ready to do exercises 20-24.
:::

## Exercises

1\. Examine the built-in dataset `co2`. Which of the following is true:
11 changes: 11 additions & 0 deletions _quarto.yml
@@ -70,6 +70,17 @@ format:
code-link: true
author-meta: Rafael A. Irizarry
callout-appearance: simple
pdf:
documentclass: krantz
include-in-header: preamble.tex
header-includes: |
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{subfigure}
\usepackage{makeidx}
\usepackage{multicol}
knitr:
opts_chunk:
4 changes: 2 additions & 2 deletions dataviz/dataviz-principles.qmd
@@ -200,8 +200,8 @@ Not surprisingly, __ggplot2__ defaults to using area rather than
radius. Of course, in this case, we really should not be using area at all since we can use position and length:

```{r barplot-better-than-area, out.width="70%", echo=FALSE}
gdp_data |>
filter(y == "Area") |>
data.frame(Country = c("United States", "China", "Japan", "Germany", "France"), GDP = gdp) |>
mutate(Country = reorder(Country, GDP)) |>
ggplot(aes(Country, GDP)) +
geom_col(width = 0.5) +
ylab("GDP in trillions of US dollars")
Binary file added docs/Introduction-to-Data-Science.pdf
Binary file not shown.