-
Notifications
You must be signed in to change notification settings - Fork 51
/
022_variables.Rmd
executable file
·149 lines (124 loc) · 4.77 KB
/
022_variables.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
<style>@import url(style.css);</style>
[Introduction to Data Analysis](index.html "Course index")
# 2.2. Variables and factors
In R, everything is an object. The list of objects in memory `ls()` is an object itself, as shown in the example below, which lists all objects in the current workspace and wipes them from memory.
```{r clear-workspace, eval=FALSE}
# List workspace objects.
ls()
# Erase entire workspace.
rm(list = ls())
```
Let's use some real, anonymized data from Autumn 2012. These are the grades from my three mathematics classes in first year. I have removed any student identification, so you have to trust me that these are the real grades (and yes, some grades range above 20/20!). The `downloader` package provides a handy command to download the data from the course repository, so we install and load it first.
```{r grades-import}
# Install downloader package.
if(!"downloader" %in% installed.packages()[, 1])
install.packages("downloader")
# Load package.
require(downloader)
# Target file.
file = "data/grades.2012.csv"
# Download the data if needed.
if (!file.exists(file)) {
# Locate the data.
url = "https://raw.github.com/briatte/ida/master/data/grades.2012.csv"
# Download the data.
download(url, file, mode = "wb")
}
# Load the data.
grades <- read.table(file, header = TRUE)
# Check result.
head(grades)
```
Let's now use a package to create fake names for the students. We again need to install and load the package first: in later sessions, we will use a standard code block to install-and-load packages.
```{r grades-with-random-names}
# Install randomNames package. Remember that R is case-sensitive.
if(!"randomNames" %in% installed.packages()[, 1])
install.packages("randomNames")
# Load package.
require(randomNames)
# How many rows of data do we have?
(count = nrow(grades))
# Let's generate that many random names.
names <- randomNames(count)
# Let's finally stick them to the matrix.
grades <- cbind(grades, names)
# Check result.
head(grades)
```
## Data frames
Let's show a final type of object: the data frame.
```{r df}
# Convert to data frame.
grades <- as.data.frame(grades)
# Check result.
head(grades)
# Check structure of a data frame.
str(grades)
```
Data frames are very malleable objects: we can rearrange the variables easily with commands like `melt` from the `reshape` package.
```{r melt, message=FALSE}
# Install and load reshape package.
if(!"reshape" %in% installed.packages()[, 1])
install.packages("reshape")
# Load package.
require(reshape)
# Reshape data from 'wide' (lots of columns) to 'long' (lots of rows).
grades <- melt(grades, id.vars = "names")
# Check result to show how each grade is now held on a separate row.
head(grades[order(grades$names), ])
```
Let's finish with a few plots.
```{r grades-plots, tidy = FALSE}
# Install and load ggplot2 package.
if(!"ggplot2" %in% installed.packages()[, 1])
install.packages("ggplot2")
# Load package.
require(ggplot2)
# Plot all three exams.
qplot(data = grades, x = value,
group = variable,
geom = "density")
# Add color and transparency.
qplot(data = grades, x = value,
color = variable,
fill = variable,
alpha = I(.3), geom = "density")
```
Now use the code on this page to:
1. Download [this data extract from the U.S. National Health Interview Survey 2005][nhis-data]. Use `RCurl` as shown above. Call the data `nhis`.
[nhis-data]: https://raw.github.com/briatte/ida/master/data/nhis.2005.csv
```{r nhis-data-import, include=FALSE}
file = "data/nhis.2005.csv"
# Download the data if needed.
if (!file.exists(file)) {
# Locate the data.
url = "https://raw.github.com/briatte/ida/master/data/nhis.2005.csv"
# Download the data.
download(url, file, mode = "wb")
}
# Load the data.
nhis <- read.table(file, header = TRUE)
# Check result.
head(nhis)
```
2. Create an object called `bmi` that corresponds to the Body Mass Index from the `height` and `weight` columns of the `nhis` object. Use the U.S. formula since the data use inches and pounds. Bind the `bmi` object to the `nhis` object.
```{r nhis-bmi, include=FALSE}
# Compute BMI vectore.
bmi <- 703 * nhis[ ,4] / nhis[ ,3]^2
# Bind to matrix.
nhis <- cbind(nhis, bmi)
```
3. Plot the results using `qplot(data = ..., x = ..., geom = "density")`.
```{r nhis-qplot-1, include=FALSE}
# The plot.
qplot(data = nhis, x = bmi, geom = "density")
```
4. Bonus question: explore how `ggplot2` works and produce plots with the `x` and `y` variables. Guess what they stand for.
```{r nhis-qplot-2, include=FALSE}
# Another plot. Variable x stands for gender.
qplot(data = nhis, x = bmi, group = x, color = factor(x), geom = "density")
# One more plot. Variable y stands for for racial group.
qplot(data = nhis, x = bmi, group = x, color = factor(x), geom = "density") +
facet_grid(y ~ .)
```
> __Next__: [Practice](023_practice.html).