---
layout: page
title: 04 -- Even more `data.frame`
---
# Quick review
## `[` and `[[`
### Challenges
* Given a vector of random numbers (picked from a normal distribution) `vec <- rnorm(100)`, retrieve the 5th, 10th and 15th elements from it.
* R has a special vector called `letters` that stores the letters of the alphabet. Use it in combination with the function `rep()` to create a vector of 100 elements that contains only `a` and `e`.
* Starting with `vec <- 1:100`, and using the function `seq()`, create a vector that doesn't contain the multiples of 5 (i.e., 5, 10, ... 100).
# Removing rows
```{r, echo=FALSE, purl=FALSE}
## Removing rows
```
Typically, rows are not associated with names, so to remove them from the
`data.frame`, you refer to them by position and put a minus sign in front:
```{r, results='show', purl=FALSE}
surveys <- read.csv(file="data/surveys.csv")
surveys_missingRows <- surveys[-c(10, 50:70), ] # removing rows 10, and 50 to 70
```
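To see what negative indexing does without loading the survey file, here is a minimal sketch on a small made-up data frame. The object `toy` and its values are hypothetical and only serve as an illustration.

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(id = 1:6, wgt = c(40, 48, 29, 46, 36, 35))

# Negative indices drop rows: remove row 2, and rows 4 to 5
toy_missingRows <- toy[-c(2, 4:5), ]

nrow(toy)              # number of rows before
nrow(toy_missingRows)  # number of rows after
toy_missingRows$id     # the ids of the rows that remain
```

Note that the remaining rows keep their original row numbers as row names, so the result is not renumbered from 1.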
# Calculating statistics
```{r, echo=FALSE, purl=FALSE}
## Calculating statistics
```
Let's take a closer look at our data. For instance, we might want to know how
many animals we trapped in each plot, or how many of each species were caught.
To get a `vector` of all the species, we are going to use the `unique()`
function, which returns the unique values in a given vector:
```{r, purl=FALSE}
unique(surveys$species_id)
```
The function `table()` tells us how many of each species we have:
```{r, purl=FALSE}
table(surveys$species_id)
```
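The difference between the two functions is easier to see on a short vector. This is a small sketch on made-up species codes; `toy_species` and `toy_counts` are hypothetical names.

```r
# Hypothetical vector of species codes, used only for illustration
toy_species <- c("DM", "PF", "DM", "NL", "DM", "PF")

# unique() lists each value once, in order of first appearance
unique(toy_species)

# table() counts how many times each value occurs
toy_counts <- table(toy_species)
toy_counts

# A single count can be extracted by name
toy_counts[["DM"]]
```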
R has a lot of built-in statistical functions, like `mean()`, `median()`,
`max()`, `min()`. Let's start by calculating the average weight of all the
animals using the function `mean()`:
```{r, results='show', purl=FALSE}
mean(surveys$wgt)
```
Hmm, we just get `NA`. That's because we don't have the weight for every animal,
and missing data is recorded as `NA`. By default, most R functions that
summarize a vector (such as `mean()`, `median()`, `max()` and `min()`) return
`NA` when the vector contains missing data. It's a way to make sure that users
know they have missing data and make a conscious decision on how to deal with it.
When dealing with simple statistics like the mean, the easiest way to ignore
`NA` (the missing data) is to use `na.rm=TRUE` (`rm` stands for remove):
```{r, results='show', purl=FALSE}
mean(surveys$wgt, na.rm=TRUE)
```
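The same behaviour can be reproduced on a tiny vector. This is a sketch on made-up values (`toy_wgt` is a hypothetical name), showing that a single `NA` is enough to make the result `NA`.

```r
# Hypothetical weights with one missing value
toy_wgt <- c(40, NA, 48, 36)

mean(toy_wgt)                # NA: the missing value propagates
mean(toy_wgt, na.rm = TRUE)  # mean of the three recorded values
max(toy_wgt, na.rm = TRUE)   # max() accepts na.rm as well
```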
In some cases, it might be useful to remove the missing data from the
vector. For this purpose, R comes with the function `na.omit()`:
```{r, purl=FALSE}
wgt_noNA <- na.omit(surveys$wgt)
```
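As a quick sketch on made-up values (the names `toy_wgt` and `toy_noNA` are hypothetical), `na.omit()` returns a shorter vector and records which positions it dropped:

```r
# Hypothetical weights with two missing values
toy_wgt <- c(40, NA, 48, NA, 36)

toy_noNA <- na.omit(toy_wgt)

length(toy_wgt)    # length before removing the NA
length(toy_noNA)   # length after removing the NA

# The positions of the dropped values (2 and 4) are kept as an attribute
attr(toy_noNA, "na.action")

# No na.rm needed any more
mean(toy_noNA)
```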
For some applications, it's useful to keep all observations; for others, it
might be best to remove all observations that contain missing data. The function
`complete.cases()` returns a logical vector that is `TRUE` for each row without
any missing value. Using it as a row index keeps only those complete rows:
```{r, purl=FALSE}
surveys_complete <- surveys[complete.cases(surveys), ]
```
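Here is a minimal sketch of what `complete.cases()` returns before it is used for subsetting. The data frame `toy` is made up for the illustration.

```r
# Hypothetical data frame with missing values in both columns
toy <- data.frame(species_id = c("DM", "PF", NA, "DM"),
                  wgt        = c(40, NA, 29, 46))

# One logical value per row: TRUE only when no column is NA
toy_keep <- complete.cases(toy)
toy_keep

# Use the logical vector as a row index
toy_complete <- toy[toy_keep, ]
nrow(toy_complete)
```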
If you want to remove only the observations that are missing data for one
variable, you can use the function `is.na()`. For instance, to create a new
dataset that only contains individuals that have been weighed:
```{r, purl=FALSE}
surveys_with_weights <- surveys[!is.na(surveys$wgt), ]
```
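The contrast with `complete.cases()` shows up when a row is missing a value in another column. The sketch below uses a made-up data frame (`toy` is a hypothetical name).

```r
# Hypothetical data frame: row 2 lacks a species, row 3 lacks a weight
toy <- data.frame(species_id = c("DM", NA, "PF", "DM"),
                  wgt        = c(40, 48, NA, 46))

# TRUE where the weight is missing
is.na(toy$wgt)

# Keep rows where the weight is NOT missing
toy_with_weights <- toy[!is.na(toy$wgt), ]

# Row 2 is kept, because only its species_id is missing
nrow(toy_with_weights)

# complete.cases() would drop row 2 as well
nrow(toy[complete.cases(toy), ])
```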
### Challenge
1. To determine the number of elements found in a vector, we can use
the function `length()` (e.g., `length(surveys$wgt)`). Using `length()`, how
many animals have not had their weights recorded?
1. What is the median weight for the males?
1. What is the range (minimum and maximum) weight?
1. Bonus question: what is the standard error for the weight? (hints: there is
no built-in function to compute standard errors, and the function for the
square root is `sqrt()`).
```{r, echo=FALSE, purl=TRUE}
## 1. To determine the number of elements found in a vector, we can use
## the function `length()` (e.g., `length(surveys$wgt)`). Using `length()`, how
## many animals have not had their weights recorded?
## 2. What is the median weight for the males?
## 3. What is the range (minimum and maximum) weight?
## 4. Bonus question: what is the standard error for the weight? (hints: there is
## no built-in function to compute standard errors, and the function for the
## square root is `sqrt()`).
```
# Statistics across factor levels
```{r, echo=FALSE, purl=TRUE}
## Statistics across factor levels
```
What if we want the maximum weight for each species, or the average for each
plot?
R comes with convenient functions for this kind of operation: the functions in
the `apply` family.
For instance, `tapply()` allows us to repeat a function across each level of a
factor. The format is:
```{r, eval=FALSE, purl=FALSE}
tapply(column_to_do_the_calculations_on, factor_to_group_by, function_to_apply)
```
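As a sketch of that format on made-up data (the data frame `toy` and the object `toy_mean` are hypothetical), the first argument holds the values, the second defines the groups, and the third is the function applied to each group:

```r
# Hypothetical data frame, used only for illustration
toy <- data.frame(species_id = c("DM", "PF", "DM", "PF", "NL"),
                  wgt        = c(40, 8, 44, 10, 180))

# Mean weight within each species
toy_mean <- tapply(toy$wgt, toy$species_id, mean)
toy_mean

# The result is named after the groups, so values can be extracted by name
names(toy_mean)
toy_mean[["DM"]]

# Extra arguments are passed on to the function, here na.rm for mean()
tapply(toy$wgt, toy$species_id, mean, na.rm = TRUE)
```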
If we want to calculate the mean for each species (using the complete dataset):
```{r, purl=FALSE}
tapply(surveys_complete$wgt, surveys_complete$species_id, mean)
```
This produces some `NA` because R "remembers" all the species (the levels of the
factor) that were found in the original dataset, even if they have no weight
data associated with them in the current dataset. To remove the `NA` and make
things clearer, we can redefine the levels of the factor `species_id` before
calculating the means. Let's also create an object to store these values:
```{r, purl=FALSE}
surveys_complete$species_id <- factor(surveys_complete$species_id)
species_mean <- tapply(surveys_complete$wgt, surveys_complete$species_id, mean)
```
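The effect of leftover levels can be reproduced on a few made-up values. In this sketch all object names are hypothetical, and the species `NL` has no recorded weight:

```r
# Hypothetical factor of species codes and matching weights
toy_species <- factor(c("DM", "PF", "DM", "NL"))
toy_wgt     <- c(40, 8, 44, NA)

# Keep only the observations with a recorded weight
has_wgt  <- !is.na(toy_wgt)
sp_kept  <- toy_species[has_wgt]
wgt_kept <- toy_wgt[has_wgt]

# "NL" is still a level although no observation is left for it
levels(sp_kept)
mean_before <- tapply(wgt_kept, sp_kept, mean)
mean_before          # NA for "NL"

# Calling factor() again rebuilds the levels from the values present
sp_clean <- factor(sp_kept)
levels(sp_clean)
mean_after <- tapply(wgt_kept, sp_clean, mean)
mean_after           # no NA left
```

The base R function `droplevels()` gives the same result as calling `factor()` again on a factor.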
### Challenge
1. Create new objects to store: the standard deviation, the maximum and minimum
values for the weight of each species
1. How many species do you have these statistics for?
1. Create a new data frame (called `surveys_summary`) that contains as columns:
    * `species` the two-letter code for the species names
    * `mean_wgt` the mean weight for each species
    * `sd_wgt` the standard deviation for each species
    * `min_wgt` the minimum weight for each species
    * `max_wgt` the maximum weight for each species
```{r, echo=FALSE, purl=TRUE}
## 1. Create new objects to store: the standard deviation, the maximum and minimum
## values for the weight of each species
## 2. How many species do you have these statistics for?
## 3. Create a new data frame (called `surveys_summary`) that contains as columns:
## * `species` the 2 letter code for the species names
## * `mean_wgt` the mean weight for each species
## * `sd_wgt` the standard deviation for each species
## * `min_wgt` the minimum weight for each species
## * `max_wgt` the maximum weight for each species
```
**Answers**
```{r, purl=FALSE}
species_max <- tapply(surveys_complete$wgt, surveys_complete$species_id, max)
species_min <- tapply(surveys_complete$wgt, surveys_complete$species_id, min)
species_sd <- tapply(surveys_complete$wgt, surveys_complete$species_id, sd)
nlevels(surveys_complete$species_id) # or length(species_mean)
surveys_summary <- data.frame(species=levels(surveys_complete$species_id),
                              mean_wgt=species_mean,
                              sd_wgt=species_sd,
                              min_wgt=species_min,
                              max_wgt=species_max)
```