---
title: 'Looking at the data'
description: 'Data are everywhere and everything. That''s why Statistics is also called Data Science. R offers great tools for looking at the data, behind the numbers.'
---
## Getting intimate with the data
```yaml
type: NormalExercise
key: 9c7624db08
lang: r
xp: 100
skills: 1
```
In this section, you will learn how to get familiar with your data by visualizing and summarizing it.
These are always the first things to do after the data is collected and preprocessed (cleaned). These kinds of simple explorations are an important step and they create valuable insight into the data.
`@instructions`
- Take a look at the examples - you will learn more about all of them in the next exercises.
- You can browse the different plots by clicking on "Next/Previous plot".
- Click 'Submit Answer' to move to the first exercise.
`@hint`
- Simply click 'Submit Answer' to move to the next exercise.
`@pre_exercise_code`
```{r}
# load data from web
students2014 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS-data.txt", sep="\t", header=TRUE)
# keep a couple background variables
students2014 <- students2014[,c("sukup","toita","ika","pituus","kenka","kone")]
# recode the missing values of kone as a factor level
students2014$kone <- addNA(students2014$kone)
# keep only rows without missing values
students2014 <- students2014[complete.cases(students2014),]
# integers to numeric
students2014$ika <- as.numeric(students2014$ika)
students2014$pituus <- as.numeric(students2014$pituus)
students2014$kenka <- as.numeric(students2014$kenka)
# bar plot of operating systems
barplot(summary(students2014$kone), main = "Barplot of OS", xlab = "Linux, Mac, Win or none")
# histogram of heights
hist(students2014$pituus, main ="Histogram of student heights", xlab = "height")
# horizontally placed box plot of ages
boxplot(students2014$ika, main = "Boxplot of student ages", xlab = "age", horizontal = TRUE)
```
`@sample_code`
```{r}
# students2014 is available
# Bar plot of operating systems
barplot(summary(students2014$kone), main = "Barplot of OS", xlab = "Linux, Mac, Win or none")
# Histogram of student heights
hist(students2014$pituus, main ="Histogram of student heights", xlab = "height")
# Box plot of student ages
boxplot(students2014$ika, main = "Boxplot of student ages", xlab = "age", horizontal = TRUE)
```
`@solution`
```{r}
# students2014 is available
# Bar plot of operating systems
barplot(summary(students2014$kone), main = "Barplot of OS", xlab = "Linux, Mac, Win or none")
# Histogram of student heights
hist(students2014$pituus, main ="Histogram of student heights", xlab = "height")
# Box plot of student ages
boxplot(students2014$ika, main = "Boxplot of student ages", xlab = "age", horizontal = TRUE)
```
`@sct`
```{r}
test_error()
success_msg("Ready? Let's go for it!")
```
---
## Summary() statistics
```yaml
type: NormalExercise
key: 0e42548148
lang: r
xp: 100
skills: 1,3
```
Some functions in R are very generic. They can take different objects as their first argument and they seem to magically do exactly what you'd hope they do in all situations.
One such example is the `summary()` function. Depending on the type of data object it receives as its first argument, `summary()` will produce different convenient results.
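As a minimal sketch of this behaviour, the toy vectors below (hypothetical values, not the course data) show how `summary()` adapts to the type of its first argument:

```r
# Hypothetical toy data, not taken from students2014
gender <- factor(c("F", "M", "F", "F", "M"))
age <- c(19, 22, 25, 31, 20)

# For a factor, summary() returns the count of each level
gender_summary <- summary(gender)
gender_summary

# For a numeric vector, summary() returns the minimum, the quartiles,
# the mean and the maximum
age_summary <- summary(age)
age_summary
```

The same function call therefore gives counts for qualitative data and location statistics for quantitative data.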
`@instructions`
- Execute the code that creates a summary of `sukup`. What does it show you?
- Execute the code that creates a summary of `ika`. What does it show you?
- Create a summary of `pituus` in `students2014`.
- Create a summary of the object `students2014`.
`@hint`
- Use the `$` sign to access `pituus` in `students2014`.
`@pre_exercise_code`
```{r}
# load data from web
students2014 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS-data.txt", sep="\t", header=TRUE)
# keep a couple background variables
students2014 <- students2014[,c("sukup","toita","ika","pituus","kenka","kone")]
# recode the missing values of kone as a factor level
students2014$kone <- addNA(students2014$kone)
# keep only rows without missing values
students2014 <- students2014[complete.cases(students2014),]
# integers to numeric
students2014$ika <- as.numeric(students2014$ika)
students2014$pituus <- as.numeric(students2014$pituus)
students2014$kenka <- as.numeric(students2014$kenka)
```
`@sample_code`
```{r}
# students2014 is available
# Create a summary of a factor
summary(students2014$sukup)
# Create a summary of a numeric
summary(students2014$ika)
# Create a summary of 'pituus' in 'students2014'
# Create a summary of the data.frame 'students2014'
```
`@solution`
```{r}
# Create a summary of a factor object
summary(students2014$sukup)
# Create a summary of a numeric object
summary(students2014$ika)
# Create a summary of 'pituus' in 'students2014'
summary(students2014$pituus)
# Create a summary of the data.frame 'students2014'
summary(students2014)
```
`@sct`
```{r}
test_output_contains("summary(students2014$pituus)", incorrect_msg = "Please call summary() on the numeric object students2014$pituus.")
test_output_contains("summary(students2014)", incorrect_msg="Please call summary() on the data.frame object `students2014`")
test_error()
success_msg("Well done!")
```
---
## Bar plots of qualitative variables
```yaml
type: NormalExercise
key: 457280046c
lang: r
xp: 100
skills: 1
```
Bar plots are useful for visualizing qualitative, discrete variables. A bar plot represents the counts of the unique values of a variable.
In this exercise you will draw a bar plot of `kone`. For each student, the value of `kone` is the operating system of the student's computer. For more information see [the metafile](http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS-meta.txt).
In base R graphics, the function `barplot()` can be used to draw a bar plot. The first argument (called `height`) should contain the counts you wish to visualize. The counts can be computed with `summary()` or more generally with `table()`.
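The idea can be sketched with a small made-up factor (the values below are hypothetical and only illustrate the mechanics): `table()` counts the levels, and the resulting counts are passed to `barplot()` as `height`.

```r
# Hypothetical toy data: operating systems of six imaginary students
os <- factor(c("Win", "Mac", "Win", "Linux", "Win", "Mac"))

# Count how many times each level occurs
os_table <- table(os)
os_table

# For a factor, summary() gives the same counts as table()
same_counts <- all(summary(os) == os_table)

# Draw one bar per level; barplot() invisibly returns the bar midpoints
bar_midpoints <- barplot(os_table, main = "Operating systems", ylab = "count")
```

Because the levels of a factor are sorted alphabetically by default, the bars appear in the order Linux, Mac, Win.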
`@instructions`
- Use `summary()` and `table()` on `kone` in the data frame `students2014`. Many of the values of `kone` are `<NA>`, not available. Why do you think this could be?
- Adjust the code on line 8. Instead of `NULL` (a special, empty data type), assign the counts of the different operating systems to the object `os_counts`. Use the `table()` function.
- Use the function `barplot()` to draw a bar plot. Use the counts of the different operating systems as the first argument.
`@hint`
- Replace `NULL` with the example code where `table()` is used on `kone`.
- Give the object `os_counts` as the first argument to `barplot()`.
`@pre_exercise_code`
```{r}
# load data from web
students2014 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS-data.txt", sep="\t", header=TRUE)
# keep a couple background variables
students2014 <- students2014[,c("sukup","toita","ika","pituus","kenka","kone")]
# recode the missing values of kone as a factor level
students2014$kone <- addNA(students2014$kone)
# keep only rows without missing values
students2014 <- students2014[complete.cases(students2014),]
# integers to numeric
students2014$ika <- as.numeric(students2014$ika)
students2014$pituus <- as.numeric(students2014$pituus)
students2014$kenka <- as.numeric(students2014$kenka)
```
`@sample_code`
```{r}
# students2014 is available
# 2 ways to look at the counts of different levels of 'kone'.
summary(students2014$kone)
table(students2014$kone)
# Create object os_counts that stores the counts
os_counts <- NULL
# Call barplot() to visualize the counts.
```
`@solution`
```{r}
# students2014 is available
# 2 ways to look at the counts of different levels of 'kone'
summary(students2014$kone)
table(students2014$kone)
# Create object os_counts that stores the counts
os_counts <- table(students2014$kone)
# Call barplot() to visualize the counts.
barplot(os_counts)
```
`@sct`
```{r}
test_object("os_counts", incorrect_msg = "Please use `table()` to create the object `os_counts`.")
test_function("barplot", args="height", not_called_msg = "Please call `barplot()` using the counts of the operating systems, created by `table()`.")
test_error()
success_msg("Excellent work!")
```
---
## Bar plots of continuous variables
```yaml
type: NormalExercise
key: 82217456fa
lang: r
xp: 100
skills: 1
```
When a variable is continuous by nature, it can take on so many different values that all of its observed values are likely to be unique. How do you draw a bar plot in this case?
Age can be interpreted as a continuous variable even though it is often measured quite discretely, in whole years rounded down. Even then, there are usually too many unique values to visualize them clearly with a bar plot.
A solution is to first categorize the variable.
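As a minimal sketch of categorizing, the made-up ages and break points below (both hypothetical) are turned into a factor with `cut()`. By default each interval excludes its left end point and includes its right end point.

```r
# Hypothetical toy data: ages of eight imaginary students
ages <- c(18, 19, 21, 21, 23, 27, 35, 42)

# Cut the ages into three intervals: (0,20], (20,30] and (30,100]
age_groups <- cut(ages, breaks = c(0, 20, 30, 100))
levels(age_groups)

# Count the observations in each interval
group_counts <- table(age_groups)
group_counts
```

The resulting table has one count per interval, so it can be visualized with a bar plot just like the counts of a qualitative variable.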
`@instructions`
- Use `table()` to look at the distribution of the students' ages.
- Use `cut()` to create a factor object where age is cut into intervals. Use the vector `c(0, 16, 20, 24, 30, 100)` as break points.
- Use `table()` on the age intervals to create object `age_counts` and print it.
- Visualize the contents of the table with a bar plot.
`@hint`
- Use the argument `breaks = c(0, 16, 20, 24, 30, 100)` inside `cut()` to create the factor object `age_cut`.
- Make sure to close the parenthesis on `cut()`.
`@pre_exercise_code`
```{r}
# load data from web
students2014 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS-data.txt", sep="\t", header=TRUE)
# keep a couple background variables
students2014 <- students2014[,c("sukup","toita","ika","pituus","kenka","kone")]
# recode the missing values of kone as a factor level
students2014$kone <- addNA(students2014$kone)
# keep only rows without missing values
students2014 <- students2014[complete.cases(students2014),]
# integers to numeric
students2014$ika <- as.numeric(students2014$ika)
students2014$pituus <- as.numeric(students2014$pituus)
students2014$kenka <- as.numeric(students2014$kenka)
```
`@sample_code`
```{r}
# students2014 is available
# Print out a table of the students' ages
table(students2014$ika)
# Split age to intervals
age_cut <- cut(students2014$ika, breaks = NULL)
# Create a table of the students' age categories and print it.
age_counts <- NULL
# Draw a barplot() of the counts of the age categories.
```
`@solution`
```{r}
# students2014 is available
# Print out a table of the students' ages
table(students2014$ika)
# Split age to intervals
age_cut <- cut(students2014$ika, breaks = c(0, 16, 20, 24, 30, 100))
# Create a table of the students' age categories and print it.
age_counts <- table(age_cut)
age_counts
# Draw a barplot() of the age categories
barplot(age_counts)
```
`@sct`
```{r}
test_object("age_cut", incorrect_msg = "Please use `breaks = c(0, 16, 20, 24, 30, 100)` to create the object age_cut.")
test_object("age_counts",incorrect_msg = "Please create the object `age_counts` using `table()`.")
test_output_contains("age_counts", incorrect_msg = "Please print out the contents of `age_counts`.")
test_function("barplot", args = "height", not_called_msg = "Please use `barplot()` to visualize the counts of the age categories.")
test_error()
success_msg("Nicely done. That was awesome!")
```
---
## Histograms
```yaml
type: NormalExercise
key: 86cee4bd71
lang: r
xp: 100
skills: 1
```
Splitting a continuous variable into intervals of equal width and then drawing a bar plot is so common that the procedure has its own name: the histogram.
In R, basic histograms can be drawn with the `hist()` function. The first argument of `hist()` should be a numeric vector of data values. The argument `breaks` can be used to approximately control the number of intervals.
The plotting functions in R also have many additional arguments which allow for more control over the plot. One simple argument is `col`, which can be the number(s) or the name(s) of the desired color(s). Some possible color names can be found <a target="_blank" href="http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf">here</a>.
Multiple arguments are always separated with a comma `,`.
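The connection between a histogram and a bar plot of a categorized variable can be sketched with made-up heights (hypothetical values, not the course data). The object returned by `hist()` stores the break points and the counts, and `cut()` with the same break points gives the same counts.

```r
# Hypothetical toy data: heights (cm) of ten imaginary students
heights <- c(158, 162, 165, 167, 170, 171, 173, 176, 180, 188)

# Draw a histogram; 'breaks = 5' is only a suggestion for the number of intervals
h <- hist(heights, breaks = 5, col = "lightblue",
          xlab = "height (cm)", main = "Toy histogram")

# The break points R actually chose and the count in each interval
h$breaks
h$counts

# The same counts by hand: categorize with cut(), then count with table().
# include.lowest = TRUE matches the default behaviour of hist().
manual_counts <- table(cut(heights, breaks = h$breaks, include.lowest = TRUE))
manual_counts
```

Since R picks "pretty" break points, the actual number of intervals can differ from the number requested with `breaks`.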
`@instructions`
- Adjust the sample code and draw the histogram again by setting the argument `breaks = 20`.
- Use the `col` argument to give the histogram a nice color of your choice. Use quotation marks to define the color name.
- Set the argument `xlab = "height"`.
- Set the argument `main = "My histogram"`.
`@hint`
- All the extra arguments in this exercise go inside the parentheses of `hist()`.
- Remember to separate the arguments of a function with a comma.
- Use quotation marks for the color names and the main and xlab titles.
`@pre_exercise_code`
```{r}
# load data from web
students2014 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS-data.txt", sep="\t", header=TRUE)
# keep a couple background variables
students2014 <- students2014[,c("sukup","toita","ika","pituus","kenka","kone")]
# recode the missing values of kone as a factor level
students2014$kone <- addNA(students2014$kone)
# keep only rows without missing values
students2014 <- students2014[complete.cases(students2014),]
# integers to numeric
students2014$ika <- as.numeric(students2014$ika)
students2014$pituus <- as.numeric(students2014$pituus)
students2014$kenka <- as.numeric(students2014$kenka)
# Draw a histogram of student heights
hist(students2014$pituus)
```
`@sample_code`
```{r}
# students2014 is available
# Draw a histogram of student heights
hist(students2014$pituus)
```
`@solution`
```{r}
# students2014 is available
# Draw a histogram of student heights
hist(students2014$pituus, breaks = 20, col = "blue", xlab = "height", main = "My histogram")
```
`@sct`
```{r}
# submission correctness tests
test_function("hist", args=c("x","breaks","main","xlab"))
test_error()
success_msg("Very good, young padawan, but you still have much to learn.")
```
---
## Box plots
```yaml
type: NormalExercise
key: c3d62c54f5
lang: r
xp: 100
skills: 1
```
One very good way to visualize the summary statistics of a numerical variable is by drawing box plots. A box plot visualizes the 25th, 50th and 75th percentiles (the box), the typical range (the whiskers) and the outliers (the dots) of a variable.
The whiskers extending from the box can be computed by several techniques. The default in R is to extend each whisker to the most extreme data point that is no more than 1.5*IQR away from the box, where IQR is the interquartile range defined as
`IQR = 75th percentile - 25th percentile`
Values outside the whiskers can be considered outliers: unusually distant observations. For more information on IQR, see <a target="_blank" href="https://en.wikipedia.org/wiki/Interquartile_range">Wikipedia</a>.
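The whisker rule can be sketched by hand with made-up ages (hypothetical values). Note that `boxplot()` computes the box from so-called hinges (see `fivenum()`), which can differ slightly from the percentiles given by `quantile()`, so this is an illustration of the rule rather than an exact copy of what `boxplot()` does internally.

```r
# Hypothetical toy data: ages of ten imaginary students, one of them much older
ages <- c(19, 20, 20, 21, 22, 23, 24, 25, 26, 45)

# 25th and 75th percentiles and the interquartile range
q <- quantile(ages, c(0.25, 0.75))
iqr <- IQR(ages)

# Limits that are 1.5 * IQR away from the box
lower_limit <- q[[1]] - 1.5 * iqr
upper_limit <- q[[2]] + 1.5 * iqr

# Whiskers end at the most extreme data points inside the limits
whisker_low <- min(ages[ages >= lower_limit])
whisker_high <- max(ages[ages <= upper_limit])

# Observations outside the limits are drawn as outliers
outliers <- ages[ages < lower_limit | ages > upper_limit]
outliers

# boxplot() invisibly returns its statistics, including the outliers in 'out'
b <- boxplot(ages, horizontal = TRUE, xlab = "age", main = "Toy box plot")
b$out
```

For this toy data the hand-computed whiskers and outlier agree with the values stored in `b$stats` and `b$out`.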
`@instructions`
- Create a summary of `students2014$ika`.
- Use `boxplot()` on `students2014$ika`.
- Adjust the plot: set the argument `horizontal = TRUE`.
- Adjust the plot: set the argument `xlab = "age"`.
- Give your plot the main title "Box plot of student ages" with the argument `main`.
`@hint`
- Give `students2014$ika` as the first argument of `boxplot()`.
- Remember to separate function arguments with a comma.
- Use quotation marks when adding text.
`@pre_exercise_code`
```{r}
# load data from web
students2014 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS-data.txt", sep="\t", header=TRUE)
# keep a couple background variables
students2014 <- students2014[,c("sukup","toita","ika","pituus","kenka","kone")]
# recode the missing values of kone as a factor level
students2014$kone <- addNA(students2014$kone)
# keep only rows without missing values
students2014 <- students2014[complete.cases(students2014),]
# integers to numeric
students2014$ika <- as.numeric(students2014$ika)
students2014$pituus <- as.numeric(students2014$pituus)
students2014$kenka <- as.numeric(students2014$kenka)
```
`@sample_code`
```{r}
# students2014 is available
# Create a summary of student ages
summary(students2014$ika)
# Visualize the distribution of student ages with a boxplot()
```
`@solution`
```{r}
# students2014 is available
# Create a summary of student ages
summary(students2014$ika)
# Visualize the distribution of student ages with a boxplot()
boxplot(students2014$ika, horizontal = TRUE, xlab = "age", main = "Box plot of student ages")
```
`@sct`
```{r}
test_function("boxplot", args=c("x","horizontal","xlab","main"))
test_error()
success_msg("That was nice work!")
```