forked from PsyTeachR/ads-v1
-
Notifications
You must be signed in to change notification settings - Fork 5
/
Copy path04-data.qmd
658 lines (469 loc) · 29 KB
/
04-data.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
# Data Import {#sec-data}
## Intended Learning Outcomes {#sec-ilo-data .unnumbered}
* Be able to inspect data
* Be able to import data from a range of sources
* Be able to identify and handle common problems with data import
## Walkthrough video {#sec-walkthrough-data .unnumbered}
There is a walkthrough video of this chapter available via [Echo360.](https://echo360.org.uk/media/52d2249e-a737-42b4-bf55-267e39fc05c5/public) Please note that there may have been minor edits to the book since the video was recorded. Where there are differences, the book should always take precedence.
## Set-up {#sec-setup-data}
Create a new project for the work we'll do in this chapter named `r path("04-data")`. Then, create and save a new R Markdown document named `data.Rmd`, get rid of the default template text, and load the packages in the set-up code chunk. You should have all of these packages installed already, but if you get the message `Error in library(x) : there is no package called ‘x’`, please refer to @sec-install-package.
```{r setup-data, message=FALSE, verbatim="r setup, include=FALSE"}
library(tidyverse) # includes readr & tibble
library(rio) # for almost any data import/export
library(haven) # for SPSS, Stata,and SAS files
library(readxl) # for Excel files
library(googlesheets4) # for Google Sheets
```
We'd recommend making a new code chunk for each different activity, and using the white space to make notes on any errors you make, things you find interesting, or questions you'd like to ask the course team.
Download the [Data import cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-import.pdf).
## Built-in data {#sec-builtin}
You'll likely want to import you own data to work with, however, Base R also comes with built-in datasets and these can be very useful for learning new functions and packages. Additionally, some packages, like <pkg>tidyr</pkg>, also contain data. The `data()` function lists the datasets available.
```{r built-in-data, eval = FALSE}
# list datasets built in to base R
data()
# lists datasets in a specific package
data(package = "tidyr")
```
Type the name of a dataset into the `r glossary("console")` to see the data. For example, type `?table1` into the console to see the dataset description for `table1`, which is a dataset included with <pkg>tidyr</pkg>.
```{r, eval = FALSE}
?table1
```
You can also use the `data()` function to load a dataset into your `r glossary("global environment")`.
```{r}
# loads table1 into the environment
data("table1")
```
## Looking at data
Now that you've loaded some data, look the upper right hand window of RStudio, under the Environment tab. You will see the object `table1` listed, along with the number of observations (rows) and variables (columns). This is your first check that everything went OK.
**Always, always, always, look at your data once you've created or loaded a table**. Also look at it after each step that transforms your table. There are three main ways to look at your table: `View()`, `print()`, `tibble::glimpse()`.
### View()
A familiar way to look at the table is given by `View()` (uppercase 'V'), which opens up a data table in the console pane using a viewer that looks a bit like Excel. This command can be useful in the console, but don't ever put this one in a script because it will create an annoying pop-up window when the user runs it. You can also click on an object in the `r glossary("panes", "environment pane")` to open it in the same interface. You can close the tab when you're done looking at it; it won't remove the object.
```{r, eval = FALSE}
View(table1)
```
### print()
The `print()` method can be run explicitly, but is more commonly called by just typing the variable name on a blank line. The default is not to print the entire table, but just the first 10 rows.
Let's look at the `table1` table that we loaded above. Depending on how wide your screen is, you might need to click on an arrow at the right of the table to see the last column.
```{r print, eval = FALSE}
# call print explicitly
print(table1)
# more common method of just calling object name
table1
```
```{r print2, echo = FALSE}
# more common method of just calling object name
table1
```
### glimpse()
The function `tibble::glimpse()` gives a sideways version of the table. This is useful if the table is very wide and you can't easily see all of the columns. It also tells you the `r glossary("data type")` of each column in angled brackets after each column name.
```{r sw_glimpse}
glimpse(table1)
```
### summary() {#sec-summary-function}
You can get a quick summary of a dataset with the `summary()` function, which can be useful for spotting things like if the minimum or maximum values are clearly wrong, or if R thinks that a `r glossary("nominal")` variable is `r glossary("numeric")`. For example, if you had labelled gender as 1, 2, and 3 rather than male, female, and non-binary, `summary()` would calculate a mean and median even though this isn't appropriate for the data. This can be a useful flag that you need to take further steps to correct your data.
Note that because `population` is a very, very large number, R will use [scientific notation](https://courses.lumenlearning.com/waymakerintermediatealgebra/chapter/read-writing-scientific-notation-2/).
```{r}
summary(table1)
```
## Importing data {#sec-import_data}
Built-in data are nice for examples, but you're probably more interested in your own data. There are many different types of files that you might work with when doing data analysis. These different file types are usually distinguished by the three-letter `r glossary("extension")` following a period at the end of the file name (e.g., `.xls`).
Download this [directory of data files](data/data.zip), unzip the folder, and save the `data` directory in the `04-data` project directory.
```{r importing-data-setup, include=FALSE}
demo <- tibble(
character = LETTERS[1:6],
factor = factor(rep(c("high", "low", "med"), 2),
levels = c("low", "med", "high")),
integer = 1:6,
double = 1.5:6.5,
logical = c(T, T, F, F, NA, T),
date = lubridate::today() - 0:5
)
export(demo, "data/demo.csv")
export(demo, "data/demo.tsv")
export(demo, "data/demo.xlsx", overwrite = TRUE)
export(demo, "data/demo.sav")
export(demo, "data/demo.json")
```
### rio::import()
The type of data files you have to work with will likely depend on the software that you typically use in your workflow. The <pkg>rio</pkg> package has very straightforward functions for reading and saving data in most common formats: `rio::import()` and `rio::export()`.
```{r}
demo_tsv <- import("data/demo.tsv") # tab-separated values
demo_csv <- import("data/demo.csv") # comma-separated values
demo_xls <- import("data/demo.xlsx") # Excel format
demo_sav <- import("data/demo.sav") # SPSS format
```
### File type specific import
However, it is also useful to know the specific functions that are used to import different file types because it is easier to discover features to deal with complicated cases, such as when you need to skip rows, rename columns, or choose which Excel sheet to use.
```{r, message=FALSE}
demo_tsv <- readr::read_tsv("data/demo.tsv")
demo_csv <- readr::read_csv("data/demo.csv")
demo_xls <- readxl::read_excel("data/demo.xlsx")
demo_sav <- haven::read_sav("data/demo.sav")
```
::: {.callout-note .try}
Look at the help for each function above and read through the Arguments section to see how you can customise import.
:::
If you keep data in Google Sheets, you can access it directly from R using `<pkg>googlesheets4", "https://googlesheets4.tidyverse.org/")`. The code below imports data from a [public sheet](https://docs.google.com/spreadsheets/d/16dkq0YL0J7fyAwT1pdgj1bNNrheckAU_2-DKuuM6aGI){target="_blank"}. You can set the `ss` argument to the entire `r glossary("URL")` for the target sheet, or just the section after "https://docs.google.com/spreadsheets/d/".
```{r, message = FALSE}
gs4_deauth() # skip authorisation for public data
demo_gs4 <- googlesheets4::read_sheet(
ss = "16dkq0YL0J7fyAwT1pdgj1bNNrheckAU_2-DKuuM6aGI"
)
```
### Column data types {#sec-col_types}
Use `glimpse()` to see how these different functions imported the data with slightly different data types. This is because the different file types store data slightly differently. For example, SPSS stores factors as numbers, so the `factor` column contains the values 1, 2, 3 rather than `low`, `med`, `high`. It also stores `r glossary("logical")` values as 0 and 1 instead or TRUE and FALSE.
```{r}
glimpse(demo_csv)
```
```{r}
glimpse(demo_xls)
```
```{r}
glimpse(demo_sav)
```
```{r}
glimpse(demo_gs4)
```
The <pkg>readr</pkg> functions display a message when you import data explaining what `r glossary("data type")` each column is.
```{r}
demo <- readr::read_csv("data/demo.csv")
```
The "Column specification" tells you which `r glossary("data type")` each column is. You can review data types in @sec-data-types. Options are:
* `chr`: `r glossary("character")`
* `dbl`: `r glossary("double")`
* `lgl`: `r glossary("logical")`
* `int`: `r glossary("integer")`
* `date`: date
* `dttm`: date/time
`read_csv()` will guess what type of data each variable is and normally it is pretty good at this. However, if it makes a mistake, such as reading the "date" column as a `r glossary("character")`, you can manually set the column data types.
First, run `spec()` on the dataset which will give you the full column specification that you can copy and paste:
```{r}
spec(demo)
```
Then, we create an object using the code we just copied that lists the correct column types. Factor columns will always import as character data types, so you have to set their data type manually with `col_factor()` and set the order of levels with the `levels` argument. Otherwise, the order defaults to the order they appear in the dataset. For our `demo` dataset, we will tell R that the `factor` variable is a factor by using `col_factor()` and we can also specify the order of the levels so that they don't just appear alphabetically. Additionally, we can also specify exactly what format our `date` variable is in using `%Y-%m-%d`.
We then save this column specification to an object, and then add this to the `col_types` argument when we call `read_csv()`.
```{r}
corrected_cols <- cols(
character = col_character(),
factor = col_factor(levels = c("low", "med", "high")),
integer = col_integer(),
double = col_double(),
logical = col_logical(),
date = col_date(format = "%Y-%m-%d")
)
demo <- readr::read_csv("data/demo.csv", col_types = corrected_cols)
```
::: {.callout-note}
For dates, you might need to set the format your dates are in. See `?strptime` for a list of the codes used to represent different date formats. For example, `"%d-%b-%y"` means that the dates are formatted like `31-Jan-21`.
:::
The functions from <pkg>readxl</pkg> for loading `.xlsx` sheets have a different, more limited way to specify the column types. You will have to convert factor columns and dates using `mutate()`, which you'll learn about in @sec-wrangle, so most people let `read_excel()` guess data types and don't set the `col_types` argument.
For SPSS data, whilst `rio::import()` will just read the numeric values of factors and not their labels, the function `read_sav()` from <pkg>haven</pkg> reads both. However, you have to convert factors from a haven-specific "labelled double" to a factor (we have no idea why haven doesn't do this for you).
```{r}
demo_sav$factor <- haven::as_factor(demo_sav$factor)
glimpse(demo_sav)
```
::: {.callout-note}
The way you specify column types for <pkg>googlesheets4</pkg> is a little different from <pkg>readr</pkg>, although you can also use the shortcodes described in the help for `read_sheet()` with <pkg>readr</pkg> functions. There is currently no column specification for factors.
:::
## Creating data
If you need to create a small data table from scratch in R, use the `tibble::tibble()` function, and type the data right in. The `tibble` package is part of the `r glossary("tidyverse")` package that we loaded at the start of this chapter.
Let's create a small table with the names of three [Avatar](https://en.wikipedia.org/wiki/Avatar:_The_Last_Airbender) characters and their bending type. The `tibble()` function takes `r glossary("argument", "arguments")` with the names that you want your columns to have. The values are `r glossary("vector", "vectors")` that list the column values in order.
If you don't know the value for one of the cells, you can enter `r glossary("NA")`, which we have to do for Sokka because he doesn't have any bending ability. If all the values in the column are the same, you can just enter one value and it will be copied for each row.
```{r tibble-define}
avatar <- tibble(
name = c("Katara", "Toph", "Sokka"),
bends = c("water", "earth", NA),
friendly = TRUE
)
# print it
avatar
```
You can also use the `tibble::tribble()` function to create a table by row, rather than by column. You start by listing the column names, each preceded by a tilde (`~`), then you list the values for each column, row by row, separated by commas (don't forget a comma at the end of each row).
```{r tribble-define}
avatar_by_row <- tribble(
~name, ~bends, ~friendly,
"Katara", "water", TRUE,
"Toph", "earth", TRUE,
"Sokka", NA, TRUE
)
```
::: {.callout-note}
You don't have to line up the columns in a tribble, but it can make it easier to spot errors.
:::
You may not need to do this very often if you are primarily working with data that you import from spreadsheets, but it is useful to know how to do it anyway.
## Writing data
If you have data that you want to save, use `rio::export()`, as follows.
```{r write_csv, eval = FALSE}
export(avatar, "data/avatar.csv")
```
This will save the data in CSV format to your working directory.
Writing to Google Sheets is a little trickier (if you never use Google Sheets feel free to skip this section). Even if a Google Sheet is publicly editable, you can't add data to it without authorising your account.
You can authorise interactively using the following code (and your own email), which will prompt you to authorise "Tidyverse API Packages" the first time you do this. If you don't tick the checkbox authorising it to "See, edit, create, and delete all your Google Sheets spreadsheets", the next steps will fail.
```{r, eval = FALSE}
# authorise your account
# this only needs to be done once per script
gs4_auth(email = "[email protected]")
# create a new sheet
sheet_id <- gs4_create(name = "demo-file",
sheets = "letters")
# define the data table to save
letter_data <- tibble(
character = LETTERS[1:5],
integer = 1:5,
double = c(1.1, 2.2, 3.3, 4.4, 5.5),
logical = c(T, F, T, F, T),
date = lubridate::today()
)
write_sheet(data = letter_data,
ss = sheet_id,
sheet = "letters")
## append some data
new_data <- tibble(
character = "F",
integer = 6L,
double = 6.6,
logical = FALSE,
date = lubridate::today()
)
sheet_append(data = new_data,
ss = sheet_id,
sheet = "letters")
# read the data
demo <- read_sheet(ss = sheet_id, sheet = "letters")
```
::: {.callout-note .try}
* Create a new table called `family` with the first name, last name, and age of your family members (biological, adopted, or chosen).
* Save it to a CSV file called "family.csv".
* Clear the object from your environment by restarting R or with the code `remove(family)`.
* Load the data back in and view it.
```{r, eval = FALSE, webex.hide="Solution"}
# create the table
family <- tribble(
~first_name, ~last_name, ~age,
"Lisa", "DeBruine", 45,
"Robbie", "Jones", 14
)
# save the data to CSV
export(family, "data/family.csv")
# remove the object from the environment
remove(family)
# load the data
family <- import("data/family.csv")
```
:::
We'll be working with `r glossary("tabular data")` a lot in this class, but tabular data is made up of `r glossary("vector", "vectors")`, which groups together data with the same basic `r glossary("data type")`. @sec-data-types explains some of this terminology to help you understand the functions we'll be learning to process and analyse data.
## Troubleshooting
What if you import some data and it guesses the wrong column type? The most common reason is that a numeric column has some non-numbers in it somewhere. Maybe someone wrote a note in an otherwise numeric column. Columns have to be all one data type, so if there are any characters, the whole column is converted to character strings, and numbers like `1.2` get represented as `"1.2"`, which will cause very weird errors like `"100" < "9" == TRUE`. You can catch this by using `glimpse()` to check your data.
The data directory you downloaded contains a file called "mess.csv". Let's try loading this dataset.
```{r}
mess <- rio::import("data/mess.csv")
```
When importing goes wrong, it's often easier to fix it using the specific importing function for that file type (e.g., use `read_csv()` rather than `rio::import()`. This is because the problems tend to be specific to the file format and you can look up the help for these functions more easily. For CSV files, the import function is `readr::read_csv`.
```{r}
# lazy = FALSE loads the data right away so you can see error messages
# this default changed in late 2021 and might change back soon
mess <- read_csv("data/mess.csv", lazy = FALSE)
```
You'll get a warning about parsing issues and the data table is just a single column. View the file `data/mess.csv` by clicking on it in the File pane, and choosing "View File". Here are the first 10 lines. What went wrong?
`r head(mess)`
First, the file starts with a note: "This is my messy dataset" and then a blank line. The first line of data should be the column headings, so we want to skip the first two lines. You can do this with the argument `skip` in `read_csv()`.
```{r mess}
mess <- read_csv("data/mess.csv",
skip = 2,
lazy = FALSE)
glimpse(mess)
```
OK, that's a little better, but this table is still a serious mess in several ways:
* `junk` is a column that we don't need
* `order` should be an integer column
* `good` should be a logical column
* `good` uses all kinds of different ways to record TRUE and FALSE values
* `min_max` contains two pieces of numeric information, but is a character column
* `date` should be a date column
We'll learn how to deal with this mess in @sec-tidy and @sec-wrangle, but we can fix a few things by setting the `col_types` argument in `read_csv()` to specify the column types for our two columns that were guessed wrong and skip the "junk" column. The argument `col_types` takes a list where the name of each item in the list is a column name and the value is from the table below. You can use the function, like `col_double()` or the abbreviation, like `"d"`; for consistency with earlier in this chapter we will use the function names. Omitted column names are guessed.
| function | |abbreviation | type |
|:---------|:--------------|:-----|
| col_logical() | l | logical values |
| col_integer() | i | integer values |
| col_double() | d | numeric values |
| col_character() | c | strings |
| col_factor(levels, ordered) | f | a fixed set of values |
| col_date(format = "") | D | with the locale's date_format |
| col_time(format = "") | t | with the locale's time_format |
| col_datetime(format = "") | T | ISO8601 date time |
| col_number() | n | numbers containing the grouping_mark |
| col_skip() | _, - | don't import this column |
| col_guess() | ? | parse using the "best" type based on the input |
```{r tidier}
# omitted values are guessed
# ?col_date for format options
ct <- cols(
junk = col_skip(), # skip this column
order = col_integer(),
good = col_logical(),
date = col_date(format = "%Y-%m-%d")
)
tidier <- read_csv("data/mess.csv",
skip = 2,
col_types = ct,
lazy = FALSE)
```
You will get a message about parsing issues when you run this that tells you to run the `problems()` function to see a table of the problems. Warnings look scary at first, but always start by reading the message.
```{r, eval = FALSE}
problems()
```
```{r, echo = FALSE}
prob <- tibble(
row = 3L,
col = 2L,
expected = "an integer",
actual = "missing",
file = "data/mess.csv"
)
prob
```
The output of `problems()` tells you what row (`r prob$row[[1]]`) and column (`r prob$col[[1]]`) the error was found in, what kind of data was expected (`r prob$expected[[1]]`), and what the actual value was (`r prob$actual[[1]]`). If you specifically tell `read_csv()` to import a column as an integer, any characters (i.e., not numbers) in the column will produce a warning like this and then be recorded as `NA`. You can manually set what missing values are recorded as with the `na` argument.
```{r}
tidiest <- read_csv("data/mess.csv",
skip = 2,
na = "missing",
col_types = ct,
lazy = FALSE)
```
Now `order` is an integer variable where any empty cells contain `NA`. The variable `good` is a logical value, where `0` and `F` are converted to `FALSE`, while `1` and `T` are converted to `TRUE`. The variable `date` is a date type (adding leading zeros to the day). We'll learn in later chapters how to fix other problems, such as the `min_max` column containing two different types of data.
```{r tidiest-table, echo = FALSE}
head(tidiest)
```
## Working with real data
It's worth highlighting at this point that working with real data can be difficult because each dataset can be messy in its own way. Throughout this course we will show you common errors and how to fix them, but be prepared that when you start with working your own data, you'll likely come across problems we don't cover in the course and that's just part of joy of learning programming. You'll also get better at looking up solutions using sites like [Stack Overflow](https://stackoverflow.com/) and there's a fantastic [#rstats](https://twitter.com/hashtag/rstats) community on Twitter you can ask for help.
You may also be tempted to fix messy datasets by, for example, opening up Excel and editing them there. Whilst this might seem easier in the short term, there's two serious issues with doing this. First, you will likely work with datasets that have recurring messy problems. By taking the time to solve these problems with code, you can apply the same solutions to a large number of future datasets so it's more efficient in the long run. Second, if you edit the spreadsheet, there's no record of what you did. By solving these problems with code, you do so reproducibly and you don't edit the original data file. This means that if you make an error, you haven't lost the original data and can recover.
## Exercises
For the final step in this chapter, we will create a report using one of the in-built datasets to practice the skills you have used so far. You may need to refer back to previous chapters to help you complete these exercises and you may also want to take a break before you work through this section. We'd also recommend you knit at every step so that you can see how your output changes.
### New Markdown {#sec-exercises-new-rmd-4}
Create and save a new R Markdown document named `starwars_report.Rmd`. In the set-up code chunk load the packages `tidyverse` and `rio`.
We're going to use the built-in `starwars` dataset that contains data about Star Wars characters. You can learn more about the dataset by using the `?help` function.
### Import and export the dataset {#sec-exercises-load}
* First, load the in-built dataset into the environment. Type and run the code to do this in the console; do not save it in your Markdown.
* Then, export the dataset to a .csv file and save it in your `data` directory. Again, do this in the console.
* Finally, import this version of the dataset using `read_csv()` to an object named `starwars` - you can put this code in your Markdown.
`r hide()`
```{r eval = TRUE}
data(starwars)
export(starwars, "data/starwars.csv")
starwars <- read_csv("data/starwars.csv")
```
`r unhide()`
### Convert column types
* Check the column specification of `starwars`.
* Create a new column specification that lists the following columns as factors: `hair_color`, `skin_color`, `eye_color`, `sex`, `gender`, `homeworld`, and `species` and skips the following columns: `films`, `vehicles`, and `starships` (this is because these columns contain multiple values and are stored as lists, which we haven't covered how to work with). You do not have to set the factor orders (although you can if you wish).
* Re-import the dataset, this time with the corrected column types.
`r hide()`
```{r eval = TRUE}
spec(starwars)
corrected_cols <- cols(
name = col_character(),
height = col_double(),
mass = col_double(),
hair_color = col_factor(),
skin_color = col_factor(),
eye_color = col_factor(),
birth_year = col_double(),
sex = col_factor(),
gender = col_factor(),
homeworld = col_factor(),
species = col_factor(),
films = col_skip(),
vehicles = col_skip(),
starships = col_skip()
)
starwars <- read_csv("data/starwars.csv", col_types = corrected_cols)
```
`r unhide()`
### Plots {#sec-exercises-plots}
Produce the following plots and one plot of your own choosing. Write a brief summary of what each plot shows and any conclusions you might reach from the data.
```{r echo = FALSE, warning = FALSE}
ggplot(starwars, aes(height)) +
geom_histogram(binwidth = 25, colour = "black", alpha = .3) +
labs(title = "Height (cm) distribution of Star Wars Characters") +
theme_classic() +
scale_x_continuous(breaks = seq(from = 50, to = 300, by = 25))
```
```{r echo = FALSE, warning = FALSE}
ggplot(starwars, aes(height, mass)) +
geom_point() +
labs(title = "Mass (kg) by height (cm) distribution of Star Wars Characters") +
theme_classic() +
scale_x_continuous(breaks = seq(from = 0, to = 300, by = 50)) +
scale_y_continuous(breaks = seq(from = 0, to = 2000, by = 100)) +
coord_cartesian(xlim = c(0, 300))
```
```{r echo = FALSE}
ggplot(starwars, aes(x = gender, fill = gender)) +
geom_bar(show.legend = FALSE, colour = "black") +
scale_x_discrete(name = "Gender of character", labels = (c("Masculine", "Feminine", "Missing"))) +
scale_fill_brewer(palette = 2) +
theme_bw() +
labs(title = "Number of Star Wars characters of each gender")
```
`r hide()`
```{r eval = FALSE}
ggplot(starwars, aes(height)) +
geom_histogram(binwidth = 25, colour = "black", alpha = .3) +
scale_x_continuous(breaks = seq(from = 50, to = 300, by = 25)) +
labs(title = "Height (cm) distribution of Star Wars Characters") +
theme_classic()
```
```{r eval = FALSE}
ggplot(starwars, aes(height, mass)) +
geom_point() +
labs(title = "Mass (kg) by height (cm) distribution of Star Wars Characters") +
theme_classic() +
scale_x_continuous(breaks = seq(from = 0, to = 300, by = 50)) +
scale_y_continuous(breaks = seq(from = 0, to = 2000, by = 100)) +
coord_cartesian(xlim = c(0, 300))
```
```{r eval = FALSE}
ggplot(starwars, aes(x = gender, fill = gender)) +
geom_bar(show.legend = FALSE, colour = "black") +
scale_x_discrete(name = "Gender of character", labels = (c("Masculine", "Feminine", "Missing"))) +
scale_fill_brewer(palette = 2) +
labs(title = "Number of Star Wars characters of each gender") +
theme_bw()
```
`r unhide()`
### Make it look nice
* Add at least one Star Wars related image from an online source
* Hide the code and any messages from the knitted output
* Resize any images as you see fit
`r hide()`
```{r, eval = FALSE, verbatim='r, echo = FALSE, out.width = "50%", fig.cap="Adaptation of Star Wars logo created by Weweje; original logo by Suzy Rice, 1976. CC-BY-3.0"'}
knitr::include_graphics("https://upload.wikimedia.org/wikipedia/commons/thumb/c/ce/Star_wars2.svg/2880px-Star_wars2.svg.png")
```
```{r, out.width = "50%", echo = FALSE, fig.cap="Adaptation of Star Wars logo created by Weweje; original logo by Suzy Rice, 1976. CC-BY-3.0"}
knitr::include_graphics("https://upload.wikimedia.org/wikipedia/commons/thumb/c/ce/Star_wars2.svg/2880px-Star_wars2.svg.png")
```
`r unhide()`
### Share your work
Once you're done, share your knitted html file on the Week 4 Teams channel so other learners on the course can see how you approached the task.
```{r, include = FALSE}
# remove starwars.csv from the data folder so it doens't get inclued in the shared data. It sometimes gets created when authors run the code in this section
if (file.exists("data/starwars.csv"))
file.remove("data/starwars.csv")
```
## Glossary {#sec-glossary-data}
```{r, echo = FALSE, results='asis'}
glossary_table(as_kable = FALSE) |>
kableExtra::kable(row.names = FALSE, escape = FALSE) |>
unclass() |> cat()
```
## Further resources {#sec-resources-data}
* [Data import cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-import.pdf)
* [Chapter 11: Data Import](http://r4ds.had.co.nz/data-import.html) in *R for Data Science*
* [Multi-row headers](https://psyteachr.github.io/tutorials/multi-row-headers.html)
```{r, include = FALSE}
# clean up temp datasets
files <- c("data/avatar_na.csv", "data/family.csv")
file.exists(files) %>%
`[`(files, .) %>%
file.remove()
```