# Data Types {#sec-data-types}
## Basic data types
Data can be numbers, words, true/false values or combinations of these. The basic `r glossary("data type", "data types")` in R are: `r glossary("numeric")`, `r glossary("character")`, and `r glossary("logical")`, as well as the special classes of `r glossary("factor")` and date/times.
```{r excel-format-cells-e, echo = FALSE, fig.cap="Data types are like the categories when you format cells in Excel."}
include_graphics("images/appx/excel-format-cells.png")
```
### Numeric data
All numbers are `r glossary("numeric")` data types. There are two types of numeric data: `r glossary("integer")` and `r glossary("double")`. Integers are whole numbers, like -1, 0 and 1. Doubles are numbers that can have fractional amounts. If you just type a plain number such as `10`, it is stored as a double, even if it doesn't have a decimal point. If you want it to be an exact integer, you can use the `L` suffix (`10L`), but this distinction doesn't make much difference in practice.
If you ever want to know the data type of something, use the `typeof()` function.
```{r numeric-data}
typeof(10) # double
typeof(10.0) # double
typeof(10L) # integer
```
If you want to know if something is numeric (a double or an integer), you can use the function `is.numeric()` and it will tell you if it is numeric (`TRUE`) or not (`FALSE`).
```{r}
is.numeric(10L)
is.numeric(10.0)
is.numeric("Not a number")
```
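The integer/double distinction mostly becomes visible when you check types after arithmetic. The sketch below uses only base R; the values in the comments follow from R's standard arithmetic rules.

```{r}
typeof(1L + 1L)   # "integer": integer plus integer stays an integer
typeof(1L + 1.5)  # "double": mixing in a double gives a double
typeof(5L / 2L)   # "double": division always returns a double
as.integer(10)    # 10, converted from double to integer
is.integer(10)    # FALSE, because a plain 10 is stored as a double
```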
### Character data
`r glossary("character", "Characters")` (also called "strings") are any text between quotation marks.
```{r character-data}
typeof("This is a character string")
typeof('You can use double or single quotes')
```
This can include quotes, but you have to `r glossary("escape")` quotes using a backslash to signal that the quote isn't meant to be the end of the string.
```{r quote}
my_string <- "The instructor said, \"R is cool,\" and the class agreed."
cat(my_string) # cat() prints the arguments
```
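As an alternative to escaping, you can wrap the string in the other kind of quote. This is a small sketch with made-up example text; the names `escaped` and `single` are only for illustration.

```{r}
escaped <- "She said, \"hello.\""  # double quotes escaped with backslashes
single  <- 'She said, "hello."'    # single quotes outside, so no escaping needed
identical(escaped, single)         # TRUE: both store exactly the same text
nchar(escaped) == nchar(single)    # TRUE: the backslashes are not stored in the string
```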
### Logical data
`r glossary("logical", "Logical")` data (also sometimes called "boolean" values) is one of two values: true or false. In R, we always write them in uppercase: `TRUE` and `FALSE`.
```{r logical-data}
class(TRUE)
class(FALSE)
```
When you compare two values with an `r glossary("operator")`, such as checking to see if 10 is greater than 5, the resulting value is logical.
```{r logical-operator}
is.logical(10 > 5)
```
::: {.callout-note}
You might also see logical values abbreviated as `T` and `F`, or written as `1` and `0`. This can cause some problems down the road, so we will always spell out the whole thing.
:::
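One reason the abbreviations can cause problems is that `T` and `F` are ordinary variables that can be overwritten, whereas `TRUE` and `FALSE` are reserved words. The sketch below demonstrates this in base R; the reassignment is deliberate and is undone straight away, and the names `shadowed` and `restored` are only for illustration.

```{r}
T <- FALSE             # allowed, because T is just a variable name
shadowed <- isTRUE(T)  # FALSE: T no longer means TRUE
rm(T)                  # remove our variable so T refers to TRUE again
restored <- isTRUE(T)  # TRUE

# 1 and 0 are numbers, not logical values, although they compare as equal
typeof(1)              # "double"
1 == TRUE              # TRUE, because TRUE is coerced to 1 for the comparison
identical(1, TRUE)     # FALSE: the data types differ
```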
### Factors
A `r glossary("factor")` is a specific type of integer that lets you specify the categories and their order. This is useful in data tables to make plots display with categories in the correct order.
```{r}
myfactor <- factor("B", levels = c("A", "B", "C"))
myfactor
```
Factors are a type of integer, but you can tell that they are factors by checking their `class()`.
```{r}
typeof(myfactor)
class(myfactor)
```
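To see how the category order is stored, it can help to look at a factor with more than one value. This is a minimal sketch with a hypothetical `sizes` factor, not part of the examples above.

```{r}
sizes <- factor(
  c("medium", "small", "large", "small"),
  levels = c("small", "medium", "large")
)
levels(sizes)      # "small" "medium" "large": the order we specified
as.integer(sizes)  # 2 1 3 1: each value's position in the levels
sort(sizes)        # small small medium large: level order, not alphabetical
```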
### Dates and times
Dates and times are represented by doubles with special classes. Although `typeof()` will tell you they are a double, you can tell that they are dates by checking their `class()`. Datetimes can have one or more of a few classes that start with `POSIX`.
```{r}
date <- as.Date("2022-01-24")
datetime <- ISOdatetime(2022, 1, 24, 10, 35, 00, "GMT")
typeof(date)
typeof(datetime)
class(date)
class(datetime)
```
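The doubles underneath count time from 1 January 1970: days for dates and seconds for datetimes. The sketch below uses base R only, with the hypothetical variable names `day` and `moment`.

```{r}
day <- as.Date("2022-01-24")
unclass(day)                 # the underlying double: days since 1970-01-01
day - as.Date("1970-01-01")  # the same number, shown as a time difference
day + 7                      # dates support arithmetic: one week later

moment <- ISOdatetime(2022, 1, 24, 10, 35, 0, tz = "GMT")
unclass(moment)              # seconds since 1970-01-01 00:00:00 GMT
```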
See @sec-dates-times for how to use <pkg>lubridate</pkg> to work with dates and times.
```{r, include = FALSE}
int <- c(answer = "integer", "double", "character", "logical", "factor")
dbl <- c("integer", answer = "double", "character", "logical", "factor")
chr <- c("integer", "double", answer = "character", "logical", "factor")
logi <- c("integer", "double", "character", answer = "logical", "factor")
fac <- c("integer", "double", "character", "logical", answer = "factor")
```
::: {.callout-note .try}
What data types are these:
* `100` `r mcq(dbl)`
* `100L` `r mcq(int)`
* `"100"` `r mcq(chr)`
* `100.0` `r mcq(dbl)`
* `-100L` `r mcq(int)`
* `factor(100)` `r mcq(fac)`
* `TRUE` `r mcq(logi)`
* `"TRUE"` `r mcq(chr)`
* `FALSE` `r mcq(logi)`
* `1 == 2` `r mcq(logi)`
:::
## Basic container types {#sec-containers}
Individual data values can be grouped together into containers. The main types of containers we'll work with are vectors, lists, and data tables.
### Vectors {#sec-vectors}
A `r glossary("vector")` in R is a set of items (or 'elements') in a specific order. All of the elements in a vector must be of the same **data type** (numeric, character, logical). You can create a vector by enclosing the elements in the function `c()`.
```{r vectors}
## put information into a vector using c(...)
c(1, 2, 3, 4)
c("this", "is", "cool")
1:6 # shortcut to make a vector of all integers x:y
```
::: {.callout-note .try}
What happens when you mix types? What data type is the variable `mixed`?
```{r}
mixed <- c(2, "good", 2L, "b", TRUE)
```
```{r, webex.hide="Solution"}
typeof(mixed)
```
:::
::: {.callout-warning}
You can't mix data types in a vector; all elements of the vector must be the same data type. If you mix them, R will `r glossary("coercion", "coerce")` them so that they are all the same. If you mix doubles and integers, the integers will be changed to doubles. If you mix characters and numeric types, the numbers will be coerced to characters, so `10` would turn into `"10"`.
:::
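The coercion rules described above can be checked directly with `typeof()`. This is a short sketch in base R; the values are arbitrary.

```{r}
typeof(c(1L, 2.5))   # "double": the integer is promoted to a double
typeof(c(TRUE, 2L))  # "integer": TRUE becomes 1L
c(10, "ten")         # "10" "ten": the number becomes a character string
c(1.5, TRUE, "a")    # "1.5" "TRUE" "a": everything becomes a character string
```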
#### Selecting values from a vector
If we want to pick specific values out of a vector by position, we can use square brackets (an `r glossary("extract operator")`, or `[]`) after the vector.
```{r vec_select}
values <- c(10, 20, 30, 40, 50)
values[2] # selects the second value
```
You can select more than one value from the vector by putting a vector of numbers inside the square brackets. For example, you can select the 18th, 19th, 20th, 21st, 4th, 9th and 15th letters from the built-in vector `LETTERS` (which gives all the uppercase letters in the Latin alphabet).
```{r vec_index}
word <- c(18, 19, 20, 21, 4, 9, 15)
LETTERS[word]
```
::: {.callout-note .try}
Can you decode the secret message?
```{r}
secret <- c(14, 5, 22, 5, 18, 7, 15, 14, 14, 1, 7, 9, 22, 5, 25, 15, 21, 21, 16)
```
```{r, webex.hide="Solution"}
LETTERS[secret]
```
:::
You can also create 'named' vectors, where each element has a name. For example:
```{r vec_named}
vec <- c(first = 77.9, second = -13.2, third = 100.1)
vec
```
We can then access elements by name using a character vector within the square brackets. We can put them in any order we want, and we can repeat elements:
```{r vec_named2}
vec[c("third", "second", "second")]
```
::: {.callout-note}
We can get the vector of names using the `names()` function, and we can set or change them using something like `names(vec) <- c("n1", "n2", "n3")`.
:::
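Getting and setting names can be sketched as follows. The vector `scores` is a hypothetical copy of the values used above, so the original `vec` is left unchanged.

```{r}
scores <- c(first = 77.9, second = -13.2, third = 100.1)
names(scores)                         # "first" "second" "third"
names(scores) <- c("n1", "n2", "n3")  # replace all three names
scores["n2"]                          # -13.2, now accessed by its new name
```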
Another way to access elements is by using a logical vector within the square brackets. This will pull out the elements of the vector for which the corresponding element of the logical vector is `TRUE`. If the logical vector is shorter than the original, R will repeat (recycle) it until it is long enough. You can find out how long a vector is using the `length()` function.
```{r vec_len}
length(LETTERS)
LETTERS[c(TRUE, FALSE)]
```
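In practice, the logical vector is often produced by a comparison rather than typed by hand. This is a minimal sketch with a hypothetical `temps` vector.

```{r}
temps <- c(10, 20, 30, 40, 50)
is_warm <- temps > 25  # FALSE FALSE TRUE TRUE TRUE
temps[is_warm]         # 30 40 50
temps[c(TRUE, FALSE)]  # 10 30 50: the short logical vector is recycled
```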
#### Repeating sequences {#sec-rep_seq}
Here are some useful tricks to save typing when creating vectors.
In the command `x:y`, the `:` operator gives you the sequence of numbers starting at `x` and going to `y` in increments of 1.
```{r colon}
1:10
15.3:20.5
0:-10
```
What if you want to create a sequence but with something other than integer steps? You can use the `seq()` function. Look at the examples below and work out what the arguments do.
```{r seq}
seq(from = -1, to = 1, by = 0.2)
seq(0, 100, length.out = 11)
seq(0, 10, along.with = LETTERS)
```
What if you want to repeat a vector many times? You could either type it out (painful) or use the `rep()` function, which can repeat vectors in different ways.
```{r rep1}
rep(0, 10) # ten zeroes
rep(c(1L, 3L), times = 7) # alternating 1 and 3, 7 times
rep(c("A", "B", "C"), each = 2) # A to C, 2 times each
```
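The `times` and `each` arguments can also be combined, and `times` can take one count per element. The sketch below shows both; the names `blocks` and `uneven` are only for illustration.

```{r}
blocks <- rep(c("A", "B"), each = 2, times = 3)  # A A B B, repeated 3 times
length(blocks)                                   # 12
uneven <- rep(c("A", "B"), times = c(3, 1))      # A A A B: a count per element
uneven
```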
The `rep()` function is useful for creating a vector of logical values (`TRUE`/`FALSE`) to select values from another vector.
```{r eiko}
# Get IDs in the pattern Y Y N N ...
ids <- 1:40
yynn <- rep(c(TRUE, FALSE), each = 2,
length.out = length(ids))
ids[yynn]
```
### Lists
Recall that vectors can contain data of only one type. What if you want to store a collection of data of different data types? For that purpose you would use a `r glossary("list")`. Define a list using the `list()` function.
```{r list-define}
data_types <- list(
double = 10.0,
integer = 10L,
character = "10",
logical = TRUE
)
str(data_types) # str() prints lists in a condensed format
```
You can refer to elements of a list using square brackets like a vector, but you can also use the dollar sign notation (`$`) if the list items have names.
```{r}
data_types$logical
```
::: {.callout-note .try}
Explore the 5 ways shown below to extract a value from a list. What data type is each object? What is the difference between the single and double brackets? Which one is the same as the dollar sign?
```{r}
bracket1 <- data_types[1]
bracket2 <- data_types[[1]]
name1 <- data_types["double"]
name2 <- data_types[["double"]]
dollar <- data_types$double
```
:::
The single brackets (`bracket1` and `name1`) return a list with the subset of items inside the brackets. In this case, that's just one item, but it can be more (try `data_types[1:2]`). The items keep their names if they have them, so the returned value is `list(double = 10)`.
The double brackets (`bracket2` and `name2`) return a single item as a vector. You can't select more than one item; `data_types[[1:2]]` will give you a "subscript out of bounds" error.
The dollar-sign notation is the same as double-brackets. If the name has spaces or any characters other than letters, numbers, underscores, and full stops, you need to surround the name with backticks (e.g., `` sales$`Customer ID` ``).
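The difference between the three notations can be checked with `typeof()` and `identical()`. This sketch uses a small hypothetical list called `record` rather than `data_types`.

```{r}
record <- list(name = "Toph", age = 12L)  # hypothetical example list

typeof(record["age"])                   # "list": single brackets return a list
typeof(record[["age"]])                 # "integer": double brackets return the item
identical(record[["age"]], record$age)  # TRUE: $ gives the same result as [[ ]]
length(record[c("name", "age")])        # 2: single brackets can select several items
```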
### Tables {#sec-tables-data}
`r glossary("Tabular data")` structures allow for a collection of data of different types (characters, integers, logical, etc.) but subject to the constraint that each "column" of the table (element of the list) must have the same number of elements. The base R version of a table is called a `data.frame`, while the 'tidyverse' version is called a `tibble`. Tibbles are far easier to work with, so we'll be using those. To learn more about differences between these two data structures, see `vignette("tibble")`.
```{r avatar}
# construct a table by column with tibble
avatar <- tibble(
name = c("Katara", "Toph", "Sokka"),
bends = c("water", "earth", NA),
friendly = TRUE
)
# or by row with tribble
avatar <- tribble(
~name, ~bends, ~friendly,
"Katara", "water", TRUE,
"Toph", "earth", TRUE,
"Sokka", NA, TRUE
)
```
```{r, eval = FALSE}
# export the data to a file
rio::export(avatar, "data/avatar.csv")
# or by importing data from a file
avatar <- rio::import("data/avatar.csv")
```
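The idea that a table is a list of equal-length columns can be checked directly. The sketch below uses base R's `data.frame()` so that it needs no packages (an assumption made for portability; a tibble also reports `"list"` from `typeof()`), and the names `team` and `result` are only for illustration.

```{r}
team <- data.frame(
  name = c("Katara", "Toph", "Sokka"),
  friendly = TRUE  # a single value is repeated to fill the column
)
typeof(team)   # "list": a table is a list of columns
lengths(team)  # every column has 3 elements

# 3 and 2 elements cannot be lined up into rows, so this is an error
result <- tryCatch(
  data.frame(x = 1:3, y = c("a", "b")),
  error = function(e) "error: columns differ in length"
)
result
```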
Tabular data becomes especially important when we talk about `r glossary("tidy data")` in @sec-tidy, which consists of a set of simple principles for structuring data.
#### Table info
We can get information about the table using the following functions.
* `ncol()`: number of columns
* `nrow()`: number of rows
* `dim()`: the number of rows and number of columns
* `names()`: the column names
* `glimpse()`: the column types
```{r}
nrow(avatar)
ncol(avatar)
dim(avatar)
names(avatar)
glimpse(avatar)
```
#### Accessing rows and columns {#sec-row-col-access}
There are various ways of accessing specific columns or rows from a table. You'll be learning more about this in @sec-tidy and @sec-wrangle.
```{r dataframe-access-tidyverse}
siblings <- avatar %>% slice(1, 3) # rows (by number)
bends <- avatar %>% pull(2) # column vector (by number)
friendly <- avatar %>% pull(friendly) # column vector (by name)
bends_name <- avatar %>% select(bends, name) # subset table (by name)
toph <- avatar %>% pull(name) %>% pluck(2) # single cell
```
The code below uses `r glossary("base R")` to produce the same subsets as the functions above. This syntax is useful to know about, since you might see it in other people's scripts.
```{r dataframe-access-base}
# base R access
siblings <- avatar[c(1, 3), ] # rows (by number)
bends <- avatar[[2]] # column vector (by number); avatar[, 2] would return a one-column tibble
friendly <- avatar$friendly # column vector (by name)
bends_name <- avatar[, c("bends", "name")] # subset table (by name)
toph <- avatar[[2, 1]] # single cell (row, col)
```
## Glossary {#sec-glossary-datatypes}
```{r, echo = FALSE, results='asis'}
glossary_table(as_kable = FALSE) |>
kableExtra::kable(row.names = FALSE, escape = FALSE) |>
unclass() |> cat()
```