app-datatypes.qmd

# Data Types {#sec-data-types}

## Basic data types 

Data can be numbers, words, true/false values or combinations of these. The basic `r glossary("data type", "data types")` in R are: `r glossary("numeric")`, `r glossary("character")`, and `r glossary("logical")`, as well as the special classes of `r glossary("factor")` and date/times.

```{r excel-format-cells-e, echo = FALSE, fig.cap="Data types are like the categories when you format cells in Excel."}
include_graphics("images/appx/excel-format-cells.png")
```


### Numeric data

All of the numbers are `r glossary("numeric")` data types. There are two types of numeric data, `r glossary("integer")` and `r glossary("double")`. Integers are the whole numbers, like -1, 0 and 1. Doubles are numbers that can have fractional amounts. If you just type a plain number such as `10`, it is stored as a double, even if it doesn't have a decimal point. If you want it to be an exact integer, you can use the `L` suffix (10L), but this distinction doesn't make much difference in practice.

If you ever want to know the data type of something, use the `typeof` function.

```{r numeric-data}
typeof(10)   # double
typeof(10.0) # double
typeof(10L)  # integer
```

If you want to know if something is numeric (a double or an integer), you can use the function `is.numeric()` and it will tell you if it is numeric (`TRUE`) or not (`FALSE`).

```{r}
is.numeric(10L)
is.numeric(10.0)
is.numeric("Not a number")
```

### Character data

`r glossary("character", "Characters")` (also called "strings") are any text between quotation marks. 

```{r character-data}
typeof("This is a character string")
typeof('You can use double or single quotes')
```

This can include quotes, but you have to `r glossary("escape")` quotes using a backslash to signal that the quote isn't meant to be the end of the string.

```{r quote}
my_string <- "The instructor said, \"R is cool,\" and the class agreed."
cat(my_string) # cat() prints the arguments
```

### Logical Data

`r glossary("logical", "Logical")` data (also sometimes called "boolean" values) is one of two values: true or false. In R, we always write them in uppercase: `TRUE` and `FALSE`.

```{r logical-data}
class(TRUE)
class(FALSE)
```

When you compare two values with an `r glossary("operator")`, such as checking to see if 10 is greater than 5, the resulting value is logical.

```{r logical-operator}
is.logical(10 > 5)
```

::: {.callout-note}
You might also see logical values abbreviated as `T` and `F`, or `0` and `1`. This can cause some problems down the road, so we will always spell out the whole thing.
:::

### Factors

A `r glossary("factor")` is a specific type of integer that lets you specify the categories and their order. This is useful in data tables to make plots display with categories in the correct order.

```{r}
myfactor <- factor("B", levels = c("A", "B","C"))
myfactor
```

Factors are a type of integer, but you can tell that they are factors by checking their `class()`.

```{r}
typeof(myfactor)
class(myfactor)
```

### Dates and Times

Dates and times are represented by doubles with special classes. Although `typeof()` will tell you they are a double, you can tell that they are dates by checking their `class()`. Datetimes can have one or more of a few classes that start with `POSIX`.

```{r}
date <- as.Date("2022-01-24")
datetime <- ISOdatetime(2022, 1, 24, 10, 35, 00, "GMT")
typeof(date)
typeof(datetime)
class(date)
class(datetime)
```

See @sec-dates-times for how to use <pkg>lubridate</pkg> to work with dates and times.


```{r, include = FALSE}
int <- c(answer = "integer", "double", "character", "logical", "factor")
dbl <- c("integer", answer = "double", "character", "logical", "factor")
chr <- c("integer", "double", answer = "character", "logical", "factor")
logi <- c("integer", "double", "character", answer = "logical", "factor")
fac <- c("integer", "double", "character", "logical", answer = "factor")
```

::: {.callout-note .try}
What data types are these:

* `100` `r mcq(dbl)`
* `100L` `r mcq(int)`
* `"100"` `r mcq(chr)`
* `100.0` `r mcq(dbl)`
* `-100L` `r mcq(int)`
* `factor(100)` `r mcq(fac)`
* `TRUE` `r mcq(logi)`
* `"TRUE"` `r mcq(chr)`
* `FALSE` `r mcq(logi)`
* `1 == 2` `r mcq(logi)`

:::


## Basic container types {#sec-containers}

Individual data values can be grouped together into containers. The main types of containers we'll work with are vectors, lists, and data tables.

### Vectors {#sec-vectors}

A `r glossary("vector")` in R is a set of items (or 'elements') in a specific order. All of the elements in a vector must be of the same **data type** (numeric, character, logical). You can create a vector by enclosing the elements in the function `c()`.

```{r vectors}
## put information into a vector using c(...)
c(1, 2, 3, 4)
c("this", "is", "cool")
1:6 # shortcut to make a vector of all integers x:y
```

::: {.callout-note .try}
What happens when you mix types? What class is the variable `mixed`?
```{r}
mixed <- c(2, "good", 2L, "b", TRUE)
```

```{r, webex.hide="Solution"}
typeof(mixed)
```

:::

::: {.callout-warning}
You can't mix data types in a vector; all elements of the vector must be the same data type. If you mix them, R will `r glossary("coercion", "coerce")` them so that they are all the same. If you mix doubles and integers, the integers will be changed to doubles. If you mix characters and numeric types, the numbers will be coerced to characters, so `10` would turn into `"10"`.
:::

#### Selecting values from a vector

If we wanted to pick specific values out of a vector by position, we can use square brackets (an `r glossary("extract operator")`, or `[]`) after the vector.

```{r vec_select}
values <- c(10, 20, 30, 40, 50)
values[2] # selects the second value
```

You can select more than one value from the vector by putting a vector of numbers inside the square brackets. For example, you can select the 18th, 19th, 20th, 21st, 4th, 9th and 15th letter from the built-in vector `LETTERS` (which gives all the uppercase letters in the Latin alphabet).

```{r vec_index}
word <- c(18, 19, 20, 21, 4, 9, 15)
LETTERS[word]
```

::: {.callout-note .try}

Can you decode the secret message?
```{r}
secret <- c(14, 5, 22, 5, 18, 7, 15, 14, 14, 1, 7, 9, 22, 5, 25, 15, 21, 21, 16)
```

```{r, webex.hide="Solution"}
LETTERS[secret]
```
:::

You can also create 'named' vectors, where each element has a name. For example:

```{r vec_named}
vec <- c(first = 77.9, second = -13.2, third = 100.1)
vec
```

We can then access elements by name using a character vector within the square brackets. We can put them in any order we want, and we can repeat elements:

```{r vec_named2}
vec[c("third", "second", "second")]
```

::: {.callout-note}
We can get the vector of names using the `names()` function, and we can set or change them using something like `names(vec2) <- c("n1", "n2", "n3")`.
:::

Another way to access elements is by using a logical vector within the square brackets. This will pull out the elements of the vector for which the corresponding element of the logical vector is `TRUE`. If the logical vector doesn't have the same length as the original, it will repeat. You can find out how long a vector is using the `length()` function.

```{r vec_len}
length(LETTERS)
LETTERS[c(TRUE, FALSE)]
```

#### Repeating Sequences {#sec-rep_seq}

Here are some useful tricks to save typing when creating vectors.

In the command `x:y` the `:` operator would give you the sequence of number starting at `x`, and going to `y` in increments of 1. 

```{r colon}
1:10
15.3:20.5
0:-10
```

What if you want to create a sequence but with something other than integer steps? You can use the `seq()` function. Look at the examples below and work out what the arguments do.

```{r seq}
seq(from = -1, to = 1, by = 0.2)
seq(0, 100, length.out = 11)
seq(0, 10, along.with = LETTERS)
```

What if you want to repeat a vector many times? You could either type it out (painful) or use the `rep()` function, which can repeat vectors in different ways.

```{r rep1}
rep(0, 10)                      # ten zeroes
rep(c(1L, 3L), times = 7)       # alternating 1 and 3, 7 times
rep(c("A", "B", "C"), each = 2) # A to C, 2 times each
```

The `rep()` function is useful to create a vector of logical values (`TRUE`/`FALSE` or `1`/`0`) to select values from another vector.

```{r eiko}
# Get IDs in the pattern Y Y N N ...
ids <- 1:40
yynn <- rep(c(TRUE, FALSE), each = 2, 
            length.out = length(ids))
ids[yynn]
```


### Lists

Recall that vectors can contain data of only one type. What if you want to store a collection of data of different data types? For that purpose you would use a `r glossary("list")`. Define a list using the `list()` function.

```{r list-define}   

data_types <- list(
  double = 10.0,
  integer = 10L,
  character = "10",
  logical = TRUE
)

str(data_types) # str() prints lists in a condensed format
```

You can refer to elements of a list using square brackets like a vector, but you can also use the dollar sign notation (`$`) if the list items have names.

```{r}
data_types$logical
```

::: {.callout-note .try}

Explore the 5 ways shown below to extract a value from a list. What data type is each object? What is the difference between the single and double brackets? Which one is the same as the dollar sign?

```{r}
bracket1 <- data_types[1]
bracket2 <- data_types[[1]]
name1    <- data_types["double"]
name2    <- data_types[["double"]]
dollar   <- data_types$double
```

:::

The single brackets (`bracket1` and `name1`) return a list with the subset of items inside the brackets. In this case, that's just one item, but can be more (try `data_types[1:2]`). The items keep their names if they have them, so the returned value is `list(double = 10)`.

The double brackets (`bracket2` and `name2` return a single item as a vector. You can't select more than one item; `data_types[[1:2]]` will give you a "subscript out of bounds" error. 

The dollar-sign notation is the same as double-brackets. If the name has spaces or any characters other than letters, numbers, underscores, and full stops, you need to surround the name with backticks (e.g., `` sales$`Customer ID` ``).


### Tables {#sec-tables-data}

`r glossary("Tabular data")` structures allow for a collection of data of different types (characters, integers, logical, etc.) but subject to the constraint that each "column" of the table (element of the list) must have the same number of elements. The base R version of a table is called a `data.frame`, while the 'tidyverse' version is called a `tibble`.  Tibbles are far easier to work with, so we'll be using those. To learn more about differences between these two data structures, see `vignette("tibble")`.

```{r avatar}
# construct a table by column with tibble
avatar <- tibble(
  name = c("Katara", "Toph", "Sokka"),
  bends = c("water", "earth", NA),
  friendly = TRUE
)

# or by row with tribble
avatar <- tribble(
  ~name,    ~bends,  ~friendly,
  "Katara", "water", TRUE,
  "Toph",   "earth", TRUE,
  "Sokka",  NA,      TRUE
)
```

```{r, eval = FALSE}
# export the data to a file
rio::export(avatar, "data/avatar.csv")

# or by importing data from a file
avatar <- rio::import("data/avatar.csv")
```

Tabular data becomes especially important for when we talk about `r glossary("tidy data")` in @sec-tidy, which consists of a set of simple principles for structuring data.

#### Table info

We can get information about the table using the following functions.

* `ncol()`: number of columns
* `nrow()`: number of rows
* `dim()`: the number of rows and number of columns 
* `name()`: the column names
* `glimpse()`: the column types

```{r}
nrow(avatar)
ncol(avatar)
dim(avatar)
names(avatar)
glimpse(avatar)
```

#### Accessing rows and columns {#sec-row-col-access}

There are various ways of accessing specific columns or rows from a table. You'll be learning more about this in @sec-tidy and @sec-wrangle. 

```{r dataframe-access-tidyverse}
siblings   <- avatar %>% slice(1, 3) # rows (by number)
bends      <- avatar %>% pull(2) # column vector (by number)
friendly   <- avatar %>% pull(friendly) # column vector (by name)
bends_name <- avatar %>% select(bends, name) # subset table (by name)
toph       <- avatar %>% pull(name) %>% pluck(2) # single cell
``` 

The code below uses `r glossary("base R")` to produce the same subsets as the functions above. This format is useful to know about, since you might see them in other people's scripts.

```{r dataframe-access-base}
# base R access

siblings   <- avatar[c(1, 3), ] # rows (by number)
bends      <- avatar[, 2] # column vector (by number)
friendly   <- avatar$friendly  # column vector (by name)
bends_name <- avatar[, c("bends", "name")] # subset table (by name)
toph       <- avatar[[2, 1]] # single cell (row, col)
```

## Glossary {#sec-glossary-datatypes}

```{r, echo = FALSE, results='asis'}
glossary_table(as_kable = FALSE) |> 
  kableExtra::kable(row.names = FALSE, escape = FALSE) |>
  unclass() |> cat()
```