2_01_data_frames.Rmd

# Data frames

## Introduction

We learned in the [A quick introduction to R] chapter that the word "variable" is used as short hand for any kind of named object. For example, we can make a variable called `num_vec` that refers to a simple numeric vector using:
```{r}
num_vec <- 1:10
```
When a computer scientist talks about variables they're usually referring to these sorts of name-value associations. However, the word "variable" has a second, more abstract meaning in the world of data analysis and statistics: it refers to anything we can control or measure. For example, if our data comes from an experiment, the data will typically involve variables whose values describe the experimental conditions (e.g. "control plots" vs. "fertiliser plots") and the quantities we chose to measure (e.g. species biomass and diversity). 

These kinds of abstract variables are often called "statistical variables". Statistical variables can be further broken down into a range of different types, such as numeric and categorical variables. We'll discuss these later on in the [Exploratory data analysis] chapter. The reason we're pointing out the dual meaning of the word "variable" here is because we need to work with both interpretations. The dual meaning can be confusing, but both meanings are in widespread use so we just have to get used to them. We'll try to minimise confusion by using the phrase "statistical variable" when we are referring to data, rather than R objects.

We're introducing these ideas now because we're going to consider a new type of data object in R: the __data frame__. Real world data analysis involves collections of data ("data sets") that involve several related statistical variables. We've seen that an atomic vector can only be used to store one type of data such as a collection of numbers. This means a vector can be used to store a single statistical variable, How should we keep a large collection of variables organised? We could work with them separately but this is very error prone. Ideally, we need a way to keep related variables together. This is the problem that __data frames__ are designed to manage.

## Data frames {#data-frames-intro}

Data frames are one of those R features that mark it out as a particularly good environment for data analysis. We can think of a data frame as table-like objects with rows and columns. They collect together different statistical variables, storing each of them as a different column. Related observations are all found in the same row. This will make more sense in a moment. Let's consider the columns first.

Each column is a vector of some kind. These are usually simple vectors containing numbers or character strings, though it is also possible to include more complicated vectors inside data frames. We'll only work with data frames made up of relatively simple vectors in this book. The key constraint that a data frame applies is that each vector must have the same length. This is what gives a data frame it table-like structure. 

The simplest way to get a feel for data frames is to make one. Data frames are usually constructed by reading some external data into R, but for the purposes of learning about them it is better to build one from its component parts. We'll make some artificial data describing a hypothetical experiment to do this. Imagine that we've conducted a small experiment to examine biomass and community diversity in six field plots. Three plots were subjected to fertiliser enrichment. The other three plots act as experimental controls. We could store the data describing this experiment in three vectors: 

-   `trt` (short for "treatment") shows which experimental manipulation was used.

-   `bms` (short for "biomass") shows the total biomass measured at the end of the experiment.

-   `div` (short for "diversity") shows the number of species present at the end of the experiment.

Here's some R code to generate these three vectors (it doesn't matter what the actual values are, they're made up):
```{r}
trt <- rep(c("Control","Fertilser"), each = 3) 
bms <- c(284, 328, 291, 956, 954, 685)
div <- c(8, 12, 11, 8, 4, 5)
```
```{r}
trt
bms
div
```
Notice that the information about different observations are linked by their positions in these vectors. For example, the third control plot had a biomass of '291' and a species diversity '11'.

We can use the `data.frame` function to construct a data frame from one or more vectors. To build a data frame from the three vectors we created and print these to the Console, we use:
```{r}
experim.data <- data.frame(trt, bms, div)
experim.data
```
Notice what happens when we print the data frame: it is displayed as though it has rows and columns. That's what we meant when we said a data frame is a table-like structure. The `data.frame` function takes a variable number of arguments. We used the `trt`, `bms` and `div` vectors as arguments, resulting in a data frame with three columns. Each of these vectors has 6 elements, so the resulting data frame has 6 rows. The names of the vectors were used to name its columns. The rows do not have names, but they are numbered to reflect their position.

The words `trt`, `bms` and `div` are not very informative. If we prefer to work with more informative column names---this is always a good idea---then we have to name the `data.frame` arguments:
```{r}
experim.data <- data.frame(Treatment = trt, Biomass = bms, Diversity = div)
experim.data
```
The new data frame contains the same data as the previous one but now the column names correspond to the names we chose. These names are better because they describe each variable using a human-readable word. 

```{block, type="warning"}
#### Don't bother with row names

We can also name the rows of a data frame using the `row.names` argument of the `data.frame` function. We won't bother to show an example of this though. Why? We can't easily work with the information in row names so there's not much point adding it. If we need to include row-specific data in a data frame it's best to include an additional variable, i.e. an extra column.
```

## Exploring data frames

The first things we should do when presented with a new data set is explore its structure to understand what we're dealing with. This is easy when the data is stored in a data frame. If the data set is reasonably small we can just print it to the Console. This is not very practical for even moderate-sized data sets though. The `head` and `tail` functions extract the first and last few rows of a data set, so these can be used to print part of a data set. The `n` argument controls the number of rows printed:
```{r}
head(experim.data, n = 3)
tail(experim.data, n = 3)
```

The `View` function can be used to visualise the whole data set in a spreadsheet like view:
```{r, eval=FALSE}
View(experim.data)
```
This shows the rows and columns of the data frame argument in a table- or spreadsheet-like format. When we run this in RStudio a new tab opens up with the `experim.data` data inside it. 

```{block, type=""}
#### `View` only displays the data

The `View` function is designed to allow us to display the data in a data frame as a table of rows and columns. We can't change the data in any way with the `View` function. We can reorder the way the data are presented, but keep in mind that this won't alter the underlying data. 
```

There are quite a few different R functions that will extract information about a data frame. The `nrow` and `ncol` functions return the number of rows and columns, respectively:
```{r}
nrow(experim.data)
ncol(experim.data)
```
The `names` function is used to extract the column names from a data frame:
```{r}
colnames(experim.data)
```
The `experim.data` data frame has three columns, so `names` returns a character vector of length three, where each element corresponds to a column name. There is also a `rownames` function if we need that too. The `nrow`, `ncol`, `names` and `rownames` functions each return a vector, so we can assign the result if we need to use it later. For example, if we want to extract and store the column names for any reason we could use `varnames <- names(experim.data)`. 

## Extracting data from data frames {#extract-data}

Data frames would not be much use if we could not extract and modify the data in them. In this section we will briefly review how to carry out these kinds of operations using basic R functions.

### Extracting and adding a single variable

A data frame is just a collection of variables stored in columns, where each column is a vector of some kind. There are several ways to extract these variables from a data frame. If we just want to extract a single variable we have two options.

The first way of extracting a variable from a data frame uses a double square brackets construct, `[[`. For example, we extract the `Biomass` variable from our example data frame with the double square brackets like this:
```{r}
experim.data[["Biomass"]]
``` 
This prints whatever is in the `Biomass` column to the Console. What kind of object is this? It's a numeric vector:
```{r}
is.numeric(experim.data[["Biomass"]])
```
A data frame really is nothing more than a collection of vectors. Notice that all we did was print the resulting vector to the Console. If we want to actually do something with this numeric vector we need to assign the result:
```{r}
bmass <- experim.data$Biomass
bmass^2
```
Here, we extracted the `Biomass` variable, assigned it to `bmass`, and then squared this. The value of `Biomass` variable inside the `experim.data` data frame is unchanged.

Notice that we used `"Biomass"` instead of `Biomass` inside the double square brackets, i.e.  we quoted the name of the variable. This is because we want R to treat the word "Biomass" as a literal value. This little detail is important! If we don't quote the name then R will assume that `Biomass` is the name of an object and go in search of it in the global environment. Since we haven't created something called `Biomass`, leaving out the quotes generates an error:
```{r, error=TRUE}
experim.data[[Biomass]]
``` 
The error message is telling us that R can't find a variable called `Biomass` in the global environment. On the other hand, this example does work:
```{r, eval=TRUE}
vname <- "Biomass"
experim.data[[vname]]
``` 
This works because we first defined `vname` to be a character vector of length one, whose value is the name of a variable in `experim.data`. When R encounters `vname` inside the `[[` construct it goes and finds the value associated with it and uses this value to determine the variable to extract.

The second method for extracting a variable from a data frame uses the `$` operator. For example, to extract the `Biomass` column from the `experim.data` data frame, we use:
```{r}
experim.data$Biomass
```
We use the `$` operator by placing the name of the data frame we want to work with on the left hand side and and the name of the column (i.e. the variable) we want to extract on the right hand side. Notice that this time we didn't have to put quotes around the variable name when using the `$` operator. We can do this if we want to---i.e. `experim.data$"Biomass"` also works---but `$` doesn't require it.

Why is there more than one way to extract variables from a data frame? There's no simple way to answer this question without getting into the details of how R represents data frames. The simple answer is that `$` and `[[` are not actually equivalent, even though they appear to do much the same thing. We've looked at the two extraction methods because they are both widely used. However, the `$` method is a bit easier to read and people tend to prefer it for interactive data analysis tasks (the `[[` construct tends to be used when we need a bit more flexibility).

### Adding a variable to a data frame

How do we add a new variable to an existing data frame? It turns out that the `$` operator is also be used for this job by combining it with the assignment operator. Using it this way is fairly intuitive. For example, if we want to add a new (made up) variable called `Elevation` to `experim.data`, we do it like this:
```{r}
experim.data$Elevation <- c(364, 294, 321, 358, 298, 312)
```
This assigns some fake elevation data to a new variable in `experim.data` using the `$` operator. The new variable is called `Elevation` because that was the name we used on the right hand side of `$`. This changes `experim.data`, such that it now contains four columns (variables):
```{r}
head(experim.data, n = 3)
```

The `[[` operator can also be used with ` <- ` to add variables to a data frame. We won't bother to show an example, as it works in exactly the same way as `$` and we won't be using the `[[` method in this book.

### Subsetting data frames

What do we do if, instead of just extracting a single variable from a data frame, we need to select a subset of rows and/or columns? We use the single square brackets construct, `[`, to do this. There are two different ways we can use single square brackets, both of which involve the use of indexing vector(s) inside the `[` construct.

The first use of `[` allows us to subset one or more columns while keeping all the rows. This works exactly as the `[` does for vectors. Just think of columns as the elements of the data frame. For example, if we want to subset `experim.data` such that we are only left with the first and second columns (`Treatment` and `Biomass`), we can use a numeric indexing vector:
```{r}
experim.data[c(1:2)]
```
However, this is not a very good way to subset columns because we have to know the position of each variable. If for some reason we change the order of the columns, we have to update our R code accordingly. A better approach uses a character vector of column names inside the `[`:
```{r}
experim.data[c("Treatment", "Biomass")]
```

The second use of `[ ]` is designed to allow us to subset rows and columns at the same time. We have to specify both the rows and the columns we require, using a comma ("`,`") to separate a row and column index vector. This is easiest to understand with an example:
```{r}
# row index
rindex <- 1:3
# column index 
cindex <- c("Treatment", "Biomass")
# subset the data frame
experim.data[rindex, cindex]
```
This example extracts a subset of `experim.data` corresponding to rows 1 through 3, and columns "Treatment" and "Biomass". The `rindex` is a numeric vector of row positions, and `cindex` is a character vector of column names. This shows that rows and columns can be selected by referencing their position or their names. The rows are not named in `experim.data`, so we specified the positions.

Storing the index vectors first is quite a long-winded way of subsetting a data frame. However, there is nothing to stop us doing everything in one step:
```{r}
experim.data[1:3, c("Treatment", "Biomass")]
```
If we need to subset just rows or columns we just leave out the appropriate index vector:
```{r}
experim.data[1:3, ]
```
The absence of an index vector before/after the comma indicates that we want to keep every row/column. Here we kept all the columns but only the first three rows.

```{block, type="warning"}
#### Be careful with `[`

Subsetting with the `[rindex, cindex]` construct produces another data frame. This should be apparent from the way the last example was printed. This is __usually__ how this construct works. We say usually, because subsetting just one column produces a vector. This is very unfortunate, as it produces unpredictable behaviour if we're not paying attention.
```

The `[` construct works with three types of index vectors. We've just seen that the index vector can be a numeric or character type. The third approach uses a logical index vector. For example, we can subset the `experim.data` data frame, keeping just the rows where the `Treatment` variable is equal to "Control", using:
```{r}
# make a logical index vector
rindex <- experim.data $ Treatment == "Control"
rindex
# 
experim.data[rindex, ]
```
Notice that we construct the logical `rindex` vector by extracting the `Treatment` variable with the `$` operator and using the `==` operator to test for equality with "Control". Don't worry too much if that seems confusing. We combined many different ideas in that example. We're going to learn a much more transparent way to achieve the same result in later chapters.

## Final words

We've seen how to extract/add variables and subset data frames using the `$`, `[[` and `[` constructs. The last example also showed that we can use a combination of relational operators (e.g. `==`, `!=` or `>=`) and the square brackets construct to subset a data frame according to one or more criteria.  There are also a number of base R functions that allow us to manipulate data frames in a slightly more intuitive way. For example, there is a function called `transform` that adds new variables and changes existing ones, and a function called `subset` to select variables and subset rows in a data frame according to the values of its variables. 

We've shown these approaches because they're still used by many people. However, we will rely on the **dplyr** package to handle operations like subsetting and transforming data frame variables in this book. The **dplyr** package provides a much cleaner, less error prone framework for manipulating data frames, and can be used to work with similar kinds of objects that store data in consistent way. Before we can do that though, we need to learn a little bit about how to organise and import data into R.