---
title: "Packages"
output:
  html_document:
    toc: TRUE
    toc_float: TRUE
    df_print: "paged"
---
So far, we have learned the basics of working with R. Over time, there have been many extensions to R's capabilities. Some of these have been added by the original developers of R, others have been added by ordinary R users. For some of the more specialized tasks in data analysis, basic R won't be enough and we will need to call on these extensions. An extension to R's basic capabilities is called a **package**. These packages first have to be installed, and then they can be loaded into our R session for us to use.
You can download and install a package from within RStudio. You don't need to go to a website or search for an installation file online. Go to the **Packages** tab in RStudio, and there you will see a list of all the extra packages you currently have installed, along with a short description and version number for each. To search for a package and then install it, click on **Install** and then type the name of the package you want. RStudio will autocomplete a package name once you start typing.
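The same installation can also be done with a command in the console, using the `install.packages()` function. The sketch below is one possible pattern, not the only one: it uses `requireNamespace()` to check whether the package is already installed, and only calls `install.packages()` if it is not, so that rerunning the code does not download the package again. The variable names `pkg` and `already_installed` are just for illustration.

```{r}
# Name of the package we want (MASS is used here only as an example)
pkg = "MASS"

# requireNamespace() returns TRUE if the package is installed and can be loaded,
# and FALSE otherwise
already_installed = requireNamespace(pkg, quietly = TRUE)

# Install from CRAN only if the package is missing (needs an internet connection)
if (!already_installed) {
  install.packages(pkg)
}
```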
We only need to install a package once, and then we can use it as often as we like. However, an installed package is not available automatically: we need to load it in each new R session, so every script that uses the package should load it first. Fortunately, loading packages is easy. The `library()` function loads a package, and the input is the name of the package.
We should put any `library()` commands at or near the very top of our R script or markdown file, so that other people can see easily what packages they will need to have if they want to run our data analysis.
If successful, the `library()` function outputs nothing, so you won't see anything printed in the console. If you try to load a package that is not available, you will see an error message like this:
```{r, error=TRUE}
library(makeSignificant) # (just an example; this package does not actually exist)
```
In this case, you have either spelled the name of the package wrong, or you need to install it first.
# Example: MASS
As an example, we will load the `MASS` package. This is a package that usually comes pre-installed with R, so you may not even need to install it before loading it. Check in the RStudio packages tab if you are not sure.
```{r}
library(MASS)
```
Packages can bring with them many different things, but the main things we are interested in are functions and data. Any new functions or data that a package adds to R are available as normal once we have loaded the package.
The `MASS` package is a package that accompanies the textbook [*Modern Applied Statistics with S*](https://www.springer.com/gp/book/9780387954578) by Venables and Ripley (**S** is an earlier programming language on which R is based). The package includes many sets of data that are discussed in the textbook. One of these is the birth weights data set, which we saw briefly in the last tutorial. We loaded a tidied-up and abbreviated version of this data set from various file types. The `MASS` package makes these data available directly in R, under the name `birthwt`. To use them, we can assign them into a data frame.
```{r}
bw = birthwt
```
However, as we can see here, the column names in the original format are less informative, and some variables are stored as numeric when they should really be factors:
```{r}
bw
```
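The second of these problems can be fixed with the basic R function `factor()`. The following is a minimal sketch rather than a complete clean-up: it converts only the `smoke` and `race` columns, using the codings described on the `birthwt` help page (0 and 1 for smoking status; 1, 2 and 3 for white, black and other). The name `bw_tidy` and the wording of the labels are our own choices.

```{r}
library(MASS)

# Work on a copy so that the original data stay unchanged
bw_tidy = birthwt

# Turn the numeric codes into labeled factors
bw_tidy$smoke = factor(bw_tidy$smoke, levels = c(0, 1), labels = c("no", "yes"))
bw_tidy$race = factor(bw_tidy$race, levels = 1:3, labels = c("white", "black", "other"))

# Factors are summarized as counts per category rather than as numbers
summary(bw_tidy[, c("smoke", "race")])
```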
Packages also include help pages for the data and functions that they add. Try typing `?birthwt` in the console to see the help information for the birth weights data.
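To find out what else a package offers, we can also ask R directly. As a small illustration, the `data()` function with the `package` argument returns a list of the data sets that a package provides. The name `mass_data` below is just for this example.

```{r}
# Names of the data sets that the MASS package provides
mass_data = data(package = "MASS")$results[, "Item"]

# Show the first few names
head(mass_data)

# Check that the birth weights data are among them
"birthwt" %in% mass_data
```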
# Packages for loading data
There are various useful packages that provide functions for loading data from files generated by other programs. For example, the `readxl` package brings with it a function `read_excel()` that can read data from a Microsoft Excel file, just as the basic R function `read.csv()` reads data from a CSV file.
It is important not to confuse the name of the package with the names of the functions that the package adds. We first load the package, then use the function that we want from it.
```{r}
library(readxl)
bw = read_excel("data/birth_weights.xlsx")
```
The `read_excel()` function includes a useful additional argument `sheet` for specifying which sheet in an Excel file to read.
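As a brief sketch of how `sheet` can be used, the example below reads from a demonstration file that is installed together with `readxl`, so it does not depend on our own data folder. The `readxl_example()` and `excel_sheets()` functions are both part of `readxl`; the variable names `example_file`, `sheet_names` and `cars` are our own.

```{r}
library(readxl)

# Path to an example Excel file that ships with the readxl package
example_file = readxl_example("datasets.xlsx")

# List the names of the sheets in the file
sheet_names = excel_sheets(example_file)
sheet_names

# Read one particular sheet, chosen by name (a sheet number also works)
cars = read_excel(example_file, sheet = "mtcars")
head(cars)
```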
The `haven` package supplies a similar function for loading data from files generated by SPSS, a popular proprietary data analysis program.
```{r}
library(haven)
bw = read_spss("data/birth_weights.sav")
```
# ggplot
One package that we will make very extensive use of in the remaining tutorials is 'ggplot'. This package provides functions for creating graphs of data, an essential task in data analysis. Make sure that you have installed this package (or updated it to the latest version) before you follow the later tutorials.
The main function that ggplot provides is called `ggplot()`, but the name of the *package* that we need to install and then load is **ggplot2**.
We will learn in a later tutorial how to use ggplot, but as a quick preview, we can load it here and display a quick plot of the birth weights data. How these commands work will be explained later.
```{r}
library(ggplot2)
fig = ggplot(bw, aes(x=Weight, y=Birth_weight)) +
  geom_point() +
  labs(x='Mother\'s weight (kg)', y='Birth weight (kg)')
print(fig)
```
# Masking
As we have just seen, loading a package makes new functions available for use in our R session. Occasionally, two different packages bring with them two different functions that have the same name, and if we work with multiple packages at once we may have to load two such packages. In this case there is a conflict of names: when we type the shared name, R has to pick one of the two functions. Whichever package we load second takes precedence. Its function hides, or 'masks', the earlier one with the same name. The earlier function is not deleted; it is just no longer the one that R finds first.
R warns us when this has occurred. For example, the `psych` package contains a function `alpha()` that has the same name as a (completely different) function from ggplot.
```{r}
library(psych)
```
If we only want to use one of the two functions, then this doesn't matter. We just need to take care with the order in which we load packages, and load the package whose function we want to use *after* the package whose same-named function we don't want to use.
Alternatively, if we need to use both of the same-named functions in the same R script, we can still do so, by letting R know explicitly where to get the function from. The `::` symbol retrieves a function from a specific package. So if we want to use ggplot's `alpha()` we write `ggplot2::alpha()`, and if we want to use psych's `alpha()` we write `psych::alpha()`.
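As a short illustration, the example below calls ggplot2's `alpha()` explicitly. This function takes a color and makes it partly transparent. Written this way, the call gives the same result whether or not `psych` has been loaded afterwards. The variable name `semi_red` is our own.

```{r}
# Ask for alpha() from ggplot2 specifically, regardless of loading order
semi_red = ggplot2::alpha("red", 0.5)

# The result is a color code with transparency information added
semi_red
```

The `::` form also works for a package that is installed but has not been loaded with `library()`, which can be handy when we need only a single function from it.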
# Updating
If you want to know what packages are loaded in your current R session, you can look at the **Packages** tab in RStudio and see which packages have a tick next to their names. Or you can use the `sessionInfo()` function in the console. Under **other attached packages** you see the names of the packages that you have loaded, along with their version numbers.
```{r}
sessionInfo()
```
`sessionInfo()` also tells us our version of R and some other information about the way our system is set up. This information can be useful if we want to make our analysis completely transparent and reproducible. If we include the output of `sessionInfo()` in an appendix to our analysis, others can see exactly what version of R and certain packages we used, and make sure they use the same versions when checking our work. For most basic work this isn't strictly necessary, but it becomes more important for more elaborate analyses that use cutting-edge packages whose behavior may change slightly from one version to the next.
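If we only want the version of one particular package rather than the full `sessionInfo()` output, the `packageVersion()` function is a more compact alternative. A brief sketch, using `MASS` as the example package and `mass_version` as an arbitrary variable name:

```{r}
# Version of a single installed package
mass_version = packageVersion("MASS")
mass_version

# Version of R itself, as a short piece of text
R.version.string
```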
From time to time packages are updated and improved by the people who maintain them. Before embarking on a new project, we should check that we have the latest version. In the packages tab in RStudio there is an **Update** button, which will download and install new versions of installed packages if available. Alternatively, you can achieve this with the `update.packages()` function in the console.