generated from rstudio/bookdown-demo
-
Notifications
You must be signed in to change notification settings - Fork 17
/
Copy path13-visualization.Rmd
271 lines (178 loc) · 19.5 KB
/
13-visualization.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
# Visualization
I describe a few plotting paradigms in R and Python below. Note that these descriptions are brief. More details could easily turn any of these subsections into an entire textbook.
## Base R Plotting
R comes with some built-in plotting functions such as `plot()`, `hist()` and `boxplot()`. Many of these reside in `package:graphics`, which comes pre-loaded into the search path. `plot()` on the other hand, is higher up the search path in `package:base`--it is a generic method whose methods can reside in many places (e.g. in `package:graphics` or some place else).
Base plotting will usually cover most of your needs, so that is what we spend the most time learning about. However, there are a large number of third-party libraries for plotting that you might consider looking into if you want to use a certain aesthetic, or if you want plotting functionality that is specialized for certain cases (e.g. geospatial plots).
Recall our Albemarle Real Estate data set [@albemarle_county_gis_web] [@clay_ford].
```{r, collapse = TRUE, R.options = list(width = 65)}
df <- read.csv("data/albemarle_real_estate.csv")
str(df, strict.width = "cut")
```
If we wanted to get a general idea of how expensive homes were in Albemarle County, we could use a histogram. This helps us visualize a univariate numerical variable/column. Below I plot the (natural) logarithm of home prices.
```{r, collapse=TRUE, fig.cap="A Simple Histogram", fig.align='center', out.width='80%'}
hist(log(df$TotalValue),
xlab = "natural logarithm of home price",
main = "Super-Duper Plot!")
```
I specified the `xlab=` and `main=` arguments, but there are many more that could be tweaked. Make sure to skim the options in the documentation (`?hist`).
`plot()` is useful for plotting *two* univariate numerical variables. This can be done in time series plots (variable versus time) and scatter plots (one variable versus another).
```{r, collapse=T, fig.cap = "Some Scatterplots", fig.align="center", out.width='80%'}
par(mfrow=c(1,2))
plot(df$TotalValue, df$LotSize,
xlab = "total value ($)", ylab = "lot size (sq. ft.)",
pch = 3, col = "red", type = "b")
plot(log(df$TotalValue), log(df$LotSize),
xlab = "log. total value", ylab = "log. lot size",
pch = 2, col = "blue", type = "p")
abline(h = log(mean(df$LotSize)), col = "green")
par(mfrow=c(1,1))
```
I use some of the many arguments available (type `?plot`). `xlab=` and `ylab=` specify the x- and y-axis labels, respectively. `col=` is short for "color." `pch=` is short for "point character." Changing this will change the symbol shapes used for each point. `type=` is more general than that, but it is related. I typically use it to specify whether or not I want the points connected with lines.
I use a couple other functions in the above code. `abline()` is used to superimpose lines over the top of a plot. They can be `h`orizontal, `v`ertical, or you can specify them in slope-intercept form, or by providing a linear model object. I also used `par()` to set a graphical parameter. The graphical parameter `par()$mfrow` sets the layout of a multiple plot visualization. I then set it back to the standard $1 \times 1$ layout afterwards.
## Plotting with `ggplot2`
\index{ggplot2}[`ggplot2`](https://ggplot2.tidyverse.org/index.html) is a popular third-party visualization package for R.\index{ggplot2} There are also libraries in Python (e.g. [`plotnine`](https://plotnine.readthedocs.io/en/stable/#)) that have a similar look and feel. This subsection provides a short tutorial on how to use `ggplot2` in R, and it is primarily based off of the material provided in [@ggplot2]. Other excellent descriptions of `ggplot2` are [@r_in_action] and [@r_graphics_cookbook].
`ggplot2` code looks a lot different than the code in the above section^[Personally, I find its syntax more confusing, and so I tend to prefer base graphics. However, it is very popular, and so I do believe that it is important to mention it here in this text.]. There, we would write a series of function calls, and each would change some state in the current figure. Here, we call different `ggplot2` functions that create S3 objects with special behavior (more information about S3 objects in subsection \@ref(using-s3-objects)), and then we "add" (i.e. we use the `+` operator) them together.
This new design is not to encourage you to think about S3 object-oriented systems. Rather, it is to get you thinking about making visualizations using the "grammar of graphics" [@gog]. `ggplot2` makes use of its own specialized vocabulary that is taken from this book. As we get started, I will try to introduce some of this vocabulary slowly.
The core function in this library is the [`ggplot()` function](https://www.rdocumentation.org/packages/ggplot2/versions/3.3.5/topics/ggplot). This function initializes figures; it is the function that will take in information about *which* data set you want to plot, and *how* you want to plot it. The raw data is provided in the first argument. The second argument, `mapping=`, is more confusing. The argument should be constructed with the `aes()` function. In the parlance of `ggplot2`, `aes()` constructs an **aesthetic mapping**. Think of the "aesthetic mapping" as stored information that can be used later on--it "maps" data to visual properties of a figure.
Consider this first example by typing the following into your own console.
```{r, collapse = TRUE, eval = FALSE}
library(ggplot2)
ggplot(mpg, aes(x = displ, y = hwy))
```
You'll notice a few things about the code and the result produced:
1. No geometric shapes show up!
2. A Cartesian coordinate system is displayed, and the x-axis and y-axis were created based on aesthetic mapping provided (confirm this by typing `summary(mpg$displ)` and `summary(mpg$hwy)`).
3. The axis labels are taken from the column names provided to `aes()`.
To display geometric shapes (aka *geoms* in the parlance of `ggplot2`), we need to add [**layers**](https://ggplot2-book.org/toolbox.html#toolbox) to the figure. "Layers" is quite a broad term--it does not only apply to geometric objects. In fact, in `ggplot2`, a layer can be pretty much anything: raw data, summarized data, transformed data, annotations, etc. However, the functions that add geometric object layers usually start with the prefix `geom_`. In RStudio, after loading `ggplot2`, type `geom_`, and then press `<Tab>` (autocomplete) to see some of the options.
Consider the function [`geom_point()`](https://www.rdocumentation.org/packages/ggplot2/versions/3.3.5/topics/geom_point). It too returns an S3 instance that has specialized behavior. In the parlance of `ggplot2`, it adds a [scatterplot](https://ggplot2-book.org/getting-started.html#basic-use) layer to the figure.
```{r, collapse = TRUE, fig.cap = "A Second Scatterplot", fig.align='center', out.width='80%', collapse = TRUE}
library(ggplot2)
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()
```
Notice that we did not need to provide any arguments to `geom_point()`--the aesthetic mappings were used by the new layer.
There are *many* types of layers that you can add, and you are not limited to any number of them in a given plot. For example, if we wanted to add a title, we could use the `ggtitle()` function to add a title layer. Unlike `geom_point()`, this function will need to take an argument because the desired title is not stored as an aesthetic mapping. Try running the following code on your own machine.
```{r, collapse = TRUE, eval=FALSE}
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
ggtitle("my favorite scatterplot")
```
Additionally, notice that the same layer will behave much differently if we change the aesthetic mapping.
```{r, collapse = TRUE, fig.cap = "Adding Some Color", fig.align='center', out.width='80%', collapse = TRUE}
ggplot(mpg, aes(x = displ, y = hwy, color = manufacturer)) +
geom_point() +
ggtitle("my favorite scatterplot")
```
If we want tighter control on the aesthetic mapping, we can use [**scales**](https://ggplot2-book.org/scales.html#scales). Syntactically, these are things we "add" (`+`) to the figure, just like layers. However, these scales are constructed with a different set of functions, many of which start with the prefix `scale_`. We can change attributes of the axes like this.
```{r, collapse = TRUE, fig.cap = "Changing Scales", fig.align='center', out.width='80%', collapse = TRUE}
base_plot <- ggplot(mpg,
aes(x = displ, y = hwy, color = manufacturer)) +
geom_point() +
ggtitle("my favorite scatterplot")
base_plot + scale_x_log10() + scale_y_reverse()
```
We can also change plot colors with scale layers. Let's add an aesthetic called `fill=` so we can use colors to denote the value of a numerical (not categorical) column. This data set doesn't have any more unused numerical columns, so let's create a new one called `score`. We also use a new geom layer from a function called `geom_tile()`.
```{r, collapse = TRUE, fig.cap = "Changing the Fill", fig.align='center', out.width='80%'}
mpg$score <- 1/(mpg$displ^2 + mpg$hwy^2)
ggplot(mpg, aes(x = displ, y = hwy, fill = score )) +
geom_tile()
```
If we didn't like these colors, we could change them with a scale layer. Personally, I like this one.
```{r, collapse = TRUE, fig.cap = "Changing the Fill Again", fig.align='center', out.width='80%', collapse = TRUE}
mpg$score <- 1/(mpg$displ^2 + mpg$hwy^2)
ggplot(mpg, aes(x = displ, y = hwy, fill = score )) +
geom_tile() +
scale_fill_viridis_b()
```
There are many to choose from, though. Try to run the following code on your own to see what it produces.
```{r, collapse = TRUE, eval = FALSE}
mpg$score <- 1/(mpg$displ^2 + mpg$hwy^2)
ggplot(mpg, aes(x = displ, y = hwy, fill = score )) +
geom_tile() +
scale_fill_gradient2()
```
## Plotting with Matplotlib
Matplotlib\index{Matplotlib} [@Hunter:2007] is a third-party visualization library in Python.\index{Matplotlib} It is the oldest and most heavily-used, so it is the best way to start making graphics in Python, in my humble opinion. It also comes installed with Anaconda.
This short introduction borrows heavily from the myriad of [tutorials](https://matplotlib.org/stable/tutorials/index.html) on Matplotlib's website. I will start off making a simple plot, and commenting on each line of code. If you're interested in learning more, [@py_ds_handbook] and [@pandas_guy] are also terrific resources.
::: {.rmd-details data-latex=""}
You can use either "pyplot-style" (e.g. `plt.plot()`) or "object-oriented-style" to make figures in Matplotlib. Even though using the first type is faster to make simple plots, I will only describe the second one. It is the recommended approach because it is more extensible. However, the first one resembles the syntax of MATLAB. If you're familiar with MATLAB, you might consider learning a little about the first style, as well.
:::
```{python, collapse = TRUE, fig.cap = "Another Simple Histogram", out.width='80%', fig.align='center', collapse = TRUE}
import matplotlib.pyplot as plt # 1
import numpy as np # 2
fig, ax = plt.subplots() # 3
_ = ax.hist(np.random.normal(size=1000)) # 4
plt.show() # 5
```
In the first line, we import the `pyplot` submodule of `matplotlib`. We rename it to `plt`, which is short, and will save us some typing. Calling it `plt` follows the most popular naming convention.
Second, we import Numpy in the same way we always have. Matplotlib is written to work with Numpy arrays. If you want to plot some data, and it isn't in a Numpy array, you should convert it first.
Third, we call the `subplots()` function, and use *sequence unpacking* to unpack the returned container into individual objects without storing the overall container. "Subplots" sounds like it will make many different plots all on one figure, but if you look at the [documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html#matplotlib-pyplot-subplots) the number of rows and columns defaults to one and one, respectively.
`plt.subplots()` returns a [`tuple`](https://docs.python.org/3.3/library/stdtypes.html?highlight=tuple#tuple)^[We didn't talk about `tuple`s in chapter 2, but you can think of them as being similar to `list`s. They are containers that can hold elements of different types. There are a few key differences, though: they are made with parentheses (e.g. `('a')` ) instead of square brackets, and they are immutable instead of mutable.]\index{tuples in Python} of two things: a `Figure` object, and one or more `Axes` object(s). These two classes will require some explanation.
1. A [`Figure` object](https://matplotlib.org/stable/api/figure_api.html#matplotlib.figure.Figure) is the overall visualization object you're making. It holds onto all of the plot elements. If you want to save all of your progress (e.g. with `fig.savefig('my_picture.png')`), you're saving the overall `Figure` object.
2. One or more [`Axes` objects](https://matplotlib.org/stable/api/axes_api.html#the-axes-class) are contained in a `Figure` object. Each is "[what you think of as 'a plot'](https://matplotlib.org/stable/tutorials/introductory/usage.html#axes)." They hold onto two `Axis` objects (in the case of 2-dimensional plots) or three (in the case of 3-dimensional arguments). We are usually calling the methods of these objects to effect changes on a plot.
In line four, we call the [`hist()` method](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.hist.html#matplotlib.axes.Axes.hist) of the `Axes` object called `ax`. We assign the output of `.hist()` to a variable `_`. This is done to suppress the printing of the method's output, and because this variable name is a Python convention that signals the object is temporary and will not be used later in the program. There are many more plots available than plain histograms. Each one has its own method, and you can peruse the options in [the documentation](https://matplotlib.org/stable/api/axes_api.html#plotting).
If you want to make figures that are more elaborate, just keep calling different methods of `ax`. If you want to fit more subplots to the same figure, add more `Axes` objects. Here is an example using some code from [one of the official Matplotlib tutorials](https://matplotlib.org/stable/tutorials/introductory/usage.html#the-object-oriented-interface-and-the-pyplot-interface).
```{python, collapse = TRUE, out.width='80%', fig.cap = "Side-By-Side Line Plots in Matplotlib", fig.align='center'}
# x values grid shared by both subplots
x = np.linspace(0, 2*np.pi, 100)
# create two subplots...one row two columns
fig, myAxes = plt.subplots(1, 2) # kind of like par(mfrow=c(1,2)) in R
# first subplot
myAxes[0].plot(x, x, label='linear') # Plot some data on the axes.
myAxes[0].plot(x, x**2, label='quadratic') # Plot more data
myAxes[0].plot(x, x**3, label='cubic') # ... and some more.
myAxes[0].set_xlabel('x label') # Add an x-label to the axes.
myAxes[0].set_ylabel('y label') # Add a y-label to the axes.
myAxes[0].set_title("Simple Plot") # Add a title to the axes.
myAxes[0].legend() # Add a legend.
# second subplot
myAxes[1].plot(x,np.sin(x), label='sine wave')
myAxes[1].legend()
plt.show()
```
## Plotting with Pandas
Pandas\index{Pandas} provides several `DataFrame` and `Series` methods that perform plotting. These methods are mostly just wrapper functions around Matplotlib code. They are written for convenience, so generally speaking, plotting can be done more quickly with Pandas compared to Matplotlib. Here I describe a few options that are available, and [the documentation](https://pandas.pydata.org/docs/user_guide/visualization.html#chart-visualization) provides many more details for the curious reader.
The [`.plot()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html) method is very all-encompassing because it allows you to choose between many different plot types: a line plot, horizontal and vertical bar plots, histograms, boxplots, density plots, area plots, pie charts, scatterplots and hexbin plots. If you only want to remember one function name for plotting in Pandas, this is it.
If you already have Pandas `import`ed, you can make good-looking plots in just one line of code. The default plot type for `.plot()` is a line plot, so there tends to be less typing if you're working with time series data.
```{python, collapse = TRUE, out.width='80%', fig.cap = "Adjusted Closing Prices for a Stock Price Index", fig.align='center'}
import pandas as pd
df = pd.read_csv("data/gspc.csv")
df.head()
df['GSPC.Adjusted'].plot()
```
Choosing among nondefault plot types can be done in a variety of ways. You can either use the `.plot` accessor data member of a `DataFrame`, or you can pass in different strings to `.plot()`'s `kind=` parameter. Third, some plot types (e.g. boxplots and histograms) have their own dedicated methods.
```{python, collapse = TRUE, out.width='80%', fig.cap = "Three Ways to Get the Same Histogram", fig.align='center'}
df['returns'] = df['GSPC.Adjusted'].pct_change()
df['returns'].plot(kind='hist')
# same as df['returns'].plot.hist()
# same as df['returns'].hist()
```
There are also [several freestanding plotting functions](https://pandas.pydata.org/docs/user_guide/visualization.html#visualization-tools) (not methods) that take in `DataFrame`s and `Series` objects. Each of these functions is typically imported individually from the `pandas.plotting` submodule.
The following code is an example of creating a "lag plot," which is simply a scatterplot between a time series' lagged and nonlagged values. The primary benefit of this function over `.plot()` is that this function does not require you to construct an additional column of lagged values, and it comes up with good default axis labels.
```{python, collapse = TRUE, out.width='80%', fig.cap = "Lag Plot of Stock Index Returns", fig.align='center'}
from pandas.plotting import lag_plot
lag_plot(df['returns'])
```
## Exercises
### R Questions
1.
The density of a particular bivariate Gaussian distribution is
$$
f(x,y) = \frac{1}{2 \pi} \exp\left[ -\frac{x ^2 + y^2}{2} \right] \tag{1}.
$$
The random elements $X$ and $Y$, in this particular case, are independent, each have unit variance, and zero mean. In this case, the marginal for $X$ is a mean $0$, unit variance normal distribution:
$$
g(x) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{x ^2 }{2} \right] \tag{2}.
$$
a) Generate two plots of the bivariate density. For one, use `persp()`. For the other, use `contour()`.
b) Generate a third plot of the univariate density.
2.
Consider again the Militarized Interstate Disputes (v5.0) [@mid5] data set from [The Correlates of War Project](https://correlatesofwar.org/). A description of this data was given in the previous chapter. Feel free to re-use the code you used in the related exercise.
a) Use a scatterplot to display the relationship between the maximum duration and the end year. Plot each country as a different color.
### Python Questions
1.
Reproduce figure \@ref(fig:spline-plot)) in section \@ref(map) that displays the simple "spline" function. Feel free to use any of the visible code in the text.
2.
The "4 Cs of diamonds"--or the four most important factors that explain a diamond's price--are cut, color, clarity and carat. How does the price of diamonds vary as these three factors vary?
+ Consider the data set `diamonds.csv` from [@ggplot2] and plot the `price` against `carat`.
+ Visualize the other categorical factors simultaneously. Determine the most elegant way of visualizing all four factors at once and create a plot that does that. Determine the worst way to visualize all factors at once, and create that plot as well.