<style>@import url(style.css);</style>
[Introduction to Data Analysis](index.html "Course index")
# 4.3. Practice
[bshor-data]: http://research.bshor.com/2012/10/31/individual-2012-congressional-candidate-scores/ "Individual 2012 Congressional Candidate Scores (Boris Shor)"
[bshor-iep]: http://research.bshor.com/2012/10/31/generating-the-2012-congressional-candidate-scores/ "Generating the 2012 Congressional Candidate Scores (Boris Shor)"
[bshor-detail]: http://research.bshor.com/2012/10/31/scoring-the-2012-congressional-candidates/ "Graphs of the 2012 Congressional Candidates (Boris Shor)"
[cb]: http://colorbrewer2.org/ "ColorBrewer 2.0 (Cynthia Brewer)"
[cs-split-apply-combine]: http://www.stat.cmu.edu/~cshalizi/statcomp/lectures/12/lecture-12.pdf "Lecture 12: Split/Apply/Combine with plyr (Cosma Shalizi)"
[ds-plyr]: http://is-r.tumblr.com/post/33765462561/the-distribution-of-ideology-in-the-u-s-house-with "The distribution of ideology in the U.S. House (with plyr) (David Sparks)"
[dw-source]: http://voteview.org/dwnominate.asp
[gs-aggr]: https://gastonsanchez.wordpress.com/2012/06/28/5-ways-to-do-some-calculation-by-groups/ "5 ways to do calculations by group (Gaston Sanchez)"
[hw-plyr]: http://www.jstatsoft.org/v40/i01/paper "The Split-Apply-Combine Strategy for Data Analysis (Hadley Wickham)"
[jb-summarize]: http://www.slideshare.net/jeffreybreen/grouping-summarizing-data-in-r "Grouping & Summarizing Data in R (Jeffrey Breen)"
[ns-apply]: http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/ "A brief introduction to “apply” in R (Neil Saunders)"
[so-apply]: http://stackoverflow.com/a/7141669/635806 "R Grouping functions: sapply vs. lapply vs. apply. vs. tapply vs. by vs. aggregate vs… (StackOverflow)"
<p class="info"><strong>Instructions:</strong> this week's exercise is called <code><a href="4_congress.R">4_congress.R</a></code>. Download or copy-paste that file into a new R script, then open it and run it with the working directory set as the <code>IDA</code> folder. If you download the script, make sure that your browser preserved its <code>.R</code> file extension.</p>
```{r run-exercise, include = FALSE, results = 'hide'}
source("code/4_congress.R")
```
This week's applications combine `plyr` and `ggplot2` functions to show how vectorization helps to aggregate and reshape data in R. We do not need to cover [all options][so-apply], but we can survey some essential ones, as [Neil Saunders][ns-apply] and [Jeffrey Breen][jb-summarize] did in tutorials that help to realize how useful these operations are for manipulating datasets in R.
The exercise looks at estimates of Congressional ideology.
```{r iep-plot-auto, echo = FALSE, fig.height = 7}
g
```
## Grouping observations to plot
The data for our first application are estimates of Congressional ideology compiled by [Boris Shor][bshor-data] for the U.S. House of Representatives in 2012. The data were downloaded in XLSX format and saved to a plain text comma-separated file. The variables in the dataset are documented in the [online codebook][bshor-data] and stored in a `data.frame` object:
```{r bshor-data}
# Check result.
head(data[, 1:9])
```
We have now loaded some data where each row is a single observation, namely a member of the U.S. House of Representatives in 2012. Each member has been assigned an [ideal point estimate][bshor-iep] of his or her ideological stance in Congress, from liberal (-) to conservative (+). This methodology provides a measure of [party polarization][bshor-detail] in Congress, if we aggregate members by party.
Aggregating by party involves the following steps: determine all possible values taken by the `party` variable (limiting ourselves to the first two in this case), get the data for just one party at a time, and compute the average ideological score in that group. This whole operation can be tediously carried out with a loop, as below, but that is precisely the kind of approach that we will avoid later on.
```{r bshor-naive-loop}
# The naive approach: loop over levels, subset the data, compute the scores.
for(i in levels(data$party)) {
  # Create a subset of the data for that party.
  d = subset(data, party == i)
  # Print the party name, its number of observations and its mean ideal point estimate.
  cat(i, nrow(d), mean(d$score, na.rm = TRUE), "\n")
}
```
This loop first determines the vector of party names `r levels(data$party)` (Democrat, Republican, Independent) and then painfully extracts the number of observations and mean ideological score for each party. This type of looping is easily avoided by using the vector of party names to split the data, apply a mean function to the split parts and combine the results back together:
[![Split-Apply-Combine, by Cosma Shalizi "http://www.stat.cmu.edu/~cshalizi/statcomp/12/lectures/12/lecture-12.pdf"](images/split-apply-combine-shalizi.png)][cs-split-apply-combine]
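The three steps can also be written out one at a time in base R. The sketch below is only an illustration: the `toy` data frame and its values are made up and stand in for the real dataset, and the object names `parts`, `means` and `result` are hypothetical.

```r
# Toy data (hypothetical values, not the actual ideal point estimates).
toy <- data.frame(
  party = factor(c("D", "D", "R", "R", "R", "I")),
  score = c(-1.0, -0.5, 0.8, 1.2, 1.0, 0.1)
)

# Split: one vector of scores per party.
parts <- split(toy$score, toy$party)

# Apply: compute the mean of each part.
means <- lapply(parts, mean)

# Combine: collapse the list into a named numeric vector.
result <- unlist(means)
result
```

Each intermediate object can be inspected on its own, which is the main reason to spell the steps out before switching to functions that perform all three at once.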
The exercise shows how to use the `tapply()` function, which applies a function such as `mean` (third argument) to the ideological scores (first argument) within each group of observations formed by party membership (second argument). The `tapply()` syntax is pretty straightforward, and the same aggregation can be written with two alternative syntaxes for data frames, `by()` and `aggregate()`:
```{r bshor-tapply}
# Simple aggregation with tapply().
tapply(data$score, data$party, mean)
# The by() version for data frames with factors.
by(data$score, data$party, mean)
# The aggregate() version with formula notation.
aggregate(score ~ party, data = data, FUN = "mean")
```
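One detail worth checking when comparing these functions is how they treat missing values. The sketch below uses a made-up `toy` data frame with one missing score (all names and values are assumptions for illustration) to show that `tapply()` passes extra arguments such as `na.rm = TRUE` to the function it applies, while the formula interface of `aggregate()` drops incomplete rows by default.

```r
# Toy data (hypothetical values) with one missing score.
toy <- data.frame(
  party = factor(c("D", "D", "R", "R", "I")),
  score = c(-1.0, -0.5, 0.8, NA, 0.1)
)

# Without na.rm, a group that contains an NA returns NA.
raw <- tapply(toy$score, toy$party, mean)

# Arguments placed after the function are passed on to it.
with_tapply <- tapply(toy$score, toy$party, mean, na.rm = TRUE)

# The formula method of aggregate() omits rows with missing values
# by default (na.action = na.omit).
with_aggregate <- aggregate(score ~ party, data = toy, FUN = mean)

with_tapply
with_aggregate
```

With the missing score removed, both approaches return the same group means, one as a named array and the other as a data frame.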
We can decompose what `tapply()` is doing by calling another function of the `apply()` family that makes each step explicit. In the `sapply()` example below, we write an anonymous function that extracts the mean ideological score of one party after the other in the vector of unique party names. This example shows how much simplification work is being done in a command like `tapply()`.
```{r bshor-lapply}
sapply(levels(data$party), function(x) {
  # Row positions of the members of party x.
  this.party = which(data$party == x)
  # Mean ideological score for these rows.
  mean(data$score[this.party])
})
```
Aggregating by groups also works visually. We will see this several times from next week onwards, when we look into `ggplot2` syntax for plots that visually discriminate between groups. The exercise ends on an example of that principle, which stacks the distributions of Congressional ideology by party. Distributions are important statistical functions studied later in the course.
```{r bshor-distributions, fig.width = 12, fig.height = 9, include = FALSE}
# RColorBrewer codes for blue, red, gray.
party.colors = brewer.pal(9, "Set1")[c(2, 1, 9)]
# Stacked distributions, colored by party.
qplot(data = data, x = score, fill = party, colour = party,
      position = "stack", alpha = I(.75), geom = "density") +
  scale_fill_manual("Party", values = party.colors) +
  scale_colour_manual("Party", values = party.colors)
```
The graph implicitly groups observations by assigning different `fill` and `colour` attributes to each level of the `party` variable. These groups are assigned colors that were previously extracted from the [ColorBrewer][cb] palette `Set1`, using the `brewer.pal()` function of the `RColorBrewer` package to extract the nine colors of the set and select the blue (#2), red (#1) and grey (#9) tints.
## Summarizing more complex aggregate data
We continue to look at ideology in the U.S. Congress by plotting the [DW-NOMINATE][dw-source] index, using an example [adapted from David Sparks][ds-plyr]. Unlike the cross-sectional estimate for 2012 that we just used, this index is available for several years and therefore requires a more advanced reshaping strategy that splits the data by three groups: Congressional session, year, and party.
Exploring this dataset is easy if you need simple measures like raw frequencies. Here, for instance, is the number of different observations in the data for each major party affiliation, and the average DW-NOMINATE score in each group, obtained through formula notation with the `aggregate()` function.
```{r dwnominate-table-aggregate}
# Raw frequencies (N) by party.
table(dw$majorParty)
# Mean DW-NOMINATE score by party.
aggregate(dwnom1 ~ majorParty, dw, mean)
```
The exercise shows more ways to do [calculations by groups][gs-aggr] at that stage and then turns to the `ddply()` function from the `plyr` package to aggregate the observations by major party within each Congressional session, which makes for a more complex pattern than those seen so far. The crux is the `.variables` argument, which groups the observations by Congress and major party.
```{r dwnominate-ddply, tidy = FALSE, eval = FALSE}
# David Sparks' transformation to session-party measurements, using plyr for the
# ddply() function and Hmisc for the weighted wtd.quantile() function.
dw.aggregated <- ddply(.data = dw,
                       .variables = .(cong, majorParty),
                       .fun = summarise,
                       Median = wtd.quantile(dwnom1, 1/bootse1, 1/2),
                       q25 = wtd.quantile(dwnom1, 1/bootse1, 1/4),
                       q75 = wtd.quantile(dwnom1, 1/bootse1, 3/4),
                       q05 = wtd.quantile(dwnom1, 1/bootse1, 1/20),
                       q95 = wtd.quantile(dwnom1, 1/bootse1, 19/20),
                       N = length(dwnom1))
```
What happens here is a split of the data along two grouping variables, first by Congressional session, then by political party, before summarizing the scores within each group. This operation is one of seven possible transformations in such settings. Hadley Wickham's `plyr` package provides a set of commands like `ddply()` to implement these transformations, which he calls "[The Split-Apply-Combine Strategy for Data Analysis][hw-plyr]" and illustrates as follows:
![Split-Apply-Combine, by Hadley Wickham "http://www.jstatsoft.org/v40/i01/paper"](images/split-apply-combine-wickham.png)
The cube slices show the different splits that can be accomplished when you combine three variables (dimensions). In our example, the result is a dataset that contains one observation per party (Democrat, Republican, Independent) and per Congress (1st to 111th), its main value being the median DW-NOMINATE score of that group. Shown below are the results for the most recent Congress in the data:
```{r dwnominate-111th-congress}
# Median DW-NOMINATE score of parties in the 111th Congress.
dw.aggregated[dw.aggregated$cong == 111, ]
```
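To see what a two-variable split involves, the same pattern can be unpacked in base R. This is a simplified sketch rather than a replacement for the `ddply()` call: the `toy` data frame and its values are made up, only the variable names `cong`, `majorParty` and `dwnom1` are borrowed from the exercise, and the median is left unweighted instead of using `wtd.quantile()`.

```r
# Toy data (hypothetical values): two sessions, two parties.
toy <- data.frame(
  cong       = c(110, 110, 110, 110, 111, 111, 111, 111),
  majorParty = c("D", "D", "R", "R", "D", "D", "R", "R"),
  dwnom1     = c(-0.4, -0.2, 0.3, 0.5, -0.5, -0.3, 0.4, 0.6)
)

# Split: one data frame per combination of session and party.
parts <- split(toy, list(toy$cong, toy$majorParty), drop = TRUE)

# Apply: summarise each piece into a one-row data frame.
# The median is unweighted here, unlike in the ddply() call.
rows <- lapply(parts, function(d) {
  data.frame(
    cong       = d$cong[1],
    majorParty = d$majorParty[1],
    Median     = median(d$dwnom1),
    N          = nrow(d)
  )
})

# Combine: stack the one-row data frames into a single data frame.
toy.aggregated <- do.call(rbind, rows)
toy.aggregated
```

The result has one row per session-party pair, which is the shape that `ddply()` returns when `.variables` lists two grouping variables.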
The following plot applies this strategy in a `ggplot2` setting, where it is straightforward to fill and colour elements so that they reflect membership in a given group of the data. For more plots of Congressional ideology, check what David Sparks has done with reshaping to [cast, melt][ds-reshape] and [mutate][ds-mutate] a dataset containing identical data.
[ds-reshape]: http://is-r.tumblr.com/post/34556058683/ggtutorial-day-1-using-reshape "GGtutorial: Day 1 - using reshape() (David Sparks)"
[ds-mutate]: http://is-r.tumblr.com/post/34288940225/congressional-ideology-by-state "Congressional ideology by state (David Sparks)"
```{r dwnominate-plot-auto, fig.width = 9, fig.height = 6, echo = FALSE, cache = TRUE}
p
```
> __Next week__: [Clusters](/).
<!-- 050_clusters.html -->