---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```
# BradleyTerryScalable
[![Travis build status](https://travis-ci.org/EllaKaye/BradleyTerryScalable.svg?branch=master)](https://travis-ci.org/EllaKaye/BradleyTerryScalable)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/BradleyTerryScalable)](https://cran.r-project.org/package=BradleyTerryScalable)
An R package for fitting the Bradley-Terry model to pair-comparison data, to enable statistically principled ranking of a potentially large number of objects.
Given a number of items for which we have pair-comparison data, the Bradley-Terry model assigns a 'strength' parameter to each item. These can be used to rank the items. Moreover, they can be used to determine the probability that any given item will 'beat' any other given item when they are compared. Further details of the mathematical model, and the algorithms used to fit it, are available in the package vignette.
**The documentation website for this package can be found [here](https://ellakaye.github.io/BradleyTerryScalable/) (thanks to [`pkgdown`](https://github.com/hadley/pkgdown))**
## The Bradley-Terry model
Details of the Bradley-Terry model are not presented here (due to GitHub's inability to display equations). Instead, please refer to the vignette, available [here](https://ellakaye.github.io/BradleyTerryScalable/articles/BradleyTerryScalable.html) or from the R console:
```{r eval = FALSE}
vignette("BradleyTerryScalable", package = "BradleyTerryScalable")
```
## Installing the package
```{r eval = FALSE}
# installing from CRAN
install.packages("BradleyTerryScalable")
# installing from GitHub
install.packages("devtools") # if required
devtools::install_github("EllaKaye/BradleyTerryScalable", build_vignettes = TRUE)
```
```{r}
library(BradleyTerryScalable)
```
Please note that since this package contains compiled code, you will need developer tools installed if you are installing from GitHub (or otherwise building from source). Please see [https://r-pkgs.org/setup.html#setup-tools](https://r-pkgs.org/setup.html#setup-tools) for further details.
## Preparing the data
The main model-fitting function in this package is `btfit()`. This function takes as its main argument an object of class `btdata`. To create a `btdata` object, use the function `btdata(x)`.
The `x` argument to `btdata` can be one of four classes of object:
- A matrix (either a base `matrix` or a class from the `Matrix` package), dimension $K$ by $K$, where $K$ is the number of items. The $i,j$-th element is $w_{ij}$, the number of times item $i$ has beaten item $j$ (see the sketch after this list). Ties can be accounted for by assigning half a win (i.e. 0.5) to each item.
- A contingency table of class `table`, similar to the matrix described in the above point.
- An `igraph`, representing the *comparison graph*, with the $K$ items as nodes. For the edges:
- If the graph is unweighted, there should be one directed edge from node $i$ to node $j$ for each time item $i$ has beaten item $j$
- If the graph is weighted, there should be one edge from node $i$ to node $j$ if item $i$ has beaten item $j$ at least once, with the weight attribute of that edge set to the number of times $i$ has beaten $j$.
- A data frame (`data.frame` or `tibble`), with three or four columns
- If the data frame has three columns, then the first column must be the name of the first item, the second column must be the name of the second item, and the third column must be the number of times the first item has beaten the second item.
- If the data frame has four columns, then the first column must be the name of the first item, the second column must be the name of the second item, and the third column must be the number of times the first item has beaten the second item and the fourth column must be the number of times the second item has beaten the first item.
- In either of these cases, the data can be aggregated, or there can be one row per comparison.
- Ties can be accounted for by assigning half a win (i.e. 0.5) to each item.
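As a minimal sketch of the first (matrix) format, here is a small hypothetical wins matrix for three items (the item names `A`, `B`, `C` and the counts below are made up purely for illustration):
```{r eval = FALSE}
# Hypothetical 3-by-3 wins matrix: entry (i, j) is the number of times item i beat item j
wins_mat <- matrix(c(0, 2, 1,
                     1, 0, 3,
                     0, 1, 0),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
wins_btdata <- btdata(wins_mat)
```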
We anticipate that the user may have data in a three-column data frame that does not match the description of the three-column data frame above. For example, the data frame could have one row per comparison, where the third column contains a code to indicate which of the two items won, say `W1` if the item in column 1 won, `W2` if the item in column 2 won and `D` if it was a tie/draw. Alternatively, the third column could contain the win-count, but only relative to the first item, i.e. 1 if the first item wins, 0 if it loses and 0.5 if there was a draw. In either case, the `btdata` function won't know that a loss for the first item should be counted as a win for the second item.
For the cases described in the previous paragraph, the `BradleyTerryScalable` package provides the `codes_to_counts()` function, which takes such three-column data-frames and returns a four-column data frame of the required format for passing to the `btdata()` function.
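For example, if the third column holds `1` for a win by the first item, `0` for a loss and `0.5` for a draw, the conversion might look like the sketch below (the data frame here is hypothetical, and we assume numeric codes are passed in the same way as the character codes shown later; see `?codes_to_counts` for the exact interface):
```{r eval = FALSE}
# Hypothetical one-row-per-comparison data, with a numeric result code
# relative to item1: 1 = win, 0 = loss, 0.5 = draw
results <- data.frame(item1  = c("A", "A", "B"),
                      item2  = c("B", "C", "C"),
                      result = c(1, 0.5, 0))
results_4col <- codes_to_counts(results, c(1, 0, 0.5))
results_btdata <- btdata(results_4col)
```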
The `BradleyTerryScalable` package provides two toy data sets which we'll use in this demonstration:
```{r}
data(citations)
citations
data(toy_data)
toy_data
```
`citations` is in an appropriate format to pass to `btdata()`, whereas `toy_data` needs to be passed through `codes_to_counts()` first:
```{r}
citations_btdata <- btdata(citations)
toy_data_4col <- codes_to_counts(toy_data, c("W1", "W2", "D"))
toy_btdata <- btdata(toy_data_4col, return_graph = TRUE)
```
A `btdata` object is a list containing two or three elements:
- `wins`: a matrix of the form described in the first bullet point above
- `components`: a list of the fully-connected components of the comparison graph (see the third bullet point above)
- `graph`: if `return_graph = TRUE`, then the `igraph` object of the comparison graph is returned, which can be useful for visualising the data.
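The elements of a `btdata` object can be inspected in the usual way for a list, for example:
```{r eval = FALSE}
names(toy_btdata)
toy_btdata$wins        # the wins matrix
toy_btdata$components  # the fully-connected components of the comparison graph
```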
```{r toy-graph, message = FALSE, fig.align = "center"}
library(igraph)
par(mar = c(0, 0, 0, 0) + 0.1)
plot.igraph(toy_btdata$graph, vertex.size = 28, edge.arrow.size = 0.5)
```
Information about the `btdata` objects can be seen through the `summary.btdata()` method:
```{r}
summary(citations_btdata)
summary(toy_btdata)
```
Note that components of size 1 will be filtered out in the MLE fit (see next section); the model doesn't make sense for them.
`select_components()` can be used to create a subset of a `btdata` object. In our toy case, the following all give the same subset:
```{r}
toy_btdata_subset <- select_components(toy_btdata, "3")
toy_btdata_subset <- select_components(toy_btdata, function(x) length(x) == 4)
toy_btdata_subset <- select_components(toy_btdata, function(x) "Cyd" %in% x)
summary(toy_btdata_subset)
```
Alternatively, set the `subset` argument in `btfit()`.
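For example, the following sketch fits the model on the component containing "Cyd" only (assuming `subset` accepts the same kinds of argument as `select_components()`; see `?btfit`):
```{r eval = FALSE}
toy_fit_subset <- btfit(toy_btdata, a = 1, subset = function(x) "Cyd" %in% x)
```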
## Fitting the model
`summary.btdata(object)` gives information on whether or not the underlying comparison graph is fully connected. This affects the type of estimate available for the strength parameter.
- If the comparison graph is fully connected (i.e. if `Fully-connected: TRUE` is printed), then the maximum likelihood estimate (MLE) for the Bradley-Terry model exists and is finite.
- If the comparison graph is not fully connected (i.e. if `Fully-connected: FALSE` is printed), then we cannot find the MLE for the full dataset. There are two alternatives:
- Find the MLE on each fully connected component of the comparison graph (and note then that it is only meaningful to rank and compare items within the same component).
- Place a Bayesian prior on the model, then find the maximum a posteriori (MAP) estimate. This always exists and is finite. It gives a principled way of ranking and comparing all $K$ items.
The function `btfit()` requires two arguments: the data (in the form of a `btdata` object), and `a`, which controls whether the MLE or MAP estimate is returned.
- If `a = 1`, the MLE is returned: on the full dataset if the comparison graph is fully connected, or else separately on each fully connected component.
- If `a > 1`, the MAP estimate is returned, with `a` as the value of the shape parameter in the prior.
See `?btfit()` or the package vignette for more details.
```{r}
citations_fit <- btfit(citations_btdata, 1)
toy_fit_MLE <- btfit(toy_btdata, 1)
toy_fit_MAP <- btfit(toy_btdata, 1.1)
```
`btfit` objects are lists, and they are not designed to be examined directly, but to be passed to other methods.
## Methods for a `btfit` object
### `summary.btfit()`, `coef.btfit()` and `vcov.btfit()`
The `summary.btfit()` method returns a list with
- `call`: the call to `btfit()`
- `item_summary`: a data frame with one row for each item in the fit (note that this can be fewer than the number of items in the data, if there were any components of size one, or if the fit was on a subset). Items are ranked in descending order *within each component*
- `component_summary`: a data frame with one row per component in the fit.
The standard errors are *not* returned by default (since the underlying `vcov.btfit()` function can be slow for large matrices), but can be included by setting `SE = TRUE`. It is also possible to set a reference item, and to return the summary for only a subset of components (see `?summary.btfit()`).
The `coef.btfit()` method extracts the parameter estimates. This is the strength parameter, on the log scale, constrained (by default) such that the mean of the estimates is zero. By default it is a vector if `btfit()` was run on the full dataset, or a list of vectors otherwise, but there is also the possibility of returning a data frame by setting `as_df = TRUE`.
The `vcov.btfit()` method returns the variance-covariance matrix (or a list of these matrices by component), and also has `ref` and `subset` arguments (see `?vcov.btfit()`).
```{r}
summary(citations_fit)
summary(toy_fit_MLE, SE = TRUE)
coef(toy_fit_MAP)
vcov(citations_fit, ref = "JASA")
```
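The parameter estimates can also be returned as a data frame rather than a vector, as noted above:
```{r eval = FALSE}
coef(toy_fit_MLE, as_df = TRUE)
```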
### `btprob()` and `fitted.btfit()`
The `btprob` function calculates the Bradley-Terry probabilities that item $i$ beats item $j$. By default the result is a matrix if `btfit` was run on the full dataset, or a list of matrices otherwise, but there is also the possibility of returning a data frame by setting `as_df = TRUE`. The `fitted.btfit()` method functions similarly, except it returns the expected number of wins (see `?fitted.btfit()`).
```{r}
btprob(citations_fit)
fitted(toy_fit_MLE, as_df = TRUE)
```
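As with `coef.btfit()`, the probabilities can be returned as a data frame instead:
```{r eval = FALSE}
btprob(citations_fit, as_df = TRUE)
```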
### `simulate.btfit()` and `simulate_BT()`
There are two functions to simulate data from a Bradley-Terry model. The S3 method `simulate.btfit()` takes a `btfit` object *which has been fitted on one component* (either the full dataset, or a one-component subset). The underlying function `simulate_BT()` takes an `N` matrix (i.e. where the $i,j$-th element is the number of times items $i$ and $j$ have been compared) and a vector `pi`, the strength parameters of a Bradley-Terry model (note that `pi` is *not* the same as the estimates in `coef.btfit()` and `summary.btfit()`, which are on the logarithmic scale). Both functions return a `wins` matrix by default, but can also be set to return a `btdata` object instead.
For example, we can simulate 100 new datasets from the fitted model for the small `citations` dataset:
```{r}
citations_sim <- simulate(citations_fit, nsim = 100, seed = 1)
citations_sim[1:2]
```
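Each element of the returned list is a wins matrix, so a simulated dataset can be passed straight back through `btdata()` and `btfit()`, for example to explore sampling variability (a minimal sketch):
```{r eval = FALSE}
refit <- btfit(btdata(citations_sim[[1]]), a = 1)
summary(refit)
```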
As a bigger example, let's simulate a single instance of a fairly sparse tournament with 1000 items (or 'players'), and then fit the Bradley-Terry model to the resulting data:
```{r simulation-graph, warning = FALSE, message = FALSE, fig.width = 7, fig.height = 7, out.width = '97%'}
library(Matrix)
library(dplyr)
library(ggplot2)
set.seed(1989)
n_items <- 1000
## Generate at random a sparse, symmetric matrix of binomial totals:
Nvalues <- rpois(n = n_items * (n_items - 1) / 2, lambda = 1)
notzero <- Nvalues > 0
Nmatrix <- Matrix(nrow = n_items, ncol = n_items)
ij <- which(lower.tri(Nmatrix), arr.ind = TRUE)[notzero, ]
Nmatrix <- sparseMatrix(
  i = ij[, 1],
  j = ij[, 2],
  x = Nvalues[notzero],
  symmetric = TRUE,
  dims = c(n_items, n_items))
## Generate at random the (normalized to mean 1) 'player abilities':
pi_vec <- exp(rnorm(n_items) / 4)
pi_vec <- pi_vec / mean(pi_vec)
## Now generate contest outcome counts from the Bradley-Terry model:
big_matrix <- simulate_BT(pi_vec, Nmatrix, nsim = 1, seed = 1)[[1]]
big_btdata <- btdata(big_matrix)
## Fit the Bradley-Terry model to the simulated data:
the_model <- btfit(big_btdata, a = 1)
pi_fitted <- the_model$pi$full_dataset
## Plot fitted vs true abilities:
plot_df <- tibble(x = log(pi_vec[as.numeric(names(pi_fitted))]),
                  y = log(pi_fitted))
ggplot(plot_df, aes(x, y)) +
  geom_point(alpha = 0.5) +
  geom_abline() +
  xlab("true strength") +
  ylab("maximum likelihood estimate") +
  ggtitle("1000-player simulation from a Bradley-Terry model") +
  theme(plot.title = element_text(hjust = 0.5))
```
## Further information
All code for the package is available at [https://github.com/EllaKaye/BradleyTerryScalable](https://github.com/EllaKaye/BradleyTerryScalable), and a documentation website is available at [https://ellakaye.github.io/BradleyTerryScalable](https://ellakaye.github.io/BradleyTerryScalable).