---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  dpi = 300,
  warning = FALSE,
  echo = FALSE,
  comment = "#>"
)
library(table.glue)
targets::tar_load(
  names = c(bm_pred_time_viz,
            bm_pred_model_viz,
            bm_vi_viz)
)
```
# aorsf-bench
<!-- badges: start -->
<!-- badges: end -->
The goal of aorsf-bench is to introduce and evaluate

1. a method to increase the computational efficiency of (i.e., accelerate) the oblique random survival forest (RSF), and
1. a method to estimate the importance of individual predictor variables with the oblique RSF.

The entire project is summarized in a paper: `paper/arxix/main.pdf`.
## tl;dr
- We made oblique RSFs faster and also developed a new method to estimate the importance of individual predictors with them. These methods are available in the `aorsf` R package (a short usage sketch appears after the figures below).
- We find that the accelerated oblique RSF (`aorsf-fast` in the paper) is fast and accurate. In a benchmark with 35 different risk prediction tasks, `aorsf-fast` was faster than all of the learners we analyzed except for penalized Cox regression models (`glmnet-cox` in the paper). __Figure__: Distribution of time taken to fit a prediction model and compute predicted risk. The median time, in seconds, is printed for each learner and annotated by a vertical line.
```{r, fig-bm-time, fig.align='center', fig.width=8, fig.height=8}
bm_pred_time_viz$fig
```
- We find that `aorsf-fast` has the best index of prediction accuracy out of all the learners we evaluated. __Figure__: Expected differences in index of prediction accuracy between the accelerated oblique random survival forest and other learning algorithms. A region of practical equivalence is shown by purple dotted lines, and a boundary of non-zero difference is shown by an orange dotted line at the origin.
```{r, fig-bm-ibs, fig.align='center', fig.width=9, fig.height=8}
bm_pred_model_viz$fig$ibs_scaled
```
- We find that negation variable importance improves the chances of ranking a relevant variable as more important than an irrelevant variable when estimating variable importance with an oblique RSF. __Figure__: Concordance statistic for assigning higher importance to relevant versus irrelevant variables. Text appears in rows where negation importance obtained the highest concordance, showing the absolute and percent improvement over the second-best technique.
```{r, fig-bm-vi, fig.align='center', fig.width=8, fig.height=10}
bm_vi_viz$fig
```
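
As a quick usage sketch of both methods: the accelerated fitter is the default in `aorsf`, and negation importance is available through `orsf_vi_negate()`. The example below uses `pbc_orsf`, a dataset shipped with the package, and is illustrative rather than part of the benchmark:

```{r aorsf-example, echo=TRUE, eval=FALSE}
library(aorsf)

# fit an oblique RSF with the accelerated (default) fitting routine,
# i.e., aorsf-fast in the paper
fit <- orsf(data = pbc_orsf,
            formula = Surv(time, status) ~ . - id)

# estimate negation importance for each predictor
orsf_vi_negate(fit)
```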
## Reproducing this paper
We use `targets` to coordinate our numerical experiments, so our results on publicly available data can be reproduced by cloning this repo, restricting the `targets` pipeline to the publicly available datasets, and then running `tar_make()`. Below are the steps involved:
1. Clone our repo and open the associated RStudio project.
2. Open `packages.R` and ensure you have installed the R packages listed in this file; one way to check is sketched below.
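
A minimal sketch of one way to do this, assuming `packages.R` attaches each package with a plain `library()` call:

```{r check-packages, echo=TRUE, eval=FALSE}
# find library() calls in packages.R and install anything missing
# (assumes one package per library() call, with unquoted names)
lib_calls <- grep("^library\\(", readLines("packages.R"), value = TRUE)
pkgs <- gsub("^library\\(([^)]+)\\).*$", "\\1", lib_calls)
install.packages(setdiff(pkgs, rownames(installed.packages())))
```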
3. In `_targets.R`, edit the vector of datasets passed into our `targets` pipeline so that it only contains publicly available data. Below is a code chunk showing which datasets you need to comment out to request only the publicly available ones:
```{r data-to-omit, echo=TRUE}
data_source = c(
  "veteran",
  "colon_recur",
  "colon_acm",
  "pbc_orsf",
  "time_to_million",
  "gbsg2",
  "peakV02",
  "flchain",
  "nafld",
  "rotterdam_recur",
  "rotterdam_acm",
  "actg_aids",
  "actg_death",
  # "guide_it_cvd",
  # "guide_it_hfhosp",
  "breast",
  # "sprint_cvd",
  # "sprint_acm",
  "nki",
  "lung",
  "lung_ncctg",
  "follic_death",
  "follic_relapse",
  "mgus2_death",
  "mgus2_pcm"
  # "mesa_hf",
  # "mesa_chd",
  # "mesa_stroke",
  # "mesa_death",
  # "aric_hf",
  # "aric_chd",
  # "aric_stroke",
  # "aric_death",
  # "jhs_stroke",
  # "jhs_chd"
)
```
4. Create the `nafld` data by running `nafld_build()`, an R function included in `R/nafld_build.R` that combines several datasets from the `survival` package into the dataset we analyzed in the current study. An example of how to do this is below:
```{r, eval=FALSE}
# load all R functions in the R directory
R.utils::sourceDirectory("R")
# load all relevant packages for this project
source("packages.R")
# run nafld_build() (takes several minutes)
nafld_build()
```
5. Run `targets::tar_make()` to build the pipeline, but be aware that building it as-is will take a very long time.

To make the process run quickly, we recommend setting `run_seed` to `1:3` in `_targets.R`, which requests only 3 replications of Monte-Carlo cross-validation (a sketch of this edit follows the chunk below). Also in `_targets.R`, use a subset of the faster learners by commenting out the slower ones:
```{r models-to-omit, echo=TRUE}
# slower model fitters are commented out
model_fitters <- c(
  'aorsf_fast',
  'aorsf_cph',
  # 'aorsf_random',
  # 'aorsf_net',
  # 'rotsf',
  # 'rsfse',
  # 'cif',
  'cox_net',
  # 'coxtime',
  # 'obliqueRSF',
  'xgb_cox',
  # 'xgb_aft',
  'randomForestSRC',
  'ranger'
)
```
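
For reference, the quick-run setting described in step 5 might look like the sketch below in `_targets.R` (the surrounding pipeline code is omitted). After saving both edits, build the reduced pipeline:

```{r quick-run, echo=TRUE, eval=FALSE}
# in _targets.R: request only 3 replications of Monte-Carlo cross-validation
run_seed <- 1:3

# then, from the project root, run the pipeline
targets::tar_make()
```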