-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathsurvival_analysis.Rmd
234 lines (165 loc) · 7.73 KB
/
survival_analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
---
title: 'Survival Analysis'
author: 'P. Schwarz'
date: "21/09/2021"
output:
pdf_document:
extra_dependencies: ["xcolor"]
html_document: default
word_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Data
The surgery data was split between two files and contains the following columns
Variable name | Description
------------- | --------------------------------------------------------------------
id | Patient ID
surgery_date | Date of surgery
event_date | Date of event
event | Yes=Death, No=Censored due to end of study
T|1=Surgery procedure A, 0= other surgery procedure
inflammation | Inflammation score; higher score is more inflammation
bmi| Body Mass Index (BMI)
age| Age, in years, at date of surgery
hospital | Hospital ID
volume| Hospital volume; mean number of patients undergoing surgery at the hospital annually
surgery_year| Year of surgery
prior_treatment| 1=No prior treatment for the disease, 2=Prior treatment A, 3=Prior treatment A+B
srh|Self reported health status; higher score is better health
surgery_type|1= Surgery type A, 2=Surgery type B, 3=Surgery type C
severity|Severity of disease as classified by a physician; higher score is more severe
sex|Sex
technique|Surgery technique; Open or Keyhole
dead90|Yes=Dead within 90 days from date of surgery, No=Not dead within 90 days from date of surgery
The goal is to conduct a survival analysis and a Cox proportional hazards regression based on the data sets.
\newpage
## Data Cleaning
Initially, the data had to be merged and cleaned. The two files provided were merged by their `id` column. The files both came with headers, but different character delimiters.
```{r, echo=FALSE}
s1 <- read.table("surgery1.txt", header = T, sep = ";")
s2 <- read.table("surgery2.txt", header = T, sep = ",")
surgery <- merge(s1, s2, by='id' )
```
```{r, include=FALSE}
#load all packages required
#install.packages("VIM")
library(VIM)
library(mice)
library(tidyverse)
library(survival)
library(survminer)
library(lubridate)
VIM::aggr(surgery)
```
An analysis of missing data was conducted, as is shown in the plot above, in which `severity` is shown as the column with more than 15% missing data points. After investigating the variable `technique` the missing values were re-coded to show as `NA` and therefore missing as well.
```{r, echo=FALSE, include=FALSE}
surgery <- surgery %>%mutate_if(is.character, as.factor)
levels(surgery$technique)
surgery$technique[surgery$technique ==""] <- "NA"
surgery$technique <- droplevels(surgery$technique)
levels(surgery$technique)
```
After this procedure the missingness within the data set was assessed again.
```{r, echo=FALSE}
VIM::aggr(surgery)
```
The plot above shows now also the missing values in the `technique` column.
Further data cleaning was conducted by transforming `inflammation`, `prior_treatment`, `srh` and `severity` into ordered factors. `T`, `hospital` and `surgery_type` were transformed into regular factors. All of these were initially classified as a different type, which could cause problems in the analysis of categorical variables.
Additionally, the variables `surgery_date` and `event_date` were transformed to dates.
```{r,echo=FALSE, include=FALSE}
surgery <- surgery %>%
mutate_at(c("inflammation", "prior_treatment", "srh", "severity"), as.ordered) %>%
mutate_at(c("T", "hospital", "surgery_type"), as.factor) %>%
mutate( surgery_date = ymd(surgery_date)) %>%
mutate( event_date = mdy(event_date))
```
A new variable called `severity_ind` was created based on the value in the column `severity`.
When `severity` is larger than `2` `severity_ind` takes value `1` and `0` if `severity` is lower.
```{r, include=FALSE}
surgery <- surgery %>%
mutate(severity_ind = factor(case_when(severity > 2 ~ "1", severity <= 2 ~ "0")))
```
Another new variable `Y` was added that contains the difference between `event_date` and `surgery_date` and will be used as the basis for the survival analysis later.
```{r, include=FALSE}
surgery <- surgery %>%
mutate(Y = difftime(event_date , surgery_date))
```
Lastly `surgery_date` and `event_date` were removed from the data set because the new variable made them obsolete.
```{r, include=FALSE}
new_data <- surgery%>%
select(-surgery_date, -event_date)
```
\newpage
## Dealing With Missing Data
To deal with the missing data a function was created that allows the deal with the missing data in 3 ways. The standard setting is to simply drop empty rows from the data set. Alternatively, hotdeck imputation or imputation based on the `mice` package can be conducted.
```{r, echo=FALSE}
handle_missing <- function(data, type = 'dropna', ...) {
if (type == 'dropna') {
imp <- data %>% drop_na()
}
else if (type =='hotdeck'){
imp <- VIM::hotdeck(data)
}
else if (type =='mice'){
imp <- complete(mice(data, method = c("norm", ...)), "broad")
}
return(imp)
}
```
```{r, echo=FALSE}
survival_data <- handle_missing(new_data, type = "hotdeck")
aggr(survival_data)
```
The missing data was imputed using hotdeck imputation. The graph above shows how the missing values in the data set are gone. The resulting data was used for a survival analysis based on the time between the surgery and the end of the observed time frame or the death of the patient. The strata are based on the two different treatments surgery A and another type of surgery. To conduct this procedure the time and the event variable were re-coded to be numeric values.
\newpage
## Survival Analysis
```{r, include=FALSE}
fit <- survfit(Surv(as.numeric(survival_data$Y) , as.numeric(survival_data$event)) ~ survival_data$`T`)
```
```{r, echo=FALSE , fig.width =6 ,fig.height = 6}
p <- ggsurvplot(
fit,
data = survival_data,
size = 1, # change line size
palette =
c("#E7B800", "#2E9FDF"),# custom color palettes
conf.int = TRUE, # Add confidence interval
pval = TRUE, # Add p-value
risk.table = TRUE, # Add risk table
risk.table.col = "strata",# Risk table color by groups
legend.labs =
c("Another surgery", "Surgery A"), # Change legend labels
risk.table.height = 0.25, # Useful to change when you have multiple groups
ggtheme = theme_bw() # Change ggplot2 theme
)
p
```
The result of the analysis is represented on the plot above.It can be seen that the comparison between the two surgery methods shows that surgery A results in a higher probability of patient survival in the time frame of the study. The difference between the two methods is statistically significant with a p value of 0.0013.
\newpage
## Cox proportional hazards regression
Based on the the survival model a Cox proportional hazards regression was performed using the following model specification:
$$
h_i(t)=exp(\beta_1T+\beta_2inflammation+\beta_3srh+\beta_4 surgery\_type+\beta_5 sex+\beta_6bmi+\beta_7age)h_0(t)
$$
```{r, echo=FALSE}
cox <- coxph(Surv(as.numeric(survival_data$Y) , as.numeric(survival_data$event)) ~ (survival_data$`T` + inflammation + `srh` + `surgery_type` + `sex` + `bmi` + `age`), data = survival_data)
summary(cox )
```
The summary of the survival model above provides an exponentiated coefficient for the treatment variable of 0.9471.
The 95% confidence interval for the treatment variable is [0.8379, 1.0705].
All of the covariates except the inflammation and the treatment are statistically significant with a p-value lower than 0.05.
\newpage
```{r, echo=FALSE}
cox.zph(cox)
```
As the table above shows, the test for proportional hazards is not statistically significant for any of the covariates, nor the global test. Proportional hazards can be assumed.
```{r, fig.width =8 ,fig.height = 8, echo=FALSE}
ggcoxzph(cox.zph(cox))
```
```{r, fig.width =6 ,fig.height = 6}
ggcoxdiagnostics(cox,
type = "schoenfeld"
)
```