PC_lab_R.Rmd

---
title: "PC Lab for Advanced Health Economic Modeling"
author: "Erasmus University Rotterdam"
output: html_document
---

# Advanced Health Economic Modeling

**Course Coordinators**: Maiwenn Al, Eline Krijkamp and Venetia Qendri
**Assignments development:** In collaboration with Bart-Jan Boverhof

---

## Important Instructions

- **Read all the text in this document carefully.**
- **Ensure you have opened the AHEM computer lab project.**

### How to Open an R Project

To open an R project, click on the `.Rproj` file in your project directory. In the upper-right corner of RStudio, you should see the name of the folder where the `.Rproj` file is stored.

### Abbreviations

- **OS**:   Overall survival
- **PFS**:  Progression-free survival
- **mito**: Mitoxantrone
- **caba**: Cabazitaxel
- **m_**:   Prefix for matrix
- **d_**:   Prefix for dataframe
- **l_**:   Prefix for list

### Document Structure

This document is divided into two main parts:

1. **Part I**
   - **What**: Demonstrates R code functions step-by-step using one dataset (Overall Survival Mitoxantrone).
   - **Aim**:  Helps you understand the code and view all intermediate steps clearly.

2. **Part II**
   - **What**: Implements loops and functions based on Part I.
   - **Aim**:  Runs the code from Part I for all datasets (PFS mito & caba, OS mito & caba).

---

## Part 0.0: Setting Up R

### Description

This section clears the environment, installs necessary packages (if required), and loads the required libraries. Additionally, it loads and cleans the data from an Excel file.

```{r setup}
rm(list = ls(all = TRUE)) # Clear all information in the environment
# Uncomment the following line if you need to install the packages
# install.packages(c("survival", "readxl", "openxlsx"))
library(survival)         # Load the survival package
library(readxl)           # Load the readxl package
library(openxlsx)         # Load the openxlsx package
# Load the data from Excel
# Ensure the Excel file is populated with the necessary information.
df_OS_mito       <- na.omit(read_excel("data/Hoyle_and_Henley_Bahl_OS_Mito.xlsm", sheet = "R data", range = cell_cols("A:F")))
df_times_OS_mito <- na.omit(read_excel("data/Hoyle_and_Henley_Bahl_OS_Mito.xlsm", sheet = "R data", range = cell_cols("G:H")))

```

---

## Part I: Step-by-Step Demonstration

### Step 1.0: Storing and Attaching Data

This section stores the `df_OS_mito` dataframe into a general variable `data` and attaches it to the R search path for easier evaluation of variables.

```{r attach_data}
data <- df_OS_mito
attach(data) # You can ignore the warning about "masked objects" 

v_names_trt  <- c("Cabazitaxel", "Mitoxantrone") # vector with the treatment names 
v_names_data <- c("PFS_caba","PFS_mito", "OS_caba", "OS_mito") # vector for the different data sets
v_names_dist <- c("exponential", "weibull", "lognormal", "loglogistic") # define the distribution we are evaluating
```

### Step 1.1: Extracting Time Points

Here, we extract and combine time points for censored and event individuals. It also includes time points for patients at risk at the last time point.

```{r extract_timepoints}
times_start <- c(rep(start_time_censor, n_censors), rep(start_time_event, n_events))
times_end   <- c(rep(end_time_censor, n_censors),   rep(end_time_event,   n_events))

# Adding times for patients at risk at the last time point
times_start <- c(times_start, rep(df_times_OS_mito$n_last_time_point, df_times_OS_mito$n_patients_at_risk))
times_end   <- c(times_end,   rep(10000, df_times_OS_mito$n_patients_at_risk))
df_times    <- cbind(times_start, times_end)
head(df_times)
```

### Step 1.2: Creating Survival Models

This section creates survival models for each distribution (exponential, Weibull, lognormal, and log-logistic) using interval censoring.

```{r create_models}
model_exp  <- survreg(Surv(times_start, times_end, type = "interval2") ~ 1, dist = "exponential")   # Exponential function
model_wei  <- survreg(Surv(times_start, times_end, type = "interval2") ~ 1, dist = "weibull")       # Weibull function
model_logn <- survreg(Surv(times_start, times_end, type = "interval2") ~ 1, dist = "lognormal")     # Lognormal function
model_logl <- survreg(Surv(times_start, times_end, type = "interval2") ~ 1, dist = "loglogistic")   # Log-logistic function
```

### Step 1.3: Calculating and Comparing AIC Values

AIC values are calculated for each model to evaluate model fit. Lower AIC values indicate better fit.

```{r calculate_aic}
AIC_exp  <- -2 * summary(model_exp)$loglik[1]  + 2 * 1
AIC_wei  <- -2 * summary(model_wei)$loglik[1]  + 2 * 2
AIC_logn <- -2 * summary(model_logn)$loglik[1] + 2 * 2
AIC_logl <- -2 * summary(model_logl)$loglik[1] + 2 * 2

v_AIC <- c(exponential = AIC_exp, 
           weibull     = AIC_wei, 
           lognormal   = AIC_logn, 
           loglogistic = AIC_logl)
v_AIC # print the AIC values

v_AIC[order(-v_AIC)] # print the values in order
```

### Step 1.4: Extracting Parameters and Creating Matrices

Intercept and log scale parameters for each model are extracted and stored in a matrix for further analysis.

```{r extract_parameters}
intercept_exp  <- summary(model_exp)$table ["(Intercept)", "Value"]
intercept_wei  <- summary(model_wei)$table ["(Intercept)", "Value"]
intercept_logn <- summary(model_logn)$table["(Intercept)", "Value"]
intercept_logl <- summary(model_logl)$table["(Intercept)", "Value"]

log_scale_wei  <- summary(model_wei)$table ["Log(scale)", "Value"]
log_scale_logn <- summary(model_logn)$table["Log(scale)", "Value"]
log_scale_logl <- summary(model_logl)$table["Log(scale)", "Value"]

v_intercept        <- c(intercept_exp, intercept_wei, intercept_logn, intercept_logl)
v_log_scale        <- c(NA,            log_scale_wei, log_scale_logn, log_scale_logl)
names(v_intercept) <- names(v_intercept) <- v_names_dist

m_model_parameters_OS_mito <- matrix(NA, nrow = 3, ncol = length(v_names_dist),
                                     dimnames = list(c("AIC", "intercept", "log(scale)"),
                                                     v_names_dist))
m_model_parameters_OS_mito["AIC",  ]       <- v_AIC
m_model_parameters_OS_mito["intercept",  ] <- v_intercept
m_model_parameters_OS_mito["log(scale)", ] <- v_log_scale

m_model_parameters_OS_mito <- round(m_model_parameters_OS_mito, 4)
m_model_parameters_OS_mito
```

### Step 1.5: Cholesky Matrices for Sensitivity Analysis

Cholesky matrices, capturing parameter variance and covariance, are created for probabilistic sensitivity analysis.

```{r cholesky_matrices}
cholesky_exp  <- t(chol(summary(model_exp)$var))
cholesky_wei  <- t(chol(summary(model_wei)$var))
cholesky_logn <- t(chol(summary(model_logn)$var))
cholesky_logl <- t(chol(summary(model_logl)$var))

l_cholesky <- list("exponential"  = cholesky_exp,
                   "weibull"      = cholesky_wei,
                   "lognormal"    = cholesky_logn,
                   "loglogistic"  = cholesky_logl)

l_cholesky
```

---

## Part II: Performing Analysis for Each Dataset

This section automates the analysis from Part I using functions and loops, applying it to multiple datasets.

---

## Step 2.0: Loading Data

### Description

This section loads data from multiple Excel files, ensuring empty rows are removed using `na.omit`. It also organizes the datasets and time points into lists for streamlined analysis.

```{r load_data}

# Load the data from the Excel files
df_PFS_caba <- na.omit(read_excel("data/Hoyle_and_Henley_PFS_Caba.xlsm",     sheet = "R data", range = cell_cols("A:F")))
df_PFS_mito <- na.omit(read_excel("data/Hoyle_and_Henley_PFS_Mito.xlsm",     sheet = "R data", range = cell_cols("A:F")))
df_OS_caba  <- na.omit(read_excel("data/Hoyle_and_Henley_Bahl_OS_Caba.xlsm", sheet = "R data", range = cell_cols("A:F")))


# Create a list with the dataframes
l_data_survival <- list("PFS_caba" = df_PFS_caba, 
                        "PFS_mito" = df_PFS_mito, 
                        "OS_caba"  = df_OS_caba, 
                        "OS_mito"  = df_OS_mito)


# Add the time of the last time point and the number of patients at risk
df_times_PFS_caba <- na.omit(read_excel("data/Hoyle_and_Henley_PFS_Caba.xlsm",     sheet = "R data", range = cell_cols("G:H")))
df_times_PFS_mito <- na.omit(read_excel("data/Hoyle_and_Henley_PFS_Mito.xlsm",     sheet = "R data", range = cell_cols("G:H")))
df_times_OS_caba  <- na.omit(read_excel("data/Hoyle_and_Henley_Bahl_OS_Caba.xlsm", sheet = "R data", range = cell_cols("G:H")))


# Create a list with the data for time points
l_times <- list("PFS_caba" = df_times_PFS_caba,
                "PFS_mito" = df_times_PFS_mito,
                "OS_caba"  = df_times_OS_caba,
                "OS_mito"  = df_times_OS_mito)

```

---

## Step 2.1: Creating Survival Models for Each Dataset

### Description

This section defines the `get_survival_parameters` function to calculate survival parameters for multiple distributions. The function processes each dataset, estimates model parameters, and organizes results into lists for further analysis.

```{r define_function}
get_survival_parameters <- function(data_survival, data_times, distributions = c("exponential", "weibull", "lognormal", "loglogistic")) {
  l_results <- l_model <- l_cholesky <- list()
  v_AIC <- v_intercept <- v_log_scale <- vector()

  m_model_parameters <- matrix(NA, nrow = 3, ncol = length(distributions),
                               dimnames = list(c("AIC", "intercept", "log(scale)"),
                                               distributions))

  times_start <- c(rep(data_survival$start_time_censor, data_survival$n_censors), rep(data_survival$start_time_event, data_survival$n_events))
  times_end   <- c(rep(data_survival$end_time_censor, data_survival$n_censors),   rep(data_survival$end_time_event,   data_survival$n_events))

  times_start <- c(times_start, rep(data_times$n_last_time_point, data_times$n_patients_at_risk))
  times_end   <- c(times_end, rep(10000, data_times$n_patients_at_risk))

  for (d in distributions) {
    model <- survreg(Surv(times_start, times_end, type = "interval2") ~ 1, dist = d)
    n_AIC <- -2 * summary(model)$loglik[1] + 2 * ifelse(d == "exponential", 1 , 2)
    n_intercept <- summary(model)$table[1]
    n_log_scale <- ifelse(d != "exponential", summary(model)$table[2], NA)
    m_cholesky  <- t(chol(summary(model)$var))

    l_results[[d]] <- list(model     = model,
                           AIC       = n_AIC,
                           intercept = n_intercept,
                           log_scale = n_log_scale,
                           cholesky  = m_cholesky)

    m_model_parameters["AIC",        which(d == distributions)] <- n_AIC
    m_model_parameters["intercept",  which(d == distributions)] <- n_intercept
    m_model_parameters["log(scale)", which(d == distributions)] <- n_log_scale

    m_model_parameters <- round(m_model_parameters, 4)
  }

  l_results$summary <- m_model_parameters
  return(l_results)
}
```

---

## Step 2.2: Applying the Function to All Datasets

### Description

This section uses loops to run the `get_survival_parameters` function on each dataset. The results are saved as a list and exported to an RDS file for later use.

```{r apply_function}
l_results_all <- list()
for (df in names(l_data_survival)) {
  l_results_all[[df]] <- get_survival_parameters(data_survival = l_data_survival[[df]], data_times = l_times[[df]])
}

saveRDS(l_results_all, file = "output/l_results_all.rds")
```

### Step 2.2.1: Print specific values from a list. 
To print values from a list in R, you can simply use the `print()` function or access specific elements using the list's structure. After running the code chunk above, we have the list called `l_results_all`. To print the entire list, use `print(l_results_all)` or simply type `l_results_all` in the console. To print specific parts or event elements, use the `$` operator for named elements, like `l_results_all$PFS_caba``, or use double square brackets with the index or name, such as `print(l_results_all[[1]])` or `print(l_results_all[["PFS caba`"]])`. You can also loop through the list using `for` loops, like `for (item in l_results_all) { print(item) }`, to print each value individually.

---

## Step 2.3: Organizing Results for Reporting - Parametric Survival Models

### Description

This section demonstrates how to extract specific results from the list and organize them into an Excel workbook for further analysis. The excel workbooks are stored in the output folder. Please look in this folder to copy and paste the results.

```{r organize_results}
library(openxlsx)
wb <- createWorkbook("AHEM student")
addWorksheet(wb, "PFS_caba", gridLines = FALSE, tabColour = "coral")
addWorksheet(wb, "PFS_mito", gridLines = FALSE, tabColour = "coral3")
addWorksheet(wb, "OS_caba",  gridLines = FALSE, tabColour = "skyblue")
addWorksheet(wb, "OS_mito",  gridLines = FALSE, tabColour = "skyblue4")

for (i in v_names_data) {
  print(i)
  print(l_results_all[[i]]$summary)
  writeData(wb, sheet = i, x = l_results_all[[i]]$summary, rowNames = TRUE)
}

saveWorkbook(wb, "output/output_parametric_survival_model_parameters.xlsx", overwrite = TRUE)
```

---

## Part III: Parametric Survival Formulas

### Description

This section calculates survival probabilities at a specified time point (15 weeks) using parametric distributions, based on results for Overall Survival (OS) with Mitoxantrone (Mito). The Parametric Survival equation are provided in the assignment instructions. 

**It is your task to use these equations with the estimated values in this assignment**

You can use R as your calculator, by writing the equation and you let R do the math. Below an example, of how you can write an equation in R. 

```{r}
# Example of a basic equation: 
# four + five to the power of 2 + the exponential of 10
example_outcome <- 4 + 5^2 + exp(10)
example_outcome # Print the example outcome
```


Alternative, you can write a function to do the calculations. The benefit of a function, is that you can repeat the calculation many times by just changing the input parameters. 

These function make use of the parameters for time, the intercept and the scale of the fitted distribution. You have to give the function the right input parameters to estimate the survival probability at that specific time point. 

We have given you one example. 

Now it is your turn to do the rest of the calculations. You can do so by writing an equation for the other distributions (weibull, lognormal and loglogistic), or you simply write the equations and let R do the math.

```{r parametric_formulas}
# Ignore this code for now, used for the solutions
# source("solution_estimate survival prob.R")

# Extract the results for OS Mito
l_OS_mito <- l_results_all$OS_mito

t_weeks <- 15           # number of weeks
t <- (t_weeks/ 52) * 12 # time in months
cat(t_weeks, "weeks is the same as", round(t, 3), "months", "\n")


### Exponential ### 
# Function to calculate survival probability using the Exponential distribution
exponential_survival <- function(t, intercept) {
  # t: time point in months at which the survival probability is estimated
  # intercept: estimate of the intercept for the survival curve (from model fit)
  
  # Calculate lambda (scale parameter) based on the intercept value.
  # Here, lambda is the exponential of the negative of the intercept.
  lambda <- exp(-intercept)  # Scale parameter for the Exponential distribution
  
  # Calculate the survival probability at time t using the Exponential survival function.
  S_t <- exp(-t * lambda)
  
  return(S_t)   # Return the calculated survival probability at time t
}

# Calculation for exponential distribution
intercept_exp   <- l_OS_mito$exponential$intercept
p_survival_exp  <- exponential_survival(t, intercept_exp)

cat("Survival probability (Exponential) at: ", round(52/t), "weeks :", p_survival_exp, "\n")


# YOUR TURN ##

### weibull ### 
#p_survival_weibull

### lognormal ### 
#p_survival_lognormal

### Exponential ### 
#p_survival_loglogistic


## Summary ##

# Combine and display results
#v_survival_props <- round(c(p_survival_exp, p_survival_weibull, p_survival_lognormal, p_survival_loglogistic), 4)
#names(v_survival_props) <- c("exponential", "weibull", "lognormal", "loglogistic")
#round(v_survival_props, 3) * 100


```


## Part IV Organizing Results for Reporting - Cholesky Matrix
This section calculated the Cholesky decomposition matrix for all the survival distributions and saves it in an Excel file. The Excel workbook can be found in the "output" folder. You can use the values from there for your Cabazitaxel model. 

```{r}
m_cholesky_exponential <- matrix(NA, nrow = 1, ncol = 4, 
                                 dimnames = list("exponential", 
                                                 rev(v_names_data)))

# Create a list for the cholesky matrix
l_cholesky_weibull <- l_cholesky_lognormal <- l_cholesky_loglogistic <- list()
                                

# Create table for the Cholesky matrices 
for (i in v_names_data){
  m_cholesky_exponential[, i] <- (l_results_all[[i]]$exponential$cholesky)
  l_cholesky_weibull[[i]]     <- as.matrix((l_results_all[[i]]$weibull$cholesky))
  l_cholesky_lognormal[[i]]   <- (l_results_all[[i]]$lognormal$cholesky)
  l_cholesky_loglogistic[[i]] <- (l_results_all[[i]]$loglogistic$cholesky)
}

# Load necessary library
library(openxlsx)

# Initialize workbook
wb_2 <- createWorkbook("AHEM student")

# Add worksheets with colors
addWorksheet(wb_2, "exponential", gridLines = FALSE, tabColour = "skyblue4")
addWorksheet(wb_2, "weibull",     gridLines = FALSE, tabColour = "coral3")
addWorksheet(wb_2, "lognormal",   gridLines = FALSE, tabColour = "skyblue")
addWorksheet(wb_2, "loglogistic", gridLines = FALSE, tabColour = "coral")

# Define the fixed order
fixed_order <- c("PFS_caba", "PFS_mito", "OS_caba", "OS_mito")
# Write m_cholesky_exponential directly
writeData(wb_2, sheet = "exponential", x =  m_cholesky_exponential, rowNames = TRUE)


# Write l_cholesky_weibull in the fixed order
startRow <- 1
for (name in fixed_order) {
  if (name %in% names(l_cholesky_weibull)) {
    writeData(wb_2, sheet = "weibull", x = paste("Model:", name), startRow = startRow, startCol = 1)
    writeData(wb_2, sheet = "weibull", x = l_cholesky_weibull[[name]], startRow = startRow + 1, rowNames = TRUE)
    startRow <- startRow + nrow(l_cholesky_weibull[[name]]) + 3
  }
}

# Write l_cholesky_lognormal in the fixed order
startRow <- 1
for (name in fixed_order) {
  if (name %in% names(l_cholesky_lognormal)) {
    writeData(wb_2, sheet = "lognormal", x = paste("Model:", name), startRow = startRow, startCol = 1)
    writeData(wb_2, sheet = "lognormal", x = l_cholesky_lognormal[[name]], startRow = startRow + 1, rowNames = TRUE)
    startRow <- startRow + nrow(l_cholesky_lognormal[[name]]) + 3
  }
}

# Write l_cholesky_loglogistic in the fixed order
startRow <- 1
for (name in fixed_order) {
  if (name %in% names(l_cholesky_loglogistic)) {
    writeData(wb_2, sheet = "loglogistic", x = paste("Model:", name), startRow = startRow, startCol = 1)
    writeData(wb_2, sheet = "loglogistic", x = l_cholesky_loglogistic[[name]], startRow = startRow + 1, rowNames = TRUE)
    startRow <- startRow + nrow(l_cholesky_loglogistic[[name]]) + 3
  }
}

# Save the workbook
saveWorkbook(wb_2, "output/output_cholensky.xlsx", overwrite = TRUE)
```