Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PR to add individual-level data cfr case study for ebola-like disease #1

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 15 additions & 0 deletions R/cfr_ci.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
# Function to estimate mean CFR with 95% CI from a numeric vector
cfr_ci <- function(x) {
mean <- mean(x)
sd <- sd(x)
sem <- sd/ sqrt(length(x))
margin_of_error <- sem * 1.96
lower <- round(mean - margin_of_error, 2)
upper <- round(mean + margin_of_error, 2)
result <- list(
mean = round(mean, 2),
lower_ci = lower,
upper_ci = upper
)
return(result)
}
17 changes: 17 additions & 0 deletions R/convert_to_incidence.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
# Function to convert to incidence
convert_to_incidence <- function(sim) {
incidence_data <- incidence2::incidence(sim,
date_index = c("date_onset", "date_death", "date_recovery"),
interval = 1L) %>%
complete_dates() %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the newest version of {incidence2} allows you to set complete_dates within the incidence2::incidence() call.

pivot_wider(names_from = count_variable, values_from = count)
#Vector with names for renaming
rename_vec <- c("date_index" = "date", "date_onset" = "cases", "date_death" = "deaths" , "date_recovery" = "recovered")
# Rename columns conditionally
for (old_name in names(rename_vec)) {
if (old_name %in% colnames(incidence_data)) {
colnames(incidence_data)[colnames(incidence_data) == old_name] <- rename_vec[[old_name]]
}
}
Comment on lines +11 to +15
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like this can be made much more efficient by vectorising the renaming. It will also require much less code than currently written. Happy to discuss.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to change it but I'm not 100% sure of what you mean, we can discuss it or you can suggest the changes and I'll accept them

return(incidence_data)
}
8 changes: 8 additions & 0 deletions R/estimate_cfr.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
# Function to calculate cfr by specifying columns from a dataframe with
# deaths and cases
estimate_cfr <- function(df, numerator, denominator) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the {cfr} package not calculate this when a delay distribution is not supplied?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it can, initially I wasn't using it for 2 reasons, 1/ because the output includes the 95 CI and I just wanted the CFR value, so I thought it would be easier to write a function that is doing exactly what I want it to do, 2/ because I wanted the package to be used just for the comparison of the methods, but I guess I could still use it anyway if a delay distribution is not being used... do you think it would be better?

total_deaths <- sum(df[[numerator]])
total_cases <- sum(df[[denominator]])
cfr <- total_deaths/total_cases*100
return(cfr)
}
14 changes: 14 additions & 0 deletions R/id_outbreak_peak.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
# Function to estimate the peak of the outbreak using incidence2
# Incidence object from linelist has 3 count_variables, but we want the column
# date_onset only to be considered to estimate the peak
id_outbreak_peak <- function(linelist) {
inc_data <- linelist %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function remakes an <incidence> object, but one is already created in the analysis script before calling this function. Might be worth seeing the <incidence> already made can be passed into this function and save yourself a step.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I meant the other day about having problems with this step, the thing is that in order to be able to use incidence2's function to estimate the outbreak peak, you need to start off an incidence object, and then it takes the date_index column to give you the peak. The problem is that when there are more than 1 elements in date_index, it calculates the peak for each of them, not just for the onset date (in my case, also for the deaths and recoveries). I tried starting off the previous incidence object that I use in the analyses and remove the rows in date_index that said "date_death" and "date_recovery" first so it would only take into account onset dates, but I didn't find it easy to do at the time and I was getting some unexpected results, so I decided it was best to start off the "correct" incidence object
If there is an easier way to do this I'm more than happy to change it

incidence2::incidence(date_index = "date_onset", interval = 1L) %>%
complete_dates()
peak <- inc_data %>%
estimate_peak(first_only = TRUE) %>%
pull(observed_peak) %>%
as.character()
return(peak)

}
25 changes: 25 additions & 0 deletions R/id_outbreak_phases_new.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
# Function that takes the outbreak's peak (identified previously by id_outbreak_peak),
# pastes this on a new column

id_outbreak_phases_new <- function(inc_data, outbreak_peaks){
inc_data$date <- as.Date(inc_data$date)
# Id start date
start_date <- min(inc_data$date)
# Select a possible date for the growing phase
possible_growing_dates <- inc_data %>%
filter(date > start_date + 10 & date < outbreak_peaks) %>%
pull(date)
# Select one of these dates at random
growing_date <- sample(possible_growing_dates, 1)
# Add new column with outbreak dates
inc_data <- inc_data %>%
mutate(outbreak_dates = case_when(
date == outbreak_peaks & !duplicated(date) ~ "peak",
cases >= 5 & row_number() == min(which(cases >= 5)) ~ "start",
date == max(date) & !duplicated(date) ~ "end",
date == growing_date ~ "growing",
TRUE ~ NA_character_
)
)
return(inc_data)
}
18 changes: 18 additions & 0 deletions R/id_outbreak_phases_old.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
#Function to smooth outbreak trend and create a column that indicates the outbreak's
# peak (mean of window with highest no. of cases), start, and end
id_outbreak_phases_old <- function(inc_data) {
# setting window size
window_size <- 7
trend <- inc_data %>%
mutate(cases_ma = rollmean(cases, k = window_size, fill = NA, align = "center")) %>%
mutate(date = as.Date(date)) %>%
mutate(cases_ma = replace_na(cases_ma, 0)) %>%
mutate(outbreak_dates = case_when(
row_number() == which.max(cases_ma) ~ "peak",
row_number() == which.min(if_else(cases_ma > 1, row_number(), Inf)) ~ "start",
date == max(date) ~ "end",
TRUE ~ NA_character_
))
}


27 changes: 27 additions & 0 deletions R/run_ebola_simulations.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
# Function to run multiple instances of sim_linelist
run_ebola_simulations <- function(sim_count) {
# Initialize empty list to store simulation results
results <- vector("list", length = sim_count)

# Loop through simulations
for (i in 1:sim_count) {
# Set seed for reproducibility
set.seed(i) # Use loop counter for unique seeds

# Run the simulation code and store the result
results[[i]] <- sim_linelist(
contact_distribution = contact_distribution,
infectious_period = ip_ebola_epidist,
onset_to_hosp = NA,
hosp_risk = NA,
hosp_death_risk = NA,
onset_to_death = o_d_ebola_epidist,
onset_to_recovery = o_r_ebola_epidist,
prob_infect = 0.7,
outbreak_size = c(500, 2000),
non_hosp_death_risk = 0.3
)
}
# Return the list of simulation results
return(results)
}
14 changes: 14 additions & 0 deletions R/subset_outbreak_phase.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
# Function to select a random date within an outbreak phase and create a subset
# of data between relevant dates of the outbreak
subset_outbreak_phase <- function(data, from, to) {
# Sample a random date between "from" and "to" statuses
sampled_date <- data %>%
slice(sample(seq(which(outbreak_dates == from), which(outbreak_dates == to)), 1)) %>%
pull(date) %>%
as.character()

# Subset data based on the sampled date
subset_data <- data[data$date <= sampled_date, ]

return(subset_data)
}
16 changes: 16 additions & 0 deletions R/tidy_sim_results.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
# Function to process simulation results
tidy_sim_results <- function(sim_data) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure what this function is doing, but I think it's probably doing this: https://epiverse-trace.github.io/simulist/dev/articles/vis-linelist.html#reshape-line-list-data. If so I don't think you need to write a new function and can apply tidyverse functions over the list of line lists.

# Converting class to character so that it's possible to use ifelse
sim_data$date_outcome <- as.character(sim_data$date_outcome)
# Loop through each individual and update columns
for (i in 1:nrow(sim_data)) {
sim_data$date_death[i] <- ifelse(sim_data$outcome[i] == "died", sim_data$date_outcome[i], NA)
sim_data$date_recovery[i] <- ifelse(sim_data$outcome[i] == "recovered", sim_data$date_outcome[i], NA)
}
# Convert date columns to Date class
sim_data$date_death <- as.Date(sim_data$date_death, format = "%Y-%m-%d")
sim_data$date_recovery <- as.Date(sim_data$date_recovery, format = "%Y-%m-%d")

return(sim_data)
}

Loading