Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PR to add individual-level data cfr case study for ebola-like disease #1

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

CarmenTamayo
Copy link
Collaborator

Steps included

  1. Define parameters for simulist, that include a realistic onset-death delay and a longer onset-recovery
  2. Function run_ebola_simulations that uses simulist to create a list with n=100 (for now) linelists
  3. Function tidy_sim_results, that creates 2 new columns (date_onset and date_recovery) and makes sure they're on a date format
  4. Function convert_to_incidence to use incidence2 and have a list of df with daily aggregated data, pivots the tables so they can be used with cfr, and makes sure the names of the columns are correct
  5. Function id_outbreak_peak to create a vector that contains all peak dates for every df using incidence2::estimate_peak
  6. Function id_outbreak_phases_new- 4 key outbreak dates
    • Start- first date with > 5 cases
    • Peak- using vector from id_outbreak_peak
    • Growing- date between the start and the peak with sufficient cases and deaths for cfr calculation
    • End-max date
  7. 4 different lists of dfs are created
    a. Real time outbreak during growing phase (1) and during declining phase (2): using function subset_outbreak_phase, which takes a random date between start and peak for 1, and between peak and end for 2, and uses this date as a cut-off point for the real-time analysis.
    b. Real-time outbreak at the peak (3): using a for loop to cut the df at the row that says "peak"
    c. (4) For the retrospective-cohort approach, the real-time cut-off is between the "growing" date and the peak date, to ensure there will be enough cases so that the cfr calculation makes sense (i.e., no 0 deaths/cases or deaths/0 cases)
  8. Function estimate_cfr which takes specified columns from data frame to sum(deaths)/sum(cases) and creates a vector with the cfr per dataframe
  9. Function cfr_ci to take cfr numeric vector and estimate mean + 95% CI

Copy link
Member

@joshwlambert joshwlambert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @CarmenTamayo. I've left a few comments through the script and on some functions. Happy to discuss any comments that aren't clear. This is just a first run through the code. I'll take a look again soon and provide more detailed comments.

Some additional notes, if you're going to write custom functions for this analysis it would be good to add {roxygen2} documentation blocks and unit tests. I'd be happy to contribute to these.

)

ebola_ip_meansd <- c(mean = 9.4, sd = 5.5)
# From https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4560911/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this in {epireview}, if not would be good to add to the {epiparameter} database?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually not in {epireview}, I found it online and it's not parameterised, is it worth adding to the epiparameter db? if so, it'd be helpful if you could run me through the process to make sure I do it properly


# Function rum_ebola_simulations uses simulist to create a list with n=100 (for now) linelists
set.seed(487593478)
simulations_ebola <- run_ebola_simulations(100)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the run_ebola_simulations() function can be removed and replaced with an lapply(), which is good as it is less code to maintain as you run the analyses.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean doing something like this:

simulations_ebola <- lapply(1:100, function(x){ sim_linelist( contact_distribution = contact_distribution, infectious_period = ip_ebola_epidist, onset_to_hosp = NULL, hosp_risk = NULL, hosp_death_risk = NULL, onset_to_death = o_d_ebola_epidist, onset_to_recovery = o_r_ebola_epidist, prob_infect = 0.7, outbreak_size = c(500, 2000), non_hosp_death_risk = 0.3 ) })

I've tried doing lapply(1:100, sim_linelist, arguments) but it wouldn't let me, should this work?

Comment on lines +11 to +15
for (old_name in names(rename_vec)) {
if (old_name %in% colnames(incidence_data)) {
colnames(incidence_data)[colnames(incidence_data) == old_name] <- rename_vec[[old_name]]
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels like this can be made much more efficient by vectorising the renaming. It will also require much less code than currently written. Happy to discuss.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to change it but I'm not 100% sure of what you mean, we can discuss it or you can suggest the changes and I'll accept them

# Incidence object from linelist has 3 count_variables, but we want the column
# date_onset only to be considered to estimate the peak
id_outbreak_peak <- function(linelist) {
inc_data <- linelist %>%
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function remakes an <incidence> object, but one is already created in the analysis script before calling this function. Might be worth seeing the <incidence> already made can be passed into this function and save yourself a step.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I meant the other day about having problems with this step, the thing is that in order to be able to use incidence2's function to estimate the outbreak peak, you need to start off an incidence object, and then it takes the date_index column to give you the peak. The problem is that when there are more than 1 elements in date_index, it calculates the peak for each of them, not just for the onset date (in my case, also for the deaths and recoveries). I tried starting off the previous incidence object that I use in the analyses and remove the rows in date_index that said "date_death" and "date_recovery" first so it would only take into account onset dates, but I didn't find it easy to do at the time and I was getting some unexpected results, so I decided it was best to start off the "correct" incidence object
If there is an easier way to do this I'm more than happy to change it

@@ -0,0 +1,8 @@
# Function to calculate cfr by specifying columns from a dataframe with
# deaths and cases
estimate_cfr <- function(df, numerator, denominator) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the {cfr} package not calculate this when a delay distribution is not supplied?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it can, initially I wasn't using it for 2 reasons, 1/ because the output includes the 95 CI and I just wanted the CFR value, so I thought it would be easier to write a function that is doing exactly what I want it to do, 2/ because I wanted the package to be used just for the comparison of the methods, but I guess I could still use it anyway if a delay distribution is not being used... do you think it would be better?

Comment on lines +14 to +15
library(data.table)
library(zoo)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are {data.table} and {zoo} used? It's personal preference, but you could namespace some functions (e.g. pkgname::func()).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they were used previously but not in the final version, I'll remove them
and do you mean whenever I use a package to include the package name?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants