Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove normal approximation #153

Merged
merged 10 commits into from
Jul 17, 2024
Merged

Remove normal approximation #153

merged 10 commits into from
Jul 17, 2024

Conversation

adamkucharski
Copy link
Member

@adamkucharski adamkucharski commented Jun 24, 2024

  • Please check if the PR fulfills these requirements
  • I have read the CONTRIBUTING guidelines
  • A new item has been added to NEWS.md
  • Tests for the changes have been added (for bug fixes / features)
  • Docs have been added / updated (for bug fixes / features)
  • Checks have been run locally and pass
  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)

This addresses issues #152 and #151

  • What is the current behavior? (You can also link to an open issue here)

Instability with normal approximation in Ebola example

  • What is the new behavior (if this is a feature change)?

Normal approximation removed. This PR also updates tests for consistency with the removed functionality.

There are two additional changes:

  • Example in the README now focuses on the early outbreak stage, where the CFR bias is greater and hence the importance of {cfr} functionality is more clearly illustrated.
  • When we have the condition expected outcomes to date < total deaths, we now return NA for all estimates, to make it clearer to the user that this would not be a statistically valid calculation.
  • Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

No.

Given instability of the normal approximation for many values (especially given asymmetric likelihood), and because binomial implementation is quick, this is being removed to ensure accurate outputs.
Output NA if total_outcomes<=total_deaths
Focus on early stage
@adamkucharski adamkucharski marked this pull request as draft June 24, 2024 19:59
@adamkucharski
Copy link
Member Author

Note: need to review the test functions to ensure consistency with updated messages and outputs.

@adamkucharski adamkucharski marked this pull request as ready for review July 1, 2024 07:58
@adamkucharski adamkucharski marked this pull request as draft July 2, 2024 10:49
@adamkucharski adamkucharski marked this pull request as ready for review July 2, 2024 10:49
@Bisaloo
Copy link
Member

Bisaloo commented Jul 11, 2024

Could you run styler::style_pkg() in the folder containing the package please? This will ensure consistency in the indentation and make it easier to read & review the code.

@adamkucharski
Copy link
Member Author

I've run styler::style_pkg() (latest commit). Is this something we should also incorporate into this issue on PR guidelines? Would it also make easier for users to pass the linter checks?

Knitted vignettes appear to show equations OK once removed.
@avallecam
Copy link
Member

sorry for the delayed reply. Just created this reprex to reproduce the README figure.

This PR successfully solved the issue of the point estimate and confidence interval. I would only add that this reprex also diagnoses that there are sections of the time series that do not generate an output, generating no estimates for certain date ranges, which are visible in the plot.

# pak::pak("epiverse-trace/cfr@remove-normal")

# Load package
library(cfr)
library(ggplot2)

# Calculate the static CFR without correcting for delays
cfr_static(data = ebola1976)
#>   severity_estimate severity_low severity_high
#> 1          0.955102    0.9210866     0.9773771


# Calculate the CFR without correcting for delays on each day of the outbreak
rolling_cfr_naive <- cfr_rolling(
  data = ebola1976
)
#> `cfr_rolling()` is a convenience function to help understand how additional data influences the overall (static) severity. Use `cfr_time_varying()` instead to estimate severity changes over the course of the outbreak.


# Calculate the rolling daily CFR while correcting for delays
rolling_cfr_corrected <- cfr_rolling(
  data = ebola1976,
  delay_density = function(x) dgamma(x, shape = 2.40, scale = 3.33)
)
#> `cfr_rolling()` is a convenience function to help understand how additional data influences the overall (static) severity. Use `cfr_time_varying()` instead to estimate severity changes over the course of the outbreak.
#> Some daily ratios of total deaths to total cases with known outcome are below 0.01%: some CFR estimates may be unreliable.FALSE

# combine the data for plotting
rolling_cfr_naive$method <- "naive"
rolling_cfr_corrected$method <- "corrected"

data_cfr <- rbind(
  rolling_cfr_naive,
  rolling_cfr_corrected
)

# visualise both corrected and uncorrected rolling estimates
ggplot(data_cfr) +
  geom_ribbon(
    aes(
      date,
      ymin = severity_low, ymax = severity_high,
      fill = method
    ),
    alpha = 0.2, show.legend = FALSE
  ) +
  geom_line(
    aes(date, severity_estimate, colour = method)
  ) +
  scale_colour_brewer(
    palette = "Dark2",
    labels = c("Corrected CFR", "Naive CFR"),
    name = NULL
  ) +
  scale_fill_brewer(
    palette = "Dark2"
  )
#> Warning: Removed 1 row containing missing values or values outside the scale range
#> (`geom_line()`).

Created on 2024-07-11 with reprex v2.1.0

Copy link
Member

@avallecam avallecam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case NA may not be an expected outcome, we can use this reprex to check this behaviour after the corresponding fix. Here are the cfr_rolling and two cfr_static outputs with data until specific dates.

# pak::pak("epiverse-trace/cfr@remove-normal")

# Load package
library(cfr)
library(dplyr)
library(lubridate)

cfr_rolling(
  data = ebola1976,
  delay_density = function(x) dgamma(x, shape = 2.40, scale = 3.33)
) %>% 
  filter(is.na(severity_estimate))
#> `cfr_rolling()` is a convenience function to help understand how additional data influences the overall (static) severity. Use `cfr_time_varying()` instead to estimate severity changes over the course of the outbreak.
#> Some daily ratios of total deaths to total cases with known outcome are below 0.01%: some CFR estimates may be unreliable.FALSE
#>          date severity_estimate severity_low severity_high
#> 1  1976-08-25                NA           NA            NA
#> 2  1976-09-28                NA           NA            NA
#> 3  1976-09-29                NA           NA            NA
#> 4  1976-09-30                NA           NA            NA
#> 5  1976-10-01                NA           NA            NA
#> 6  1976-10-02                NA           NA            NA
#> 7  1976-10-03                NA           NA            NA
#> 8  1976-10-04                NA           NA            NA
#> 9  1976-10-05                NA           NA            NA
#> 10 1976-10-06                NA           NA            NA
#> 11 1976-10-07                NA           NA            NA
#> 12 1976-10-08                NA           NA            NA
#> 13 1976-10-15                NA           NA            NA

ebola1976 %>% 
  filter(date<=ymd(19761001)) %>% 
  cfr_static(delay_density = function(x) dgamma(x, shape = 2.40, scale = 3.33))
#> Total deaths = 140 and expected outcomes = 134 so setting expected outcomes = NA. If we were to assume
#>         total deaths = expected outcomes, it would produce an estimate of 1.
#>   severity_estimate severity_low severity_high
#> 1                NA           NA            NA

ebola1976 %>% 
  filter(date<=ymd(19761015)) %>% 
  cfr_static(delay_density = function(x) dgamma(x, shape = 2.40, scale = 3.33))
#> Total deaths = 214 and expected outcomes = 214 so setting expected outcomes = NA. If we were to assume
#>         total deaths = expected outcomes, it would produce an estimate of 1.
#>   severity_estimate severity_low severity_high
#> 1                NA           NA            NA

Created on 2024-07-11 with reprex v2.1.0

@adamkucharski
Copy link
Member Author

adamkucharski commented Jul 12, 2024

Thanks for looking at this. Currently we have some situations where E(known outcomes) < deaths, and hence the likelihood as currently implemented isn't totally valid. One option would be to set E(known outcomes) = deaths in this situation (which some earlier versions of the code did). But thinking more, seemed clearest to return NA in this situation. I've outlined a potential longer-term solution in issue #154 (but this would require some more thought and potentially quite a lot of refactoring).

@avallecam
Copy link
Member

That is interesting; thank you for adding an explicit explanation. In the meantime, would it be valid to get the message as a warning and, instead of NA, provide the latest or most recent estimated output at a given date?

@adamkucharski
Copy link
Member Author

It would require quite a lot of additional refactoring to output the last valid estimate, as the README example is a showcase of multiple estimates at each point in time in a visualisation (i.e. a loop over cfr_static() ), rather than the output user will commonly interact with (i.e. a numerical estimate).

The current message is displayed above (e.g. Total deaths = 140 and expected outcomes = 134 so setting expected outcomes = NA. If we were to assume total deaths = expected outcomes, it would produce an estimate of 1.) But could edit if there's a better option?

@avallecam
Copy link
Member

avallecam commented Jul 15, 2024

The current message is displayed above (e.g. Total deaths = 140 and expected outcomes = 134 so setting expected outcomes = NA. If we were to assume total deaths = expected outcomes, it would produce an estimate of 1.) But could edit if there's a better option?

The current message displayed is appropriate and specific. This reflects that this is produced in an extreme scenario, as described in #154. Given that this PR already solved the key issue, I'll move on with the approval.

As a complementary comment, I suggest adding to the message an explicit next step for the user. If, in an ongoing outbreak, we create a reproducible sitrep and suddenly get an NA result, then we can have a companion plot from cfr_rolling() to have the full view.

We could add sth like:

Use `cfr_rolling()` to understand how additional data influences the overall (static) severity.

Copy link
Member

@avallecam avallecam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 ready to merge given that this PR solved the main issue.

As commented in #153 (comment) this could optionally consider adding a next step in the output message for NA outputs from an extreme situation, as evaluated with data from ebola 1976, and until #154 manages to be solved.

@adamkucharski adamkucharski merged commit 29ee12a into main Jul 17, 2024
8 checks passed
@adamkucharski adamkucharski deleted the remove-normal branch July 17, 2024 09:38
@adamkucharski
Copy link
Member Author

Thanks, have merged and will create an issue with the above rolling suggestion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants