Add lambda parameter for q-value estimation in enrichment script #1953

Open · wants to merge 8 commits into master

Conversation

ivagljiva (Contributor) commented:
This PR addresses an issue discovered here, whereby very small p-values would trigger an ugly seq() error like this:

Error in seq.default(0.05, max_lambda, 0.05) :
  wrong sign in 'by' argument

(This can happen when the input file is too short, which is possible because we remove all core functions/annotations when generating the input file.)
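For context, the failure is easy to reproduce: the error message shows that the lambda sequence is built with a fixed step of 0.05, so any max_lambda below 0.05 makes seq() fail. A minimal reproduction in R (max_lambda here is just a stand-in for the value the script derives from the p-values):

# max_lambda stands in for the value derived from the p-values; when all
# p-values are tiny, it can fall below the 0.05 step size.
max_lambda <- 0.03
lambdas <- seq(0.05, max_lambda, 0.05)
# Error in seq.default(0.05, max_lambda, 0.05) :
#   wrong sign in 'by' argument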

These changes introduce two new aspects to the code:

  1. When the max_lambda is too small, we stop the script and print a nicer error for the user:
Error: Unfortunately, the maximum lambda for q-value estimation is < 0.05. This
         value is coming from the p-values in the enrichment test, so something in
         your data is making those p-values low. This sometimes happens when you
         have a very short input file (your input file has 4 rows). Regardless,
         a possible solution is for you to pick your own max lambda that is >= 0.05,
         and provide it to the program calling this script by using the
         --qlambda parameter.
  2. The enrichment script now accepts a --qlambda parameter that allows the user to set max_lambda themselves (see the sketch after this list). This optional parameter has been added to the program anvi-compute-functional-enrichment-across-genomes.
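Here is a rough sketch of how the check and the new parameter fit together; names like pvalues_df and opt$qlambda are illustrative stand-ins, not necessarily what the script actually uses:

# Illustrative sketch only -- variable names are assumptions.
pvalues_df <- data.frame(unadjusted_p_value = c(0.001, 0.004, 0.010))
opt <- list(qlambda = 0.5)                            # value passed via --qlambda, if any

max_lambda <- max(pvalues_df$unadjusted_p_value)      # default comes from the data
if (!is.null(opt$qlambda)) max_lambda <- opt$qlambda  # user override

if (max_lambda < 0.05) {
  stop("Unfortunately, the maximum lambda for q-value estimation is < 0.05. ",
       "Pick your own max lambda that is >= 0.05 and provide it via --qlambda.")
}
lambdas <- seq(0.05, max_lambda, 0.05)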

On my test data (Enterococcus genomes from the Infant Gut Tutorial, using the annotation source COG20_PATHWAY, which has only 3 annotations differentially present across the genomes), the --qlambda parameter lets the script proceed through the q-value estimation, but it still ends in this later error:

Error: Doh! We still can't estimate the proportion of features that are truly null. It's Amy's fault, so please let her know ASAP so she can fix it!

I suspect this happens because the input file is just too darn small, so even the new parameter doesn't really fix the issue. But at least now users have a bit more control.

@adw96, could you please take a look at these changes and let me know what you think? If you like them, I'll update the documentation and merge the code :)

adw96 (Contributor) commented Jul 9, 2022:

@ivagljiva Thanks for your work on this!

An alternative to erroring when we can't estimate lambda is to return the rest of the results (with p-values) but not the q-values. I think a better approach is to warn the user that we couldn't do FDR correction (maybe with q-value = NA for all entries so they don't miss the warning) rather than stopping with an error.

Essentially I think we skip over

qvalues_df <- pvalues_df %>%
  mutate("adjusted_q_value" = qvalue::qvalue(p=unadjusted_p_value, lambda=lambdas, pi0.method="smoother")$qvalues)

(or replace it with qvalues_df <- pvalues_df %>% mutate("adjusted_q_value" = NA) so we always have the same number of columns)

if the "try-error" %in% class(pi0_est) is TRUE.
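Put together, a minimal sketch of that fallback might look like this (assuming pi0_est comes from wrapping the pi0 estimation in try(), and that pvalues_df and lambdas exist as above; the exact call is my assumption, not verbatim from the script):

library(dplyr)

# Attempt the pi0 estimation, capturing any failure instead of aborting.
pi0_est <- try(qvalue::pi0est(p = pvalues_df$unadjusted_p_value,
                              lambda = lambdas, pi0.method = "smoother"),
               silent = TRUE)

if ("try-error" %in% class(pi0_est)) {
  # Downgrade the failure to a warning and keep the output schema stable.
  warning("Could not estimate the proportion of truly null features; ",
          "skipping FDR correction and setting adjusted_q_value to NA.")
  qvalues_df <- pvalues_df %>% mutate("adjusted_q_value" = NA)
} else {
  qvalues_df <- pvalues_df %>%
    mutate("adjusted_q_value" = qvalue::qvalue(p = unadjusted_p_value,
                                               lambda = lambdas,
                                               pi0.method = "smoother")$qvalues)
}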

What do you think?
