Skip to content
This repository has been archived by the owner on Jan 11, 2022. It is now read-only.

add e-value threshold to require internal alignments have e-value < 1 #309

Merged
merged 6 commits into from
Jun 12, 2020

Conversation

katrinakalantar
Copy link
Contributor

@katrinakalantar katrinakalantar commented Jun 10, 2020

Description

issue: For amplicon sequencing libraries (i.e. sars-cov-2 artic v3 libraries) there were false-positive hits to taxa with e-values in the thousands.

solution: Standard practice is to apply an e-value threshold to ensure that alignments are high-quality. IDseq seeks to maintain high sensitivity, but reduce false-positives with exceptionally high e-values. Thus, a relatively conservative e-value threshold of 1 was implemented to reduce the number of obvious false-positives while still maintaining sensitivity to detect novel organisms.

Version

Tests

  • I have verified that the pipeline still completes successfully:
    • for single-end inputs
    • for paired-end inputs
    • for FASTQ inputs
    • for FASTA inputs.
  • I have validated that my change does not introduce any correctness bugs to existing output types.
  • I have validated that my change does not introduce significant performance regressions or I have discussed with the team that the benefits of the change are substantial enough that we're comfortable accepting the size of the measured performance penalty.

Notes

  • I have verified using the benchmark sample, that there is no change in accuracy at species/genus/family levels
  • I have verified that intermediate alignment files (gsnap.deduped.m8, gsnap.blast.top.m8, rapsearch2.deduped.m8, rapsearch2.blast.top.m8) do not contain alignment results with e-values greater than the specified threshold.

@katrinakalantar katrinakalantar changed the title add e-value threshold to require internal alignments have e-value > 1 add e-value threshold to require internal alignments have e-value < 1 Jun 10, 2020
Copy link
Contributor

@kislyuk kislyuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but before merging please fix the linter error reported in the Checks section.

Also, we should describe the unit test strategy for this code. The easiest way to do so is to open an issue in https://github.com/chanzuckerberg/idseq-workflows requesting a new test case, and indicate the ID of a staging sample where this was tested, the name of a step/task where this code is called, and the assertions to be made about the output.

@@ -67,7 +71,7 @@
MIN_CONTIG_SIZE = 4


def parse_tsv(path, schema, expect_headers=False, raw_lines=False):
def parse_tsv(path, schema, expect_headers=False, raw_lines=False, min_alignment_length=0):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of adding the min_alignment_length kwarg here? It doesn't seem to be used within.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a hold-over from initial work to attempt the "TODO: Deprecate this iterate_m8() function...". It is no longer relevant, as I took a different approach for the e-value filter. Will remove!

@katrinakalantar
Copy link
Contributor Author

Added unit test description here: chanzuckerberg/idseq-workflows#7

@katrinakalantar katrinakalantar merged commit 70ee443 into master Jun 12, 2020
@katrinakalantar katrinakalantar deleted the kkalantar/evalue-threshold branch June 12, 2020 17:55
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants