add e-value threshold to require internal alignments have e-value < 1 #309
LGTM but before merging please fix the linter error reported in the Checks section.
Also, we should describe the unit test strategy for this code. The easiest way to do so is to open an issue in https://github.com/chanzuckerberg/idseq-workflows requesting a new test case, and indicate the ID of a staging sample where this was tested, the name of a step/task where this code is called, and the assertions to be made about the output.
idseq_dag/util/m8.py
@@ -67,7 +71,7 @@
 MIN_CONTIG_SIZE = 4


-def parse_tsv(path, schema, expect_headers=False, raw_lines=False):
+def parse_tsv(path, schema, expect_headers=False, raw_lines=False, min_alignment_length=0):
What is the purpose of adding the min_alignment_length kwarg here? It doesn't seem to be used within the function.
This was a holdover from my initial attempt at the "TODO: Deprecate this iterate_m8() function...". It is no longer relevant, as I took a different approach for the e-value filter. Will remove!
Added unit test description here: chanzuckerberg/idseq-workflows#7
Description
Issue: For amplicon sequencing libraries (e.g. SARS-CoV-2 ARTIC v3 libraries), there were false-positive hits to taxa with e-values in the thousands.
Solution: Standard practice is to apply an e-value threshold so that only high-quality alignments are retained. IDseq aims to maintain high sensitivity while removing false positives that have exceptionally high e-values. A relatively conservative e-value threshold of 1 was therefore implemented (alignments must have e-value < 1), which removes the obvious false positives while preserving the sensitivity needed to detect novel organisms.
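For illustration only, the filter described above can be sketched as follows. This is not the code in this PR: the names `MAX_EVALUE_THRESHOLD` and `filter_m8_rows` are hypothetical, and the sketch assumes the standard 12-column BLAST tabular (m8) layout, in which the e-value is the 11th column.

```python
# Assumption: standard 12-column BLAST tabular (m8) output,
# with the e-value in the 11th column (zero-based index 10).
EVALUE_COLUMN = 10

# Hypothetical constant name; the value 1 comes from the PR description.
MAX_EVALUE_THRESHOLD = 1.0


def filter_m8_rows(lines, max_evalue=MAX_EVALUE_THRESHOLD):
    """Yield m8 rows (lists of strings) whose e-value is strictly below max_evalue."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/header lines.
        if not line or line.startswith("#"):
            continue
        row = line.split("\t")
        if float(row[EVALUE_COLUMN]) < max_evalue:
            yield row


# Made-up example rows: one strong hit, one hit with an e-value in the thousands,
# and one hit sitting exactly on the threshold.
sample_lines = [
    "read1\tsubjectA\t100.0\t150\t0\t0\t1\t150\t1\t150\t1e-75\t278\n",
    "read2\tsubjectB\t85.0\t20\t3\t0\t1\t20\t5\t24\t2300\t18.3\n",
    "read3\tsubjectC\t90.0\t30\t3\t0\t1\t30\t5\t34\t1\t25.0\n",
]
kept_reads = [row[0] for row in filter_m8_rows(sample_lines)]
```

With a strict `<` comparison, a hit whose e-value is exactly 1 is dropped along with the hits in the thousands, which matches the "e-value < 1" wording in the PR title.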
Version
vX.Y.Z
Tests
Notes