add e-value threshold to require internal alignments have e-value < 1 #309

katrinakalantar · 2020-06-10T17:42:36Z

Description

issue: For amplicon sequencing libraries (e.g. SARS-CoV-2 ARTIC v3 libraries), there were false-positive hits to taxa with e-values in the thousands.

solution: Standard practice is to apply an e-value threshold to ensure that alignments are high-quality. IDseq seeks to maintain high sensitivity while removing false positives with exceptionally high e-values. Thus, a relatively conservative maximum e-value of 1 was implemented: alignments must have an e-value below 1 to be retained, which removes obvious false positives while preserving sensitivity to detect novel organisms.
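
For illustration, the kind of filter described above can be sketched as follows. This is not the code from this PR: the names `MAX_EVALUE`, `EVALUE_COLUMN`, and `filter_m8_lines` are hypothetical, the sample rows are made up, and the sketch assumes the standard 12-column BLAST tabular (m8) layout with a plain numeric e-value in the 11th column.

```python
import io

# Assumption: standard 12-column BLAST tabular (m8) layout, where the
# e-value is the 11th column (index 10).
EVALUE_COLUMN = 10
# Hypothetical constant name; the value 1 comes from the PR description.
MAX_EVALUE = 1.0


def filter_m8_lines(lines, max_evalue=MAX_EVALUE):
    """Yield only the alignment lines whose e-value is strictly below max_evalue."""
    for line in lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if float(fields[EVALUE_COLUMN]) < max_evalue:
            yield line


# Made-up sample rows: one strong hit and one hit with an e-value in the thousands.
sample = io.StringIO(
    "read1\trefA\t99.0\t150\t1\t0\t1\t150\t100\t249\t1e-70\t270\n"
    "read2\trefB\t80.0\t20\t4\t0\t1\t20\t5\t24\t2500\t18.3\n"
)
kept = list(filter_m8_lines(sample))
```

With these sample rows, only the `read1` alignment is kept; the `read2` alignment is dropped because its e-value is not below 1.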

Version

I have increased the appropriate version number in https://github.com/chanzuckerberg/idseq-dag/blob/master/idseq_dag/__init__.py. Guidelines here: https://github.com/chanzuckerberg/idseq-dag/blob/pr-template/README.md#release-notes
I have added release notes for my new version to https://github.com/chanzuckerberg/idseq-dag/blob/master/README.md#release-notes
I will push a git tag after merging in the form vX.Y.Z

Tests

I have verified that the pipeline still completes successfully:
- for single-end inputs
- for paired-end inputs
- for FASTQ inputs
- for FASTA inputs.
I have validated that my change does not introduce any correctness bugs to existing output types.
I have validated that my change does not introduce significant performance regressions or I have discussed with the team that the benefits of the change are substantial enough that we're comfortable accepting the size of the measured performance penalty.

Notes

I have verified, using the benchmark sample, that there is no change in accuracy at the species, genus, and family levels.
I have verified that intermediate alignment files (gsnap.deduped.m8, gsnap.blast.top.m8, rapsearch2.deduped.m8, rapsearch2.blast.top.m8) do not contain alignment results with e-values greater than the specified threshold.
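
A check along these lines could be scripted roughly as below. This is a minimal sketch, not the procedure actually used for this PR: the helper name `count_rows_at_or_above_threshold` is hypothetical, the demo file is made up, and the e-value is assumed to be a plain number in the 11th tab-separated column of each m8 file.

```python
import os
import tempfile


def count_rows_at_or_above_threshold(path, max_evalue=1.0, evalue_column=10):
    """Count alignment rows in an m8 file whose e-value is not below max_evalue."""
    violations = 0
    with open(path) as handle:
        for line in handle:
            # Skip blank lines and comment/header lines.
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            # The PR title requires e-value < 1, so a value equal to the
            # threshold is counted as a violation here.
            if float(fields[evalue_column]) >= max_evalue:
                violations += 1
    return violations


# Demo on a made-up file; in practice the path would point at an intermediate
# file such as gsnap.deduped.m8 or rapsearch2.deduped.m8.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.m8")
    with open(demo_path, "w") as out:
        out.write("read1\trefA\t99.0\t150\t1\t0\t1\t150\t100\t249\t1e-70\t270\n")
        out.write("read2\trefB\t80.0\t20\t4\t0\t1\t20\t5\t24\t2500\t18.3\n")
    violations = count_rows_at_or_above_threshold(demo_path)
```

A file that passes the check yields a count of 0; the demo file above deliberately contains one offending row.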

kislyuk

LGTM but before merging please fix the linter error reported in the Checks section.

Also, we should describe the unit test strategy for this code. The easiest way to do so is to open an issue in https://github.com/chanzuckerberg/idseq-workflows requesting a new test case, and indicate the ID of a staging sample where this was tested, the name of a step/task where this code is called, and the assertions to be made about the output.

kislyuk · 2020-06-12T15:40:26Z

idseq_dag/util/m8.py

@@ -67,7 +71,7 @@
 MIN_CONTIG_SIZE = 4


-def parse_tsv(path, schema, expect_headers=False, raw_lines=False):
+def parse_tsv(path, schema, expect_headers=False, raw_lines=False, min_alignment_length=0):


What is the purpose of adding the min_alignment_length kwarg here? It doesn't seem to be used within.

This was a hold-over from initial work to attempt the "TODO: Deprecate this iterate_m8() function...". It is no longer relevant, as I took a different approach for the e-value filter. Will remove!

katrinakalantar · 2020-06-12T17:54:57Z

Added unit test description here: chanzuckerberg/idseq-workflows#7

katrinakalantar added 4 commits June 3, 2020 15:39

add evalue_threshold parameter to m8 parsing function

8714d5f

add evalue filters to blast results

f3cc029

fix max_evalue_threshold variable name

c9f1cae

bump version and add changelog to README

5e4f5c8

katrinakalantar requested review from kislyuk, tfrcarvalho and cdebourcy June 10, 2020 17:44

katrinakalantar changed the title ~~add e-value threshold to require internal alignments have e-value > 1~~ add e-value threshold to require internal alignments have e-value < 1 Jun 10, 2020

kislyuk approved these changes Jun 12, 2020

katrinakalantar and others added 2 commits June 12, 2020 09:54

remove min_alignment_length kwag where not used

e7e58a6

Fix lint error

0bd6d14

katrinakalantar mentioned this pull request Jun 12, 2020

add test cases for maximum e-value filter on alignment results chanzuckerberg/idseq-workflows#7

Open

katrinakalantar merged commit 70ee443 into master Jun 12, 2020

katrinakalantar deleted the kkalantar/evalue-threshold branch June 12, 2020 17:55
