Enhancement: add whitelist to pre-processing #84

Open
dhoconno opened this issue Dec 1, 2022 · 2 comments

Comments

@dhoconno

dhoconno commented Dec 1, 2022

Thanks for this very useful workflow. To reduce runtimes, can I suggest adding a 'whitelist' rule to preprocessing? This could reduce runtimes considerably in situations where the targets are limited (e.g., only interested in known human viruses).

I think the implementation could be straightforward:

  • Add a config option to specify a FASTA/Q file of sequences to whitelist.
  • In the preprocessing rule files, duplicate the existing host_removal_mapping rule as whitelist_read_mapping or equivalent.
  • Instead of excluding mapped reads with samtools view -f 4... (which keeps only reads flagged as unmapped), the duplicated rule would map reads to the whitelist and retain only mapped reads with samtools view -F 4.
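
To make the difference between the two filters concrete, here is a minimal sketch of the flag logic in plain Python. It is not code from the workflow: the function names, read names, and example flags are hypothetical, and it only mimics how samtools view treats -f (required bits) and -F (excluded bits) for the SAM "unmapped" flag bit (4).

```python
# SAM FLAG bit 0x4 means "segment unmapped".
UNMAPPED = 0x4


def passes_filter(flag, require=0, exclude=0):
    """Mimic samtools view -f REQUIRE / -F EXCLUDE on one SAM flag.

    A record passes when all 'require' bits are set and no
    'exclude' bits are set.
    """
    return (flag & require) == require and (flag & exclude) == 0


def host_removal_keeps(flag):
    """Like 'samtools view -f 4': keep only reads that did NOT map."""
    return passes_filter(flag, require=UNMAPPED)


def whitelist_keeps(flag):
    """Like 'samtools view -F 4': keep only reads that DID map."""
    return passes_filter(flag, exclude=UNMAPPED)


# Hypothetical reads and their SAM flags after mapping to a reference.
example_flags = {
    "read_a": 0,    # mapped, forward strand
    "read_b": 16,   # mapped, reverse strand
    "read_c": 4,    # unmapped
    "read_d": 77,   # paired, read and mate both unmapped, first in pair
}

kept_by_host_removal = [n for n, f in example_flags.items() if host_removal_keeps(f)]
kept_by_whitelist = [n for n, f in example_flags.items() if whitelist_keeps(f)]

print("host removal keeps:", kept_by_host_removal)
print("whitelist keeps:", kept_by_whitelist)
```

The two filters are exact complements on the unmapped bit, which is why the proposed rule could be a near copy of the host removal rule with only the reference and the flag option changed.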

Thanks for your consideration!

@beardymcjohnface
Collaborator

Hi,
This is an interesting suggestion. Do you think having an option to use a custom primary database for the viruses of interest would work? The primary searches do essentially what you're suggesting, but for all viruses, and the secondary multi-kingdom searches weed out the false positives from this reduced pool of sequences.

@dhoconno
Author

dhoconno commented Dec 2, 2022

Yep, absolutely. Depending on how much database prep the primary database needs, I can envision situations where providing a FASTA whitelist file would be simpler and wouldn't require modifying the virus database. If the primary database is already just a FASTA file of all viruses, then being able to specify a custom FASTA file of, say, all human viruses would be great.
