Identify and eliminate runtime bottlenecks #50

Closed
willbradshaw opened this issue Sep 18, 2024 · 2 comments
Assignees
Labels
enhancement New feature or request priority_1

Comments

@willbradshaw
Contributor

We've done quite well at cutting down the compute demands of the pipeline, but several steps still limit runtime (e.g. 1, 2, 3). Do a thorough survey of all major pipeline workflows and eliminate the major runtime bottlenecks, particularly those that are slow without being compute-hungry.
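
As a rough illustration of such a survey, the sketch below ranks steps by total wall-clock time and flags those whose CPU time is small relative to their wall-clock time. The input layout (a tab-separated table with columns `step`, `wall_s`, `cpu_s`), the function name `rank_bottlenecks`, and the 0.5 threshold are all assumptions made for this example, not something the pipeline is known to produce or use.

```python
import csv
import io
from collections import defaultdict


def rank_bottlenecks(tsv_text, max_cpu_fraction=0.5):
    """Rank steps by total wall-clock time and flag low-CPU steps.

    Assumes a tab-separated table with a header row and three
    hypothetical columns: step, wall_s (wall-clock seconds) and
    cpu_s (CPU seconds). One row per task; tasks of the same step
    are summed.
    """
    wall = defaultdict(float)
    cpu = defaultdict(float)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        wall[row["step"]] += float(row["wall_s"])
        cpu[row["step"]] += float(row["cpu_s"])

    ranked = []
    # Slowest steps first.
    for step in sorted(wall, key=wall.get, reverse=True):
        cpu_fraction = cpu[step] / wall[step] if wall[step] > 0 else 0.0
        # A low CPU fraction suggests the step is slow but not compute-hungry.
        ranked.append((step, wall[step], cpu_fraction <= max_cpu_fraction))
    return ranked


# Placeholder step names and numbers, purely for illustration.
example = "step\twall_s\tcpu_s\nSTEP_A\t100\t10\nSTEP_B\t50\t200\nSTEP_A\t20\t5\n"
for step, wall_s, low_cpu in rank_bottlenecks(example):
    print(step, wall_s, "low CPU use" if low_cpu else "compute-bound")
```

Steps that rank high and are flagged as low CPU use would be the first candidates to look at under this heuristic.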

@willbradshaw willbradshaw added enhancement New feature or request priority_1 labels Sep 18, 2024
@willbradshaw
Contributor Author

Based on @harmonbhasin's analysis, these seem to be the biggest wall-clock bottlenecks right now:

  • CONCAT_GZIPPED -- This is dealt with in another issue and is already mostly resolved by @harmonbhasin.
  • FASTQC -- This is a widely-used tool and probably pretty well-optimized, but it sure is eating up a lot of runtime. It's also doing a bunch of things we don't need, so it's possible we could replace it with custom code that does only the things we do need. But this sounds like a big project and isn't worth prioritizing now, especially since running FASTQC doesn't block the main parts of the pipeline.
  • Trimmomatic -- Even when it only runs on putative HV reads from bbduk, this tool can take up a lot of time -- and unlike FASTQC, it does block the rest of the HV pipeline. Definitely worth seeing if we can make it redundant by configuring other, faster adapter-removal tools.
  • SUBSET_READS_PAIRED_TARGET -- This is currently implemented with seqtk. Since it runs on the entire dataset, it's possible this is just irreducibly slow, but it's worth investigating whether alternative tools could do the same thing while lopping off some runtime.
  • LABEL_KRAKEN_REPORTS -- This is a dumb single-threaded R script that's doing something that could definitely be done faster with awk or similar.
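
To make the last point concrete: what LABEL_KRAKEN_REPORTS actually does is not spelled out in this thread. Assuming it only tags each row of a tab-separated Kraken report with the sample it came from, the work can be done in a single streaming pass with no parsing beyond line splitting. The sketch below uses Python rather than awk; the function name `label_report` and the trailing `sample` column are hypothetical.

```python
import io


def label_report(lines, sample, header=None):
    """Yield report lines with a trailing sample column appended.

    `lines` is any iterable of tab-separated text lines (for example
    an open file), so the report is never held in memory as a whole.
    `header` is an optional list of column names to emit first.
    The appended `sample` column is an assumption about what
    "labeling" means here.
    """
    if header is not None:
        yield "\t".join(list(header) + ["sample"]) + "\n"
    for line in lines:
        line = line.rstrip("\n")  # keep leading spaces used for indentation
        if line:
            yield f"{line}\t{sample}\n"


# Two rows in the standard six-column Kraken 2 report layout;
# the values are placeholders.
report = io.StringIO(
    " 50.00\t5\t5\tU\t0\tunclassified\n"
    " 50.00\t5\t0\tR\t1\troot\n"
)
for out_line in label_report(report, "sample1"):
    print(out_line, end="")
```

Because the function is a generator over lines, it can be pointed at a (gzip-)opened file handle and written straight to an output handle, which is roughly what an awk one-liner would do.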

These others are dumb serial scripts that we should probably replace at some point, but they don't seem like a huge problem right now: they aren't eating much wall-clock time or compute when run on large datasets in the current pipeline:

  • JOIN_FASTQ
  • EXTRACT_UNCONC_READ_IDS
  • EXTRACT_UNCONC_READS
  • PROCESS_BOWTIE2_SAM_PAIRED
  • PROCESS_KRAKEN_HV
  • LABEL_BRACKEN_REPORTS

@harmonbhasin harmonbhasin self-assigned this Oct 3, 2024
@willbradshaw
Contributor Author

Most of these now have their own issues: #129, #127, #128, #74, #122, #131. Closing this one as redundant and overbroad.
