Identify and eliminate runtime bottlenecks #50

willbradshaw · 2024-09-18T14:59:54Z

We've done quite well at cutting down the compute demands of the pipeline, but several steps still limit runtime (e.g. 1, 2, 3). Do a thorough survey of all major pipeline workflows and eliminate the major runtime bottlenecks, particularly those that are not especially compute-hungry.
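One way to start such a survey is to rank steps by total wall-clock time and flag the ones whose CPU use is low relative to that time. The sketch below is only illustrative: it assumes per-task timings have already been exported to a tab-separated table, and the column names `process`, `wall_s` and `cpu_s` are hypothetical, not something the pipeline is known to produce.

```python
import csv
import io
from collections import defaultdict


def rank_bottlenecks(trace_tsv, low_cpu_ratio=1.0):
    """Rank processes by total wall-clock time, flagging low CPU use.

    `trace_tsv` is any iterable of lines from a tab-separated table with
    the hypothetical columns `process`, `wall_s` and `cpu_s` (assumed).
    Returns (process, total_wall_s, cpu_per_wall, is_low_cpu) tuples,
    slowest process first.
    """
    wall = defaultdict(float)
    cpu = defaultdict(float)
    for row in csv.DictReader(trace_tsv, delimiter="\t"):
        wall[row["process"]] += float(row["wall_s"])
        cpu[row["process"]] += float(row["cpu_s"])

    ranked = []
    for name in sorted(wall, key=wall.get, reverse=True):
        ratio = cpu[name] / wall[name] if wall[name] > 0 else 0.0
        # A ratio below the threshold means the step spends its wall-clock
        # time mostly waiting or running serially, not burning CPU.
        ranked.append((name, wall[name], ratio, ratio < low_cpu_ratio))
    return ranked


if __name__ == "__main__":
    demo = "process\twall_s\tcpu_s\nA\t100\t50\nB\t300\t1200\nA\t100\t50\n"
    for name, total, ratio, low in rank_bottlenecks(io.StringIO(demo)):
        print(f"{name}\t{total:.0f}s\tcpu/wall={ratio:.2f}\tlow_cpu={low}")
```

Steps that rank high on wall-clock time and are also flagged as low-CPU are the ones this issue is most interested in.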

willbradshaw · 2024-10-03T15:33:37Z

Based on @harmonbhasin's analysis, these seem like the biggest clocktime bottlenecks right now:

CONCAT_GZIPPED -- This is dealt with in another issue and is already mostly resolved by @harmonbhasin.
FASTQC -- This is a widely-used tool and probably pretty well-optimized, but it sure is eating up a lot of runtime. It's also doing a bunch of things we don't need, so it's possible we can replace it with custom code that just does those things. But this sounds like a big project and isn't worth prioritizing now, especially since running FASTQC doesn't block the main parts of the pipeline.
Trimmomatic -- Even when run only on the putative HV reads from bbduk, this tool can take a lot of time, and unlike FASTQC, that does block the rest of the HV pipeline. Definitely worth seeing if we can make it redundant by configuring other, faster adapter-removal tools.
SUBSET_READS_PAIRED_TARGET -- This is currently implemented with seqtk. Since it's running on the entire dataset, it's possible this is just irreducibly slow, but it's worth investigating whether alternative tools could do the same thing while lopping off some runtime.
LABEL_KRAKEN_REPORTS -- This is a dumb single-threaded R script that's doing something that could definitely be done faster with awk or similar.

These others are dumb serial scripts that we should probably replace at some point but don’t seem like a huge problem right now, in that they aren't eating too much clocktime or compute when run on large datasets in the current pipeline:

JOIN_FASTQ
EXTRACT_UNCONC_READ_IDS
EXTRACT_UNCONC_READS
PROCESS_BOWTIE2_SAM_PAIRED
PROCESS_KRAKEN_HV
LABEL_BRACKEN_REPORTS
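As a rough illustration of what a streaming replacement for a step like LABEL_KRAKEN_REPORTS might look like, here is a minimal sketch. It assumes the step only needs to append a sample identifier as an extra column to each line of a gzipped, tab-separated report; that is a guess at the behavior, not a description of the actual script. The function names are hypothetical.

```python
import gzip
import io


def label_report(lines, sample):
    """Yield each non-empty report line with `sample` appended as a new column.

    Works on any iterable of text lines, so the report is never held in
    memory as a whole.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if line:
            yield f"{line}\t{sample}\n"


def label_report_file(in_path, out_path, sample):
    """Stream a gzipped report from `in_path` to `out_path`, adding the label."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        dst.writelines(label_report(src, sample))


if __name__ == "__main__":
    demo = io.StringIO("10.0\t5\t5\tS\t9606\tHomo sapiens\n")
    for labeled in label_report(demo, "sample_1"):
        print(labeled, end="")
```

The same line-at-a-time pattern is what an awk one-liner would do; the point is only that the work is a single pass with constant memory.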

willbradshaw · 2024-12-17T19:40:40Z

Most of these now have their own issues: #129, #127, #128, #74, #122, #131. Closing this one as redundant and overbroad.

willbradshaw added the enhancement (New feature or request) and priority_1 labels Sep 18, 2024

harmonbhasin self-assigned this Oct 3, 2024

willbradshaw closed this as completed Dec 17, 2024
