We've done quite well at cutting down the compute demands of the pipeline, but several steps still limit runtime (e.g. 1, 2, 3). Do a thorough survey of all major pipeline workflows and eliminate major runtime bottlenecks, particularly those that are not compute-hungry.
Based on @harmonbhasin's analysis, these seem like the biggest clocktime bottlenecks right now:
CONCAT_GZIPPED -- This is dealt with in another issue and is already mostly resolved by @harmonbhasin.
FASTQC -- This is a widely-used tool and probably pretty well-optimized, but it sure is eating up a lot of runtime. It's also doing a bunch of things we don't need, so it's possible we could replace it with custom code that computes only the metrics we actually use (see the first sketch after this list). But that sounds like a big project and isn't worth prioritizing now, especially since running FASTQC doesn't block the main parts of the pipeline.
Trimmomatic -- Even running only on putative HV reads from bbduk, this tool can take a lot of time -- and unlike FASTQC, it does block the rest of the HV pipeline. Definitely worth seeing if we can make it redundant by configuring other, faster adapter-removal tools (see the fastp sketch below).
SUBSET_READS_PAIRED_TARGET -- This is currently implemented with seqtk. Since it runs on the entire dataset, it may be irreducibly slow, but it's worth investigating whether alternative tools or settings could do the same thing while lopping off some runtime (see the seqtk sketch below).
LABEL_KRAKEN_REPORTS -- This is a dumb single-threaded R script doing something that could definitely be done faster with awk or similar (see the awk sketch below).
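To make the FASTQC point concrete, here's a rough sketch of the kind of single-pass replacement that might suffice, assuming (and this would need confirming) that the only FASTQC outputs we actually consume are basic read/base counts and mean quality. `reads.fastq.gz` is a placeholder:

```bash
# Hedged sketch: one streaming pass over a gzipped FASTQ that reports read
# count, base count, and mean Phred quality -- assuming these are the only
# FASTQC outputs we use. Input file name is a placeholder.
zcat reads.fastq.gz | awk '
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    NR % 4 == 0 {                       # every 4th FASTQ line is the quality string
        reads++
        for (i = 1; i <= length($0); i++) {
            qsum += ord[substr($0, i, 1)] - 33   # Phred+33 decoding
            bases++
        }
    }
    END { printf "reads=%d bases=%d mean_q=%.2f\n", reads, bases, (bases ? qsum / bases : 0) }
'
```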
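For Trimmomatic, one candidate replacement is fastp, which trims adapters in a single multithreaded pass and auto-detects paired-end adapters by overlap analysis. A sketch of an invocation -- file names are placeholders, and whether its output matches Trimmomatic closely enough for the HV pipeline would need validation:

```bash
# Sketch only: fastp as a possible Trimmomatic replacement. File names are
# placeholders; for paired-end input, adapter detection via read-pair overlap
# is fastp's default behavior.
fastp \
    --in1 hv_R1.fastq.gz --in2 hv_R2.fastq.gz \
    --out1 trimmed_R1.fastq.gz --out2 trimmed_R2.fastq.gz \
    --thread 4 \
    --json fastp.json --html fastp.html
```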
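For SUBSET_READS_PAIRED_TARGET, one concrete thing to check is count-based vs. fraction-based sampling: seqtk sample with an integer count does reservoir sampling (holding the whole subset in memory), while sampling by a fraction streams; and if output compression dominates, pigz parallelizes it. A sketch with placeholder names -- note that fraction mode yields approximately, not exactly, the target count:

```bash
# Sketch: paired subsampling with seqtk. Using the same seed (-s) on both
# files keeps mates in sync. Sampling by fraction streams, unlike sampling
# by an integer count; pigz parallelizes the output compression.
# TARGET, TOTAL, and all file names are placeholders.
FRACTION=$(awk -v t="$TARGET" -v n="$TOTAL" 'BEGIN { print t / n }')
seqtk sample -s42 reads_R1.fastq.gz "$FRACTION" | pigz > subset_R1.fastq.gz
seqtk sample -s42 reads_R2.fastq.gz "$FRACTION" | pigz > subset_R2.fastq.gz
```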
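For LABEL_KRAKEN_REPORTS, assuming "labeling" just means prepending a sample-ID column to each tab-separated report (worth checking against what the R script actually does), the awk version is short. The file layout here is hypothetical:

```bash
# Hedged sketch: prepend a sample-ID column to each Kraken report and
# concatenate. Assumes the sample ID is recoverable from the file name and
# that "labeling" means adding this column -- both would need checking
# against the R script's actual behavior.
for report in reports/*.kreport; do
    sample=$(basename "$report" .kreport)
    awk -v s="$sample" 'BEGIN { OFS = "\t" } { print s, $0 }' "$report"
done > labeled_reports.tsv
```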
There are also some other dumb serial scripts that we should probably replace at some point, but they don't seem like a huge problem right now, in that they aren't eating much clocktime or compute when run on large datasets in the current pipeline.