Add umicollapse and benchmark it against umitools dedup #1087

MatthiasZepper · 2023-10-05T13:37:18Z

Description of feature

umicollapse is a Java tool and nf-core module so far not used by any pipeline.

It was specifically written as a more performant tool, which supposedly produces output equivalent to umi-tools dedup, but with significantly shorter runtimes (minutes instead of hours).

Therefore, it might make sense to switch from umi-tools dedup to umicolllapse, or at least offer it as an alternative route, similar to the coexistence of TrimGalore / FastP in the future.

To advance this development, please:

Familiarize yourself with umicollapse and its settings.
Create a new branch of the pipeline in your fork and add the umicollapse module to it, replacing the umi-tools dedup step. Update the module config accordingly.
Benchmark the modified pipeline against version 3.12 when run on data with UMIs, comparing runtimes and quantification results. Possible benchmarking datasets are e.g. GSE207129 or GSE179992 with those parameters
```
with_umi: true 
umitools_extract_method: "regex"
umitools_bc_pattern2: "^(?P<umi_1>.{8})(?P<discard_1>.{6}).*"
skip_markDuplicates: True
```
or Experiment 2 of Series GSE171427 when run with
```
   with_umi : true
  umitools_extract_method : 'regex'
  umitools_bc_pattern : "^(?P<umi_1>.{12}).*"
  skip_markDuplicates: true
  clip_r2: 9
```
I think, this is a suitable Hackathon task for a more advanced nf-core contributor. Thanks!

The text was updated successfully, but these errors were encountered:

CharlotteAnne · 2023-10-10T07:00:29Z

Hi there, I wrote the module! We are using it in our updated nf-core/clipseq pipeline (not released yet). It is able to handle much larger bams and I have benchmarked that it gives identical results. So hopefully this should be an easy issue.

MatthiasZepper · 2023-10-10T11:23:30Z

Great work and thank you so much for the info!

So then this issue comes down to writing a respective subworkflow similar to bam_dedup_stats_samtools_umitools, but using umicollapse instead.

Maybe @drpatelh can give a hint, if he prefers replacing umi-tools dedup or leaving it to the user to choose in that case?

drpatelh · 2023-10-11T18:26:25Z

Be interested to see what your implementation looks like @CharlotteAnne ! But adding an independent subworkflow either way as you suggested @MatthiasZepper is a good idea. We can then decide how we slip it into the pipeline.

CharlotteAnne · 2023-10-11T18:40:42Z

you can see in the https://github.com/nf-core/clipseq/tree/feat-2-0 branch of clipseq. we dont used the subworkflow because the samtools stats clutters up our pipeline output without giving any real useful information

CharlotteAnne · 2023-10-11T18:41:55Z

actually thats a lie lol at the moment we do https://github.com/nf-core/clipseq/blob/feat-2-0/subworkflows/local/bam_dedup_samtools_umicollapse.nf but im gonna refactor becuz of samtools stats clutter

ctuni · 2023-11-23T12:37:57Z

Hello! After a conversation with @MatthiasZepper over Slack I decided I would get hands-on on this issue. I've already created a branch on a fork of the repo and added the functionality for the user to select whether they want to deduplicate using umitools or umicollapse. Then I added the umicollapse module to the pipeline and @CharlotteAnne 's subworkflow from clipseq.

I have not made the pull request yet because I have not tested it and benchmarked, but feel free to "read" the changes in the code and give all the feedback you want. I'll keep you posted with the results once I have tested :)

ctuni · 2023-11-24T09:55:55Z

The added umicollapse process is working correctly, but I had to comment out the line where the .command.log is copied to a file because I got an error about the file not existing... And also had to comment out the lines referencing that log file downstream. Maybe @CharlotteAnne has some idea about how to correctly generate the log for the process? I think a solution could be to pipe the STDOUT of the umicollapse command since it's very informative to the log file.

I'm testing only that it currently runs and completes with two samples of the first dataset linked, subsampled to 10k reads (I'm testing it locally 😅 ), but would love to really benchmark this implementation with more real data. Any suggestion or idea on how could we do this?

MatthiasZepper · 2023-11-24T14:35:44Z

@ctuni : Thank you so much for all your work and effort! Greatly appreciated! We have some sequencing data from human XpressRef Universal Total RNA in different input concentrations and kits, which I could run with your pipeline for comparison.

It will take a couple of days, but then we have a relatively large dataset with a known ground truth to compare against. I could share e.g. the salmon quant files with you for your own analysis?

MatthiasZepper · 2024-01-03T11:59:40Z

To give you a small update about my progress (or the lack thereof 😬).

I fetched @ctuni 's modification to the pipeline based on @CharlotteAnne 's subworkflow, but felt that for evaluation purposes, it would be better to compare the tools on exactly the same BAM files instead of realigning the same reads again.

Therefore, I tried to .tap() the BAM transcriptome channel, modify the meta.id for the branched samples with an additional suffix, run both tools in parallel and later mix the output channels again.

With the same approach, I planned to also get a close comparison of the effects that the Prepare for Salmon process has, since the whole undertaking started in Slack with doubts about that script: #828.

Thus, essentially getting 4 quantifications (umitools, umicollapse) x (prepare for salmon, skip prepare for salmon) for the same BAM alignment. The pipeline modified for test purposes resides in my fork, but I keep getting out of memory errors for umicollapse.

Even 256GB heap doesn't seem to be enough for a single pass run:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
  	at java.base/java.util.regex.Matcher.<init>(Matcher.java:248)
  	at java.base/java.util.regex.Pattern.matcher(Pattern.java:1134)
  	at umicollapse.util.SAMRead.getUMI(SAMRead.java:34)
  	at umicollapse.main.DeduplicateSAM.deduplicateAndMerge(DeduplicateSAM.java:109)
  	at umicollapse.main.Main.main(Main.java:221)

Therefore, I still have to play around with the Java Virtual Machine Memory settings for initial heap size, max heap size and stack size and probably set --two-pass as default argument in the module.config.

siddharthab · 2024-08-03T19:35:09Z

Thank you for this effort. Please let me know if I can help in some way. I would appreciate being able to finish the entire pipeline in under 2 hours, instead of nearly the whole day that it takes right now.

MatthiasZepper · 2024-08-14T12:09:10Z

Of course, you can help! Much appreciated!

The subworkflow with umicollapse is done, but I was holding back with integrating it into the rnaseq pipeline, since it to me seemed that release 3.15 was immanent and more urgent tasks had to be done. But since the big refactor is done, you can probably now integrate it without the risk of building on top of a heavily changing foundation with many moving parts now (*hopefully)

Hoe about assigning yourself to this task and taking over?

apeltzer · 2024-08-14T17:56:14Z

Smrnaseq has it implemented too - if you're looking for insights how it was implemented there, have a look 👍

siddharthab · 2024-08-14T17:56:15Z

Thank you! I don't have permissions to change assignments. Please do so if you can. I will send a PR shortly.

MatthiasZepper · 2024-08-15T10:07:35Z

Great! Please join the nf-core GitHub organization asap, so you can better interact with the different repos and issues! I think e.g. @drpatelh would be happy to send you an invitation, since you are contributing to the rnaseq pipeline.

siddharthab · 2024-08-21T22:27:15Z

Running a little behind on this, but still on my list, somewhere near the top.

MatthiasZepper · 2024-08-22T09:25:22Z

Running a little behind on this, but still on my list, somewhere near the top.

Please don't stress yourself over it. I have been neglecting this for months and without your help, it would probably not make it anywhere soon either. So take your time, it will anyway not be included in the next release 3.15 anymore (for which a feature freeze is essentially in place) and the 3.16 release is still a few weeks ahead!

siddharthab · 2024-09-03T06:22:34Z

WIP PR at #1369. I think we definitely have to use --two-pass, and I am not sure how good umicollapse is with paired reads.

siddharthab · 2024-09-03T18:18:02Z

Crude benchmark results from processing GSE207129. Note that these are paired reads. Execution timeline reports attached.

Some notes:

--paired uses considerably more memory and time, and processing transcriptomes is much worse than with umitools. With --two-pass, it takes 9-10 hours for umicollapse instead of ~30 minutes for umitools. Processing genomes takes roughly half the time (but with 10x more memory).
--two-pass keeps memory requirements low. For a 3.2 GB transcriptome sorted BAM file with paired reads, --two-pass takes 15 GB instead of 75 GB RAM.

Does anyone have any thoughts on what I could be doing wrong here?

Execution Timeline.zip

CharlotteAnne · 2024-09-03T18:46:36Z

I couldn't see easily - are you testing with data that has UMIs? how long are the UMIs and what is the PCR deduplication ratio of the dataset? With CLIP we found similar, that sometimes UMIcollapse can be comparably inefficient but it works for tricky large BAM files where umitools just gives up. Probably a question for the tool author.

…

On Tue, 3 Sept 2024 at 19:18, Siddhartha Bagaria ***@***.***> wrote: Crude benchmark results from processing GSE207129 <https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE207129>. Note that these are paired reads. Execution timeline reports attached. Some notes: 1. --paired uses considerably more memory and time, and processing transcriptomes is much worse than with umitools. With --two-pass, it takes 9-10 hours for umicollapse instead of ~30 minutes for umitools. Processing genomes takes roughly half the time (but with 10x more memory). 2. --two-pass keeps memory requirements low. For a 3.2 GB transcriptome sorted BAM file with paired reads, --two-pass takes 15 GB instead of 75 GB RAM. Does anyone have any thoughts on what I could be doing wrong here? Execution Timeline.zip <https://github.com/user-attachments/files/16853028/Execution.Timeline.zip> — Reply to this email directly, view it on GitHub <#1087 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AFVBH3JFBKJ2GTZBW62K4FLZUX4PBAVCNFSM6AAAAABL6E574SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRXGE2DKMJRHA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

siddharthab · 2024-09-03T20:13:31Z

Yes, the data has 8 bp long UMIs. This is what umi-tools says:

Reads: Input Reads: 44319354, Read pairs: 44319354
Number of reads out: 32494556
Total number of positions deduplicated: 30086237
Mean number of unique UMIs per position: 1.11
Max. number of unique UMIs per position: 147

CharlotteAnne · 2024-09-03T21:29:17Z

Ah I see! These files have many more reads than we would typically process from CLIP, but our PCR deduplication ratio is higher and the UMIs can be longer, which both also impact the memory requirement. Perhaps the smallRNASeq folk would be better able to advise and the tool author himself on his github repo.

…

On Tue, 3 Sept 2024 at 21:13, Siddhartha Bagaria ***@***.***> wrote: Yes, the data has 8 bp long UMIs. This is what umi-tools says: Reads: Input Reads: 44319354, Read pairs: 44319354 Number of reads out: 32494556 Total number of positions deduplicated: 30086237 Mean number of unique UMIs per position: 1.11 Max. number of unique UMIs per position: 147 — Reply to this email directly, view it on GitHub <#1087 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AFVBH3KZYMFLRZQVIHWRVMDZUYKADAVCNFSM6AAAAABL6E574SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRXGM2TKNJQGM> . You are receiving this because you were mentioned.Message ID: ***@***.***>

siddharthab · 2024-09-03T22:19:11Z

Filed Daniel-Liu-c0deb0t/UMICollapse#31 to get a response from the tool author.

If paired ends is indeed this slow, my suggestion would be to continue to keep umi-tools the default and provide UMICollapse only as an option manually selected, probably without integration with multiqc.

siddharthab · 2024-09-03T22:22:09Z

@apeltzer Have you ever run UMICollapse in paired end mode? It seems like the command line the pipeline would form is incorrect (--method should be --algo) and so maybe UMICollapse was never tested in paired mode?

https://github.com/nf-core/smrnaseq/blob/5901bea438cb719c186f883bb6e6a11f1de70a5b/conf/modules.config#L143

siddharthab · 2024-09-04T07:20:09Z

After fixing the issue in UMICollapse, in general, UMICollapse takes ~60% of the wall time of umi-tools. We will have to wait for the fix to be accepted and a new release in bioconda.

apeltzer · 2024-09-04T08:33:45Z

Smrnaseq only runs on SE data so the PE was never used. As smrnaseq data is so short, paired end protocols do not make any sense for smrna data. For SE data we saw quite substantially better performance than what we had with umitools.

MatthiasZepper · 2024-09-04T11:58:12Z

Wow, one night of good sleep and one morning of not paying attention to GitHub and meanwhile this happened...🤯.

I have not yet reviewed your PR, but from briefly glimpsing over this conversation I already got the impression that you, @siddharthab, made an outstanding contribution! Thank you so much for your PR to the pipeline, including an extensive QC and profiling!

I am flabbergasted by your perspicaciousness - identifying the reason for the slow transcriptome deduplication in the tool's code in no-time and fixing it right away! Brilliant!

* Add UMICollapse module Fixes #2813. See nf-core/rnaseq#1087 for broader discussion. * Fix tests and lints * Fix scale name and remove unused import * Shorter col names --------- Co-authored-by: Vlad Savelyev <[email protected]>

siddharthab · 2024-10-03T23:19:50Z

@CharlotteAnne, looks like the conda recipe uses your forked repo. Would you be willing to update the recipe to use a released version from your or mine repo until the author merges the change?

MatthiasZepper · 2024-10-08T13:55:55Z

Since the Bioconda recipe anyway points to a fork of the original repo, I see no problem with changing it to your fork @siddharthab. If you could make a 1.0.1 release that includes the transcriptomic dedup bug fix, I can update the Bioconda recipe to your fork and to the correct SHA.

MatthiasZepper added the enhancement label Oct 5, 2023

ctuni self-assigned this Nov 23, 2023

MatthiasZepper self-assigned this Mar 18, 2024

MatthiasZepper removed their assignment Aug 14, 2024

MatthiasZepper assigned siddharthab Aug 15, 2024

MatthiasZepper added this to the 3.16.0 milestone Aug 21, 2024

siddharthab mentioned this issue Sep 3, 2024

Very slow paired reads mode for transcriptome Daniel-Liu-c0deb0t/UMICollapse#31

Open

apeltzer mentioned this issue Sep 4, 2024

Umicollapse algo wrong set nf-core/smrnaseq#404

Closed

siddharthab mentioned this issue Sep 4, 2024

Add UMICollapse module MultiQC/MultiQC#2814

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add umicollapse and benchmark it against umitools dedup #1087

Add umicollapse and benchmark it against umitools dedup #1087

MatthiasZepper commented Oct 5, 2023

CharlotteAnne commented Oct 10, 2023

MatthiasZepper commented Oct 10, 2023

drpatelh commented Oct 11, 2023

CharlotteAnne commented Oct 11, 2023 via email •

edited

Loading

CharlotteAnne commented Oct 11, 2023 via email •

edited

Loading

ctuni commented Nov 23, 2023 •

edited

Loading

ctuni commented Nov 24, 2023

MatthiasZepper commented Nov 24, 2023

MatthiasZepper commented Jan 3, 2024

siddharthab commented Aug 3, 2024

MatthiasZepper commented Aug 14, 2024

apeltzer commented Aug 14, 2024

siddharthab commented Aug 14, 2024

MatthiasZepper commented Aug 15, 2024

siddharthab commented Aug 21, 2024

MatthiasZepper commented Aug 22, 2024

siddharthab commented Sep 3, 2024

siddharthab commented Sep 3, 2024

CharlotteAnne commented Sep 3, 2024 via email

siddharthab commented Sep 3, 2024

CharlotteAnne commented Sep 3, 2024 via email

siddharthab commented Sep 3, 2024

siddharthab commented Sep 3, 2024

siddharthab commented Sep 4, 2024

apeltzer commented Sep 4, 2024

MatthiasZepper commented Sep 4, 2024

siddharthab commented Oct 3, 2024

MatthiasZepper commented Oct 8, 2024 •

edited

Loading

Add umicollapse and benchmark it against umitools dedup #1087

Add umicollapse and benchmark it against umitools dedup #1087

Comments

MatthiasZepper commented Oct 5, 2023

Description of feature

CharlotteAnne commented Oct 10, 2023

MatthiasZepper commented Oct 10, 2023

drpatelh commented Oct 11, 2023

CharlotteAnne commented Oct 11, 2023 via email • edited Loading

CharlotteAnne commented Oct 11, 2023 via email • edited Loading

ctuni commented Nov 23, 2023 • edited Loading

ctuni commented Nov 24, 2023

MatthiasZepper commented Nov 24, 2023

MatthiasZepper commented Jan 3, 2024

siddharthab commented Aug 3, 2024

MatthiasZepper commented Aug 14, 2024

apeltzer commented Aug 14, 2024

siddharthab commented Aug 14, 2024

MatthiasZepper commented Aug 15, 2024

siddharthab commented Aug 21, 2024

MatthiasZepper commented Aug 22, 2024

siddharthab commented Sep 3, 2024

siddharthab commented Sep 3, 2024

CharlotteAnne commented Sep 3, 2024 via email

siddharthab commented Sep 3, 2024

CharlotteAnne commented Sep 3, 2024 via email

siddharthab commented Sep 3, 2024

siddharthab commented Sep 3, 2024

siddharthab commented Sep 4, 2024

apeltzer commented Sep 4, 2024

MatthiasZepper commented Sep 4, 2024

siddharthab commented Oct 3, 2024

MatthiasZepper commented Oct 8, 2024 • edited Loading

CharlotteAnne commented Oct 11, 2023 via email •

edited

Loading

CharlotteAnne commented Oct 11, 2023 via email •

edited

Loading

ctuni commented Nov 23, 2023 •

edited

Loading

MatthiasZepper commented Oct 8, 2024 •

edited

Loading