Java heap space #14

Open
karlkashofer opened this issue Feb 8, 2022 · 4 comments

@karlkashofer

Running UMICollapse on 200 million paired-end reads (400 reads total) runs out of Java heap space even with -Xmx96G.
Is that normal?

@Daniel-Liu-c0deb0t
Owner

It should not fail with only 400 reads. Have you tried setting -Xms to a larger value? That is the initial heap size. What is the exact command you are running? Paired-end mode takes up more memory, but it shouldn't run out of memory for only 400 reads.
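For example, something along these lines (just a sketch — adjust the mode, file names, and heap sizes to your actual setup):

```bash
# Set both the initial (-Xms) and maximum (-Xmx) heap size up front.
# input.bam / dedup.bam are placeholders for your files.
java -Xms32G -Xmx96G -jar umicollapse.jar bam \
    -i input.bam -o dedup.bam --paired
```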

@karlkashofer
Author

Sorry, I meant 200 million paired-end reads, which is 400 million reads total.

@Daniel-Liu-c0deb0t
Owner

If you are using paired-end mode (--paired), it takes a lot of memory, because it has to make sure pairs of reads stay together during deduplication, which means storing many reads in memory at once. Potential workarounds are splitting the 200 million paired-end reads into smaller files and deduplicating those separately, or not using paired-end mode (but then there might exist pairs of reads where only one read of the pair is removed).
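For the splitting workaround, a rough sketch with samtools (assuming a coordinate-sorted, indexed BAM; the chromosome list and file names are placeholders):

```bash
# Deduplicate one chromosome at a time, then merge the results.
for chr in chr1 chr2 chr3; do   # extend to all chromosomes in your reference
    samtools view -b input.bam "$chr" > "$chr.bam"
    java -Xmx32G -jar umicollapse.jar bam --paired -i "$chr.bam" -o "$chr.dedup.bam"
done
samtools merge dedup.merged.bam chr*.dedup.bam
```

Note that splitting by chromosome only keeps a pair together when both mates map to the same chromosome, so discordant pairs would still be affected.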

@karlkashofer
Author

karlkashofer commented Apr 5, 2022

Yes, I use --paired, as this is Illumina NovaSeq data from Agilent XT libraries (dual index and dual UMI).
I don't really understand why --paired needs so much memory. In your paper you state that "the reads at each unique alignment location are independently deduplicated based on the UMI sequences", so my understanding is that it only needs to keep the reads at a single position in memory. I deduplicate WGS data, where there is hardly a position with more than 100 reads, so I really don't understand why it would require > 80 GB of memory.
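For reference, a quick way to sanity-check the per-position depth (a sketch; input.bam is a placeholder):

```bash
# Print the maximum per-position coverage observed in the BAM.
samtools depth input.bam | awk '$3 > max { max = $3 } END { print "max depth:", max }'
```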

Thanks for your work, btw! :)
