Performance issue converting fast5 -> pod5 with multiple threads #146
Hi @arturotorreso, Just to confirm: you tested writing two files simultaneously and confirmed the bottleneck wasn't writing to one file, but you did see increased performance when running the conversion on batches of smaller files (300 being optimal)? Can you confirm that the only difference between the two tests where performance differed was the number of files input to the conversion script? Can you also let me know the approximate length of the reads in the files? Can you provide an example command-line snippet you are using to trigger the conversion? Thanks,
Thank you for your quick response!
Yes, there was also decreased performance when running multiple samples simultaneously and writing to separate files. This also depended on the number of input files in each sample. If I ran both samples with 300 input fast5 files each, the slowdown wasn't too bad (2000-3000 reads/s each with -t 4, versus 7000 reads/s if run separately). But if each sample was run with 5000 files, performance dropped to 40-50 reads/s. This does point to a memory issue, but in theory I should have enough CPUs and memory to handle it.
Yes
Yes
We are working mostly with cell-free DNA (~200 bp), but we also find larger DNA fragments (>10 kb). The read length distribution centres around 216 bp (157-776 bp), though the range goes up to 37 kb.
I'm running it straight from the command line:
And for subsets:
Let me know if you need anything else!
Do you see the decrease in performance if you run the commands sequentially by hand, or with a small gap between them? If you restart the terminal session and re-run the experiment, is it faster? What about waiting for a period after the run? I'll attempt to reproduce your results here.
If I run them manually, there's no decrease in performance. With gaps, yes: I put a sleep of 1 minute between runs and still saw the performance decrease. I don't need to restart the terminal; as soon as I kill the job and restart it, it goes faster until performance eventually degrades again. Right now I'm running each file separately in a loop with -t 1 and merging afterwards, and it performs well.
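The per-file workaround described above (one conversion per fast5 with -t 1, then a merge) can be sketched roughly as follows. The original command snippet wasn't preserved in the thread, so the `--output`/`--threads` flags and all paths here are assumptions about the usual `pod5` CLI shape; the sketch only builds the command lines rather than executing them:

```python
from pathlib import Path

def build_per_file_commands(fast5_dir, pod5_dir, merged="sample.pod5"):
    """Build one 'pod5 convert fast5' command per input file (single
    thread each), plus a final 'pod5 merge' command. Commands are
    returned as argv lists and are NOT executed here."""
    fast5_dir, pod5_dir = Path(fast5_dir), Path(pod5_dir)
    converts, outputs = [], []
    for f5 in sorted(fast5_dir.glob("*.fast5")):
        out = pod5_dir / (f5.stem + ".pod5")
        outputs.append(out)
        converts.append(
            ["pod5", "convert", "fast5", str(f5),
             "--output", str(out), "--threads", "1"]
        )
    merge = ["pod5", "merge", *map(str, outputs), "--output", merged]
    return converts, merge
```

Each argv list could then be passed to `subprocess.run` sequentially, which mirrors the "loop with -t 1, merge afterwards" approach that avoided the slowdown.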
I am running pod5 convert fast5 on a sample with about 5000 fast5 files (from the same sample, 4000 reads each), writing to a single pod5 per sample.
So I made subsets of the reads and compared the performance:
I thought this could be a bottleneck due to writing to the same file, but if I run two samples in the background simultaneously (thus writing to two different pod5 files) I run into the same decreasing performance (similar to when using multiple threads on lots of files), and the jobs keep getting sent to state D. My system should have enough memory to handle the job, though.
For now I'm thinking of processing the files in batches and merging into a final pod5 afterwards, but I was curious to know whether this is a known issue and what you would recommend to improve performance when running multiple samples at the same time or with multiple threads.
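The batching idea can be as simple as chunking the input file list before conversion. The batch size of 300 follows the observation earlier in the thread; the helper name and everything else here is illustrative:

```python
def chunk(files, batch_size=300):
    """Split a list of input files into fixed-size batches; each batch
    would then be converted to its own intermediate pod5, with a single
    'pod5 merge' at the end."""
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]
```

For ~5000 files this yields 17 batches (16 of 300 plus one of 200), keeping each conversion job within the size range that performed well.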