Inflated size of corrected reads compared to input fasta #26
Hello Annabel,

Yes, such behaviour has already been reported to me, but unfortunately I haven't been able to reproduce it. This was a bug I fixed in a previous version of CONSENT, and such behaviour is indeed abnormal, so thanks for pointing it out. I'm going to re-run some of my experiments and see if I can reproduce / pinpoint the problem. Meanwhile, I have a few questions so I can try to further help you resolve this:
Best,
Also, could you randomly pick any header from your corrected reads file, and perform a
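The exact command was lost in rendering, but judging by the follow-up reply (which mentions omitting `-n` with the grep), the suggested check presumably amounted to locating every occurrence of one input read's header in the corrected output. A minimal sketch, with the file name and header as placeholders:

```shell
# Hypothetical reconstruction (the actual command did not survive rendering):
# print every line (with its line number) where a given read header
# appears in the corrected FASTA. File name and header are placeholders.
grep -n "^>read_1234" corrected.fasta
```

If one header shows up many times, each input read is being emitted repeatedly rather than merely split.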
Hi Pierre, Thank you so much for getting back to me so quickly. I really appreciate your help.
Hi Annabel,
Also, do the binaries

Finally, if you still have it, I'd be interested to take a look at your

Best,
Hi Pierre,

Apologies: I'd noticed that I had omitted the -n with the grep, but the repeat hadn't completed before my day ended. Here's the output you actually asked for:

All the binaries are there in the bin folder, and I could see from tracking the processes that the explode files got written and the paf was revisited after explode. The minimap_jobid file is not there any longer (we're past the auto-delete step in CONSENT-correct).

In terms of salvaging, do you think it might be possible to re-sort the paf I have and run just the correction step on that?

Many thanks again for your help,
Hi Annabel,

No worries, and thanks for providing the correct file. It does look like there was an issue with the paf processing. However, if the binaries are where they should be, and the explode files and the paf file were processed correctly, this is very strange. I will take a closer look and rerun some of my experiments, since there is definitely something I'm overlooking here. I'm currently on holidays, though, so I can't promise I'll get to it before the end of the week.

In terms of salvaging, sorting the paf file according to its first column and rerunning the correction step on that would most definitely work. Sorting a large paf file might take some serious time though, which is why I came up with the explode / merge mechanism, which should serve a similar purpose.

I'll make sure to update you as soon as it's fixed. You're very welcome, and thanks for bringing up the issue!

Best,
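For reference, grouping a PAF file by read comes down to a plain lexical sort on its first column (the query name). A minimal sketch, with file names as placeholders:

```shell
# Sort alignments so that all lines for a given read are contiguous,
# which is what sorting "according to its first column" achieves.
# File names are placeholders. For a very large file, GNU sort's
# -T <tmpdir> and --parallel=<n> options can reduce the wall-clock cost.
sort -k1,1 alignments.paf > alignments.sorted.paf
```

The explode / merge mechanism mentioned above avoids this full sort by splitting the paf into per-batch files and merging them back, which is cheaper on large inputs.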
Just pushed a fix that should resolve the issue. I apologize for the inconvenience.

Best,
Hello Pierre,
I am running CONSENT-correct on a 20x PacBio dataset for a 1Gb genome. The version was cloned from your git repository on the 18th Feb 2021 (i.e. the most current version).
It has yet to complete, but in the process of trying to figure out how close to done it might be, I have been checking the output.
There are 3.2M uncorrected reads in the dataset but over 13M corrected reads so far written to the corrected.fasta. This is not simply a case of reads being split as there are 23Gb of sequence in the input dataset and >82Gb in the output.
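For anyone wanting to reproduce this size comparison, the total number of bases in a FASTA file can be summed with a one-liner along these lines (the file name is a placeholder):

```shell
# Sum the sequence length over all non-header lines of a FASTA file
# (file name is a placeholder). Running this on both the input and the
# corrected output gives the totals being compared here.
awk '!/^>/ { total += length($0) } END { print total }' reads.fasta
```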
I have seen some indications in the issues thread that this behaviour has been seen before but I would value your opinion on if/how I can salvage something from this run (or how to avoid this problem on repeating).
I checked for header uniqueness by sorting the output and running `uniq`, and find that the inflation can be explained by this.

Many thanks,
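The uniqueness check described above can be sketched as a small pipeline that prints only the duplicated headers (the file name is a placeholder):

```shell
# Extract FASTA headers, sort them, and keep only those that occur
# more than once (file name is a placeholder). Any output here means
# the same read was written multiple times to the corrected file.
grep "^>" corrected.fasta | sort | uniq -d
```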
Annabel