bcftools run over 18 hours only calling 11Mb sequence of one chromosome #2000
Comments
It's possible you encountered a bug:
|
Thanks for your reply. |
The latest version on GitHub still ran slowly in my environment: calling 500kb took 4~5 hours. |
Out of interest, is it still very slow with You may also consider using |
@jkbonfield Its speed recovered normally with |
Agreed indels matter. It was more as a useful guide to know where the problem lay. Sadly it's what I expected though. The indel realignment code simply was never designed for long reads and the band size can get unwieldy, making it run glacial on long reads. :( Does disabling BAQ ( It's been a while since I looked at the code, but I think I did fix the worst slowdowns, such as only doing realignment within a portion of the read rather than the entire thing (ie assuming it's short). |
While experimenting, can you also try |
Hi, I tried |
Is it possible to share a small anonymized BAM slice for debugging? |
The alignment of the first 300kb of chr1 has been generated; I hope it is helpful to you. |
Thank you for the data. A very quick test shows it was spending almost all of its time in Now obviously you probably want indel calls, but do you want ALL of them irrespective of likelihood? This is quite deep data. We initiate a candidate indel when just 2 reads have an indel ( I note that in the default parameters, Also @pd3 - I tried --indels-2.0 and it errors out with an assertion failure. |
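To illustrate why deep data produces so many candidates, here is a minimal sketch of a read-count plus read-fraction test, using the `-m 2 -F 0.05` values quoted later in this thread. It is not the bcftools implementation: the function name is hypothetical, the depths are made-up examples, and treating the two thresholds as a simple AND over pooled reads is an assumption (the real code may apply them per sample or differently).

```python
def is_indel_candidate(depth, indel_reads, min_ireads=2, gap_frac=0.05):
    """Simplified model: seed a candidate indel only if enough reads carry
    the indel in absolute terms AND as a fraction of depth.
    Assumption: both thresholds must pass; real bcftools logic may differ."""
    if depth <= 0:
        return False
    return indel_reads >= min_ireads and indel_reads / depth >= gap_frac


# Hypothetical sites: (depth, number of reads showing an indel)
sites = [(30, 2), (200, 2), (200, 12)]

for depth, indel_reads in sites:
    default = is_indel_candidate(depth, indel_reads)
    # A stricter, purely illustrative setting: require at least 5 indel reads.
    strict = is_indel_candidate(depth, indel_reads, min_ireads=5)
    print(f"depth={depth} indel_reads={indel_reads} default={default} strict={strict}")
```

Under this model, raising the minimum read count only removes sites that few reads support, which is the trade-off being discussed: fewer low-likelihood candidates to realign, at the cost of possibly missing some real low-coverage indels.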
Also for completeness, I tried the rejected PR #1679 on this data. (See comment #1679 (comment)) bcftools-1.18 (-m 2 -F 0.05) took 51s. Mpileup gave 2021 indels, call filtered to 474 indels. Focusing on PR1679: Focusing on Eyeballing a few in a narrow region:
Basically there are a few cases I see where 1.18 called correctly and the PR missed, but the vast majority of new calls in the PR are genuine. So I think this is demonstration that the PR shouldn't have been rejected. It's clearly doing both a better job of calling AND being considerably faster to boot. |
Back to the develop bcftools: I've argued (unsuccessfully) before that the BAQ HMM in I decided to do a VERY quick hack and see:
It's crude, but it attempts to give a score in vaguely the same orientation and scale. This doesn't use quality values at all, and instead opts for the simple ksw algorithm.

This sped up mpileup from 50s to 9s. It gave 535 indels (vs 474 in the original), of which 400 are shared. Of the 74 missed, 35 were called by my PR and 39 not (so close to half). Of the 135 extra not seen in the original code, 119 match the PR's calls. Given my previous assessments, that appears to imply it works well.

To me this validates the notion that BAQ is just a plain poor choice for evaluating alignment quality. It is enormously slow while not actually being very good at telling good from bad alignment, plus it causes a lot of missed true variants due to rejecting alignments based on them containing repeats. We'd probably want to retrofit some score adjustment technique based on the quality scores too.

I'm being pulled in many directions at the moment and I'm not sure I'd have time to do a full evaluation, plus I have little desire given my previous efforts at improving bcftools indel calling were ignored unless there was an upfront assertion changes would be accepted (I spent months improving the indel caller before with no gain). However I strongly reiterate my suggestion that using BAQ as a proxy for alignment validity is not a good idea and we can do so much better and so much faster by using a proper alignment algorithm. (Ksw was just the easiest to pick up and go, but there are any number of alternatives out there.) |
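For readers following the numbers, the call-set comparison quoted above can be re-derived from the four counts and two timings given in that comment. This is only bookkeeping on figures already stated in the thread; the variable names are mine, and nothing here measures accuracy against a truth set.

```python
# Figures quoted in the comment above.
original_total = 474   # indels from the original code after `call`
hack_total = 535       # indels from the ksw-based quick hack
shared = 400           # indels present in both call sets

missed = original_total - shared   # in the original, absent from the hack
extra = hack_total - shared        # in the hack, absent from the original

missed_in_pr = 35      # missed calls that PR #1679 also made
extra_in_pr = 119      # extra calls that PR #1679 also made

print(f"missed by hack: {missed} ({missed_in_pr / missed:.0%} also in the PR)")
print(f"extra in hack:  {extra} ({extra_in_pr / extra:.0%} also in the PR)")
print(f"mpileup speed-up: {50 / 9:.1f}x (50s -> 9s)")
```

The differences come out to 74 missed and 135 extra, matching the figures in the comment.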
@jkbonfield Thanks very much for your work, I get your message. The indel realignment was the speed-limiting step, and there may be better methods to evaluate indels. |
Long term, I think it needs a bigger revamp. What works for small indels and long indels differs. We probably want two different algorithms. When the code was originally written we only had short reads available, which also limited some of what it could do and focused the design in ways that aren't favourable for long read technologies. Some of that has been compensated for, such as only aligning portions of reads, but it's still not ideal. |
@jkbonfield Hi, any efficient tools suggested to call SNP/InDel from HiFi data? It seems that |
The long and the short of it, is Google DeepVariant wipes the floor with us! It's probably even slower though (but I haven't tested it). For bcftools, we're not so great on HiFi data currently, but I have a work in progress improvement and I think bcftools was already ahead of GATK on HiFi. See #2045 (comment) You'll see from that though how far ahead DV is! I'm not sure what we're missing here really. Possibly some very cunning filtering and smoothing of qualities? |
Yes, I see that |
Me neither. The DeepVariant figures in my graph just came from the DV supplied VCF file that came with HG002. It's also quite possible it was trained on that same data set and it performs worse on others, in which case it's hardly a fair test. I've never run it myself. |
Hi, I used this command to call variants with bcftools, but it seems to run much more slowly than before. It has been running for over 18 hours and has only called 11Mb of sequence on one chromosome. As far as I remember, it previously could call the whole genome, or half of it, in 18 hours.
reference=ref.fa
bcftools mpileup --threads 10 -O b -m 2 -q 30 -Q 20 -a AD,ADF,ADR,DP,SP,SCR -f ${reference} bam1 bam2 bam3 bam4 bam5 bam6 | bcftools call --threads 10 -m -v -O b --ploidy 2 -o results.bcf
Could you please give me some advice on how to speed it up?
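Separately from the indel slowdown discussed in the comments, one general way to reduce wall-clock time for a command like this is to split the genome into regions and run one pipeline per region in parallel, since to my knowledge `--threads` in bcftools only parallelises compression of the output. The sketch below just prints the per-region commands; the contig lengths, chunk size, BAM names and output prefix are placeholders, and the helper names are hypothetical.

```python
def chunk_regions(contig_lengths, chunk_size):
    """Yield 1-based inclusive region strings such as 'chr1:1-5000000'."""
    for name, length in contig_lengths.items():
        for start in range(1, length + 1, chunk_size):
            end = min(start + chunk_size - 1, length)
            yield f"{name}:{start}-{end}"


def build_command(region, reference, bams, out_prefix="results"):
    """Build one mpileup | call pipeline restricted to a single region.
    Options mirror the command in this post; -r limits mpileup to the region
    and -O u passes uncompressed BCF through the pipe."""
    safe = region.replace(":", "_").replace("-", "_")
    return (
        f"bcftools mpileup -r {region} -O u -m 2 -q 30 -Q 20 "
        f"-a AD,ADF,ADR,DP,SP,SCR -f {reference} {' '.join(bams)} | "
        f"bcftools call -m -v -O b --ploidy 2 -o {out_prefix}.{safe}.bcf"
    )


if __name__ == "__main__":
    # Placeholder contig lengths; in practice read them from the .fai index.
    contigs = {"chr1": 12_000_000, "chr2": 7_500_000}
    bams = ["bam1", "bam2", "bam3", "bam4", "bam5", "bam6"]
    for region in chunk_regions(contigs, chunk_size=5_000_000):
        print(build_command(region, "ref.fa", bams))
```

The printed lines can be fed to whatever job runner is available, and the per-region BCFs joined afterwards in order with `bcftools concat`. This does not make the indel realignment itself any cheaper; it only spreads the work across cores.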