
bcftools run over 18 hours only calling 11Mb sequence of one chromosome #2000

Open
leon945945 opened this issue Sep 21, 2023 · 19 comments

@leon945945

Hi, I used the command below to call variants with bcftools, but it seems to be running much slower than before: it has been running for over 18 hours and has only called an 11 Mb stretch of one chromosome. As I recall, it used to call the whole genome, or at least half of it, in 18 hours.

$bcftools
Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)
License: GNU GPLv3+, due to use of the GNU Scientific Library
Version: 1.17-23-gdaed344 (using htslib 1.16-22-g8e43fb0)

reference=ref.fa
bcftools mpileup --threads 10 -O b -m 2 -q 30 -Q 20 -a AD,ADF,ADR,DP,SP,SCR -f ${reference} bam1 bam2 bam3 bam4 bam5 bam6 | bcftools call --threads 10 -m -v -O b --ploidy 2 -o results.bcf

Could you please give me some advice on how to speed it up?

@pd3
Member

pd3 commented Sep 21, 2023

It's possible you encountered a bug:

  • Can you try with the latest github version please?
  • If the problem persists, can you test if it's just slow, or it never gets past a certain coordinate?
  • In debugging the problem, can you try splitting the mpileup and call steps to identify which one is the culprit?
  • If it's a certain region that is problematic, any chance you could share a slice of your data?
  • What kind of data is it, short reads or long reads?
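The splitting suggested in the checklist above can be sketched as two separately timed stages (a hedged sketch: bam1/bam2, pileup.bcf and results.bcf are placeholder names, and the bcftools stages only run if bcftools and the inputs actually exist):

```shell
# Hypothetical sketch: run mpileup and call as two separately timed stages to
# see which one is the bottleneck. File names here are placeholders.
reference=ref.fa
if command -v bcftools >/dev/null 2>&1 && [ -e bam1 ]; then
    time bcftools mpileup --threads 10 -O b -m 2 -q 30 -Q 20 \
        -a AD,ADF,ADR,DP,SP,SCR -f "$reference" bam1 bam2 -o pileup.bcf
    time bcftools call --threads 10 -m -v -O b --ploidy 2 \
        -o results.bcf pileup.bcf
else
    echo "bcftools or input BAMs not present; commands shown for reference"
fi
```

Timing each stage on its own shows immediately whether mpileup or call is where the hours are going.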

@leon945945
Author

Thanks for your reply.
Q1: I'll try the latest version.
Q2: It's just slow; the BCF output file is still growing.
Q3: I did split mpileup and call before, and they were normal. It started getting weird in recent days, maybe because I changed some libraries while installing other software; I'm not sure.
Q4: It's not a region-specific problem.
Q5: HiFi reads.

@leon945945
Author

The latest GitHub version still ran slowly in my environment: only 500 kb called in 4-5 hours.

@jkbonfield
Contributor

Out of interest, is it still very slow with the -I option to disable indel calling? I found before that this was very slow on ONT data, and generally pretty poor too, as it doesn't cope well with the large indel error rate. It ought to be OK on HiFi data, but this would at least tell us which bit of code is the slow part. Although I'd be surprised if it wasn't the indel caller causing the issues.

You may also consider using -X pacbio-ccs (although I just noticed an error in the usage text: it claims the profile enables -D, but that option changed meaning; the profile does not use full BAQ but enables partial BAQ, which subsequently became the default, so -D should be removed from the usage statement for pacbio-ccs). I doubt enabling the pacbio-ccs config will speed it up though. Note that if you do enable it, put any options such as -I or -B after the -X parameter, or they may get overwritten by the catch-all config.
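The ordering caveat above can be sketched like this (hedged: ref.fa, input.bam and pileup.bcf are placeholders, and the command only runs if bcftools and the input exist):

```shell
# -X pacbio-ccs is a catch-all profile, so per-option overrides such as
# -B (disable BAQ) must come AFTER -X, or the profile overwrites them.
cmd="bcftools mpileup -X pacbio-ccs -B -f ref.fa input.bam"
# Only run it if bcftools and the inputs actually exist:
if command -v bcftools >/dev/null 2>&1 && [ -e input.bam ]; then
    $cmd -o pileup.bcf
fi
```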

@leon945945
Author

@jkbonfield Its speed returned to normal with the -I option, but indels are still important variants, so it's unreasonable to rule them out. And you are right: -X pacbio-ccs did not speed it up. Thanks very much for your help.

@jkbonfield
Contributor

Agreed, indels matter. It was more a useful guide to where the problem lay. Sadly it's what I expected though. The indel realignment code simply was never designed for long reads, and the band size can get unwieldy, making it run glacially on long reads. :( Does disabling BAQ (-B, while still calling indels, i.e. without -I) speed it up significantly?

It's been a while since I looked at the code, but I think I did fix the worst slowdowns, such as only doing realignment within a portion of the read rather than the entire thing (ie assuming it's short).

@pd3
Member

pd3 commented Sep 21, 2023

While experimenting, can you also try --indels-2.0, if that makes any difference in speed...

@leon945945
Author

> While experimenting, can you also try --indels-2.0, if that makes any difference in speed...

Hi, I tried --indels-2.0; bcftools still runs slowly. But it performs well on NGS data. Are there properties of HiFi data that make bcftools run slowly? If so, what are the potential reasons?

@pd3
Member

pd3 commented Oct 12, 2023

Is it possible to share a small anonymized BAM slice for debugging?

@leon945945
Author

The alignment of the first 300 kb of chr1 was generated; I hope it is helpful to you.
test.chr1-300kb.bam.txt

@jkbonfield
Contributor

Thank you for the data. A very quick test shows it was spending almost all of its time in probaln_glocal, which is basically the realignment used by the indel caller. As expected, using -I to turn off indel calls made the code easily 100x quicker.

Now obviously you probably want indel calls, but do you want ALL of them irrespective of likelihood? This is quite deep data. We initiate a candidate indel when just 2 reads contain an indel (-m 2) or 5% of the total do (-F 0.05 by default). Both of these can be substantially increased to speed up the code. I do see there are fewer called indels, but without manual inspection I can't say whether or not the ones that get missed were valid. It's something you should probably play around with, however. You can use bcftools isec (or just some bcftools query -i 'type="INDEL"' -f "%POS" and sort | comm commands) to find calls made by the original parameters that aren't in the filtered run, and validate manually whether you think you'd be missing out on real calls.
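The query-and-comm recipe above can be demonstrated self-contained (the printf lines stand in for output from `bcftools query -i 'type="INDEL"' -f '%POS\n'` on the default and the filtered call sets; the positions are made up):

```shell
# Stand-in position lists for two call sets (default vs stricter thresholds):
printf '%s\n' 125000 125213 126355 126530 | sort > default.pos
printf '%s\n' 125000 125213 126355        | sort > filtered.pos
# comm -23 lists positions only in the first file, i.e. candidate indels
# lost by the stricter -m/-F settings:
comm -23 default.pos filtered.pos    # -> 126530
```

Each position this prints is a call worth inspecting manually to decide whether the stricter thresholds are discarding real variants.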

I note that with the default parameters, bcftools mpileup called some 2021 candidate indels, of which only 474 (23%) made it past bcftools call. With -m 3 -F 0.1 those drop to 923 and 436, with -m 3 -F 0.2 to 672 and 430, and with -m 5 -F 0.2 to 581 and 391. It's still not going to be quick, but at that point it's already three times as fast as with the previous parameters (albeit 10x slower than -I). The question is basically whether the calls you'd no longer get matter.

Also @pd3 - I tried --indels-2.0 and it errors out with an assertion failure.

@jkbonfield
Contributor

Also for completeness, I tried the rejected PR #1679 on this data. (See comment #1679 (comment))

bcftools-1.18 (-m 2 -F 0.05) took 51s. Mpileup gave 2021 indels, call filtered to 474 indels.
bcftools-PR1679 (-m 2 -F 0.05) took 24s, giving 2111 / 1229 indels (so far more indels after call).

Focusing on PR1679:
With -X pacbio-ccs it's 20s and 1920/1201 calls.
With -X pacbio-ccs -m3 -F0.2 it's 16s and 1117 / 1058 indels called.
With -X pacbio-ccs -m5 -F0.3 it's 13s and 938/907 indels called.

Comparing -X pacbio-ccs -m3 -F0. against your parameters with bcftools 1.18: 474 calls from the original, 113 of them unique to that output; 1058 calls from the PR, 697 unique to it; and 361 called by both versions.

Eyeballing a few in a narrow region:

| pos | real? | called v1.18 | called PR1679 |
| --- | --- | --- | --- |
| 124965 | yes | no | |
| 125000 | yes | yes | yes |
| 125213 | yes | no | yes |
| 125969 | no | yes | yes |
| 126355 | yes | no | yes |
| 126459 | yes | no | yes |
| 126530 | yes | yes | no |
| 126680 | yes | no | yes |
| 126730 | yes | no | yes |
| 127035 | yes | no | yes |
| 127271 | yes | no | yes |
| 127315 | yes | no | yes |
| 127343 | yes | no | yes |
| 127522 | no | no | yes |
| 127571 | yes | no | yes |
| 127688 | yes? | no | yes |
| 127879 | yes | no | yes |
| 127943 | yes | no | yes |
| 128355 | yes | no | yes |
| 128366 | yes? | no | yes |
| 128422 | yes | no | no |
| 128510 | yes | no | yes |
| 128536 | yes | no | yes |
| 128658 | yes | no | yes |
| 128730 | yes | yes | yes |
| 128831 | yes | yes | yes |
| 128917 | yes | no | yes |
| 128931 | yes | no | yes |
| 128942 | yes | no | yes |
| 128955 | yes | no | yes |

Basically there are a few cases where 1.18 called correctly and the PR missed, but the vast majority of new calls in the PR are genuine. So I think this demonstrates that the PR shouldn't have been rejected: it clearly does a better job of calling AND is considerably faster to boot.

@jkbonfield
Contributor

Back to the develop bcftools:

I've argued (unsuccessfully) before that the BAQ HMM in probaln_glocal is not the correct tool for evaluating the score of an alignment. It's a sledgehammer. What we need is to know the candidate indel target sequences and to align the query against each in turn to see how well it matches, irrespective of whether there are STRs and other such features in there (we can account for those elsewhere), so it ought to use a proper alignment algorithm instead of the BAQ HMM.

I decided to do a VERY quick hack and see:

```diff
@@ -531,8 +554,36 @@ static int bcf_cgp_align_score(bam_pileup1_t *p, bcf_callaux_t *bca,

     // The bottom 8 bits are length-normalised score while
     // the top bits are unnormalised.
-    sc = probaln_glocal(ref2 + tbeg - left, tend - tbeg + type,
-                        query, qend - qbeg, qq, &apf, 0, 0);
+    // qq ignored
+    int8_t mat[25] = {
+         1, -3, -3, -3, 0,
+        -3,  1, -3, -3, 0,
+        -3, -3,  1, -3, 0,
+        -3, -3, -3,  1, 0,
+         0,  0,  0,  0, 0
+    };
+
+    kswr_t ksw =
+        ksw_align(qend - qbeg, query, // query
+                  tend - tbeg + type, ref2 + tbeg - left, // target
+                  5,   // number of residue types
+                  mat, // 5 x 5 scoring,
+                  5,   // gapo,
+                  2,   // gape,
+                  KSW_XSTART,   // xtra
+                  //type + 3, // no band avail?
+                  NULL // qry
+                  );
+
+    sc = 3*((qend - qbeg)-ksw.score)+ksw.qb+((qend - qbeg)-ksw.qe);
```
It's crude, but it attempts to give a score in vaguely the same orientation and scale. This doesn't use quality values at all, and instead opts for the simple ksw algorithm.
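The score mapping in the hack above can be sanity-checked numerically with made-up values (fields: qlen ksw.score ksw.qb ksw.qe, where qlen = qend - qbeg):

```shell
# sc = 3*(qlen - score) + qb + (qlen - qe): higher alignment scores and
# fuller query coverage both push sc towards 0, and mismatches or clipped
# query ends inflate it. Input values here are invented for illustration.
echo "100 80 2 95" | awk '{ print 3*($1 - $2) + $3 + ($1 - $4) }'   # -> 67
```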

This sped up mpileup from 50s to 9s. It gave 535 indels (vs 474 in the original), of which 400 are shared. Of the 74 missed, 35 were called by my PR and 39 not (so close to half). Of the 135 extra not seen in the original code, 119 match the PR's calls. Given my previous assessments, that appears to imply it works well.

To me this validates the notion that BAQ is just a plain poor choice for evaluating alignment quality. It is enormously slow while not actually being very good at telling good from bad alignment, plus it causes a lot of missed true variants due to rejecting alignments based on them containing repeats. We'd probably want to retrofit some score adjustment technique based on the quality scores too.

I'm being pulled in many directions at the moment and I'm not sure I'd have time to do a full evaluation. Plus I have little desire, given that my previous efforts at improving bcftools indel calling were ignored in the absence of an upfront assertion that changes would be accepted (I spent months improving the indel caller before with no gain). However, I strongly reiterate my suggestion that using BAQ as a proxy for alignment validity is not a good idea; we can do so much better, and so much faster, by using a proper alignment algorithm. (Ksw was just the easiest to pick up and go, but there are any number of alternatives out there.)

@leon945945
Author

@jkbonfield Thanks very much for your work; I get the message. The indel realignment was the rate-limiting step, and there may be better methods for evaluating indels.

@jkbonfield
Contributor

Long term, I think it needs a bigger revamp. What works for small indels differs from what works for long indels, so we probably want two different algorithms. When the code was originally written, only short reads were available, which limited some of what it could do and focused the design in ways that aren't favourable for long-read technologies.

Some of that has been compensated for, such as only aligning portions of reads, but it's still not ideal.

@leon945945
Author

@jkbonfield Hi, can you suggest any efficient tools for calling SNPs/indels from HiFi data? It seems that GATK4 is not optimized for HiFi data either? Thank you.

@jkbonfield
Contributor

The long and the short of it is that Google DeepVariant wipes the floor with us! It's probably even slower though (but I haven't tested it).

For bcftools, we're not so great on HiFi data currently, but I have a work in progress improvement and I think bcftools was already ahead of GATK on HiFi. See #2045 (comment)

You'll see from that though how far ahead DV is! I'm not sure what we're missing here really. Possibly some very cunning filtering and smoothing of qualities?

@leon945945
Author

Yes, I see that DeepVariant performs far better on the 53x PacBio HiFi data.
Actually, I have tried DeepVariant several times on NGS data and compared its results with those of GATK; what surprised me was that DeepVariant called far fewer variants than GATK. That confused me a lot, and some genotypes looked weird. I don't know how to get DeepVariant to perform at its best on variant calling.

@jkbonfield
Contributor

Me neither. The DeepVariant figures in my graph just came from the DV-supplied VCF file that came with HG002. It's also quite possible it was trained on that same data set and performs worse on others, in which case it's hardly a fair test. I've never run it myself.
