
bcftools run over 18 hours only calling 11Mb sequence of one chromosome #2000

Open
leon945945 opened this issue Sep 21, 2023 · 19 comments

@leon945945

Hi, I used the command below to call variants with bcftools, but it seems to be running much slower than before: it has been running for over 18 hours and has only called an 11 Mb stretch of one chromosome. As I recall, it used to call the whole genome, or at least half of it, in 18 hours.

$bcftools
Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)
License: GNU GPLv3+, due to use of the GNU Scientific Library
Version: 1.17-23-gdaed344 (using htslib 1.16-22-g8e43fb0)

reference=ref.fa
bcftools mpileup --threads 10 -O b -m 2 -q 30 -Q 20 -a AD,ADF,ADR,DP,SP,SCR -f ${reference} bam1 bam2 bam3 bam4 bam5 bam6 | bcftools call --threads 10 -m -v -O b --ploidy 2 -o results.bcf

Could you please give me some advice on how to speed it up?

@pd3
Member

pd3 commented Sep 21, 2023

It's possible you encountered a bug:

  • Can you try with the latest github version please?
  • If the problem persists, can you test if it's just slow, or it never gets past a certain coordinate?
  • In debugging the problem, can you try splitting the mpileup and call steps to identify which one is the culprit?
  • If it's a certain region that is problematic, any chance you could share a slice of your data?
  • What kind of data is it, short reads or long reads?
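The splitting suggested in the checklist above can be sketched as two separately timed stages (a hedged sketch: bam1/bam2, pileup.bcf and results.bcf are placeholder names, and the bcftools stages only run if bcftools and the inputs actually exist):

```shell
# Hypothetical sketch: run mpileup and call as two separately timed stages to
# see which one is the bottleneck. File names here are placeholders.
reference=ref.fa
if command -v bcftools >/dev/null 2>&1 && [ -e bam1 ]; then
    time bcftools mpileup --threads 10 -O b -m 2 -q 30 -Q 20 \
        -a AD,ADF,ADR,DP,SP,SCR -f "$reference" bam1 bam2 -o pileup.bcf
    time bcftools call --threads 10 -m -v -O b --ploidy 2 \
        -o results.bcf pileup.bcf
else
    echo "bcftools or input BAMs not present; commands shown for reference"
fi
```

Timing each stage on its own shows immediately whether mpileup or call is where the hours are going.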

@leon945945
Author

Thanks for your reply.
Q1: I'll try the latest version.
Q2: It's just slow; the BCF output file is still growing.
Q3: I did split mpileup and call before, and they were normal. It started getting weird in recent days, maybe because I changed some libraries while installing other software; I'm not sure.
Q4: It's not a region-specific problem.
Q5: HiFi reads.

@leon945945
Author

The latest GitHub version still ran slowly in my environment: only 500 kb called in 4-5 hours.

@jkbonfield
Contributor

Out of interest, is it still very slow with the -I option to disable indel calling? I found before that this was very slow on ONT data, and generally pretty poor too, as it doesn't cope well with the large indel error rate. It ought to be OK on HiFi data, but this would at least tell us which bit of code is the slow part. Although I'd be surprised if it wasn't the indel caller causing the issues.

You may also consider using -X pacbio-ccs (although I just noticed an error in the usage text: it claims the profile enables -D, but that option changed meaning; the profile does not use full BAQ but enables partial BAQ, which subsequently became the default, so -D should be removed from the usage statement for pacbio-ccs). I doubt enabling the pacbio-ccs config will speed it up though. Note that if you do enable it, put any options such as -I or -B after the -X parameter, or they may get overwritten by the catch-all config.
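The ordering caveat above can be sketched like this (hedged: ref.fa, input.bam and pileup.bcf are placeholders, and the command only runs if bcftools and the input exist):

```shell
# -X pacbio-ccs is a catch-all profile, so per-option overrides such as
# -B (disable BAQ) must come AFTER -X, or the profile overwrites them.
cmd="bcftools mpileup -X pacbio-ccs -B -f ref.fa input.bam"
# Only run it if bcftools and the inputs actually exist:
if command -v bcftools >/dev/null 2>&1 && [ -e input.bam ]; then
    $cmd -o pileup.bcf
fi
```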

@leon945945
Author

@jkbonfield Its speed returned to normal with the -I option, but indels are still important variants, so it's unreasonable to rule them out. And you are right: -X pacbio-ccs did not speed it up. Thanks very much for your help.

@jkbonfield
Contributor

Agreed, indels matter. It was more a useful guide to where the problem lay. Sadly it's what I expected though. The indel realignment code simply was never designed for long reads, and the band size can get unwieldy, making it run glacially on long reads. :( Does disabling BAQ (-B, while still calling indels, i.e. without -I) speed it up significantly?

It's been a while since I looked at the code, but I think I did fix the worst slowdowns, such as only doing realignment within a portion of the read rather than the entire thing (ie assuming it's short).

@pd3
Member

pd3 commented Sep 21, 2023

While experimenting, can you also try --indels-2.0, if that makes any difference in speed...

@leon945945
Author

> While experimenting, can you also try --indels-2.0, if that makes any difference in speed...

Hi, I tried --indels-2.0; bcftools still runs slowly. But it performs well on NGS data. Are there properties of HiFi data that make bcftools run slowly? If so, what are the potential reasons?

@pd3
Member

pd3 commented Oct 12, 2023

Is it possible to share a small anonymized BAM slice for debugging?

@leon945945
Author

The alignment of the first 300 kb of chr1 was generated; I hope it is helpful to you.
test.chr1-300kb.bam.txt

@jkbonfield
Contributor

Thank you for the data. A very quick test shows it was spending almost all of its time in probaln_glocal, which is basically the realignment used by the indel caller. As expected, using -I to turn off indel calls made the code easily 100x quicker.

Now obviously you probably want indel calls, but do you want ALL of them irrespective of likelihood? This is quite deep data. We initiate a candidate indel when just 2 reads contain an indel (-m 2) or 5% of the total do (-F 0.05 by default). Both of these can be substantially increased to speed up the code. I do see there are fewer called indels, but without manual inspection I can't say whether or not the ones that get missed were valid. It's something you should probably play around with, however. You can use bcftools isec (or just some bcftools query -i 'type="INDEL"' -f "%POS" and sort | comm commands) to find calls made by the original parameters that aren't in the filtered run, and validate manually whether you think you'd be missing out on real calls.
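The query-and-comm recipe above can be demonstrated self-contained (the printf lines stand in for output from `bcftools query -i 'type="INDEL"' -f '%POS\n'` on the default and the filtered call sets; the positions are made up):

```shell
# Stand-in position lists for two call sets (default vs stricter thresholds):
printf '%s\n' 125000 125213 126355 126530 | sort > default.pos
printf '%s\n' 125000 125213 126355        | sort > filtered.pos
# comm -23 lists positions only in the first file, i.e. candidate indels
# lost by the stricter -m/-F settings:
comm -23 default.pos filtered.pos    # -> 126530
```

Each position this prints is a call worth inspecting manually to decide whether the stricter thresholds are discarding real variants.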

I note that with the default parameters, bcftools mpileup called some 2021 candidate indels, of which only 474 (23%) made it past bcftools call. With -m 3 -F 0.1 those drop to 923 and 436, with -m 3 -F 0.2 to 672 and 430, and with -m 5 -F 0.2 to 581 and 391. It's still not going to be quick, but at that point it's already three times as fast as with the previous parameters (albeit 10x slower than -I). The question is basically whether the calls you'd no longer get matter.

Also @pd3 - I tried --indels-2.0 and it errors out with an assertion failure.

@jkbonfield
Contributor

Also for completeness, I tried the rejected PR #1679 on this data. (See comment #1679 (comment))

bcftools-1.18 (-m 2 -F 0.05) took 51s. Mpileup gave 2021 indels, call filtered to 474 indels.
bcftools-PR1679 (-m 2 -F 0.05) took 24s, giving 2111 / 1229 indels (so far more indels after call).

Focusing on PR1679:
With -X pacbio-ccs it's 20s and 1920/1201 calls.
With -X pacbio-ccs -m3 -F0.2 it's 16s and 1117 / 1058 indels called.
With -X pacbio-ccs -m5 -F0.3 it's 13s and 938/907 indels called.

Comparing -X pacbio-ccs -m3 -F0. against your parameters with bcftools 1.18: 474 calls from the original, 113 of them unique to that output; 1058 calls from the PR, 697 unique to it; and 361 called by both versions.

Eyeballing a few in a narrow region:

| pos | real? | called v1.18 | called PR1679 |
| --- | --- | --- | --- |
| 124965 | yes | no | |
| 125000 | yes | yes | yes |
| 125213 | yes | no | yes |
| 125969 | no | yes | yes |
| 126355 | yes | no | yes |
| 126459 | yes | no | yes |
| 126530 | yes | yes | no |
| 126680 | yes | no | yes |
| 126730 | yes | no | yes |
| 127035 | yes | no | yes |
| 127271 | yes | no | yes |
| 127315 | yes | no | yes |
| 127343 | yes | no | yes |
| 127522 | no | no | yes |
| 127571 | yes | no | yes |
| 127688 | yes? | no | yes |
| 127879 | yes | no | yes |
| 127943 | yes | no | yes |
| 128355 | yes | no | yes |
| 128366 | yes? | no | yes |
| 128422 | yes | no | no |
| 128510 | yes | no | yes |
| 128536 | yes | no | yes |
| 128658 | yes | no | yes |
| 128730 | yes | yes | yes |
| 128831 | yes | yes | yes |
| 128917 | yes | no | yes |
| 128931 | yes | no | yes |
| 128942 | yes | no | yes |
| 128955 | yes | no | yes |

Basically there are a few cases where 1.18 called correctly and the PR missed, but the vast majority of new calls in the PR are genuine. So I think this demonstrates that the PR shouldn't have been rejected: it clearly does a better job of calling AND is considerably faster to boot.

@jkbonfield
Contributor

Back to the develop bcftools:

I've argued (unsuccessfully) before that the BAQ HMM in probaln_glocal is not the correct tool for evaluating the score of an alignment. It's a sledgehammer. What we need is to know the candidate indel target sequences and to align the query against each in turn to see how well it matches, irrespective of whether there are STRs and other such features in there (we can account for those elsewhere), so it ought to use a proper alignment algorithm instead of the BAQ HMM.

I decided to do a VERY quick hack and see:

```diff
@@ -531,8 +554,36 @@ static int bcf_cgp_align_score(bam_pileup1_t *p, bcf_callaux_t *bca,

     // The bottom 8 bits are length-normalised score while
     // the top bits are unnormalised.
-    sc = probaln_glocal(ref2 + tbeg - left, tend - tbeg + type,
-                        query, qend - qbeg, qq, &apf, 0, 0);
+    // qq ignored
+    int8_t mat[25] = {
+         1, -3, -3, -3, 0,
+        -3,  1, -3, -3, 0,
+        -3, -3,  1, -3, 0,
+        -3, -3, -3,  1, 0,
+         0,  0,  0,  0, 0
+    };
+
+    kswr_t ksw =
+        ksw_align(qend - qbeg, query, // query
+                  tend - tbeg + type, ref2 + tbeg - left, // target
+                  5,   // number of residue types
+                  mat, // 5 x 5 scoring,
+                  5,   // gapo,
+                  2,   // gape,
+                  KSW_XSTART,   // xtra
+                  //type + 3, // no band avail?
+                  NULL // qry
+                  );
+
+    sc = 3*((qend - qbeg)-ksw.score)+ksw.qb+((qend - qbeg)-ksw.qe);
```
It's crude, but it attempts to give a score in vaguely the same orientation and scale. This doesn't use quality values at all, and instead opts for the simple ksw algorithm.
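The score mapping in the hack above can be sanity-checked numerically with made-up values (fields: qlen ksw.score ksw.qb ksw.qe, where qlen = qend - qbeg):

```shell
# sc = 3*(qlen - score) + qb + (qlen - qe): higher alignment scores and
# fuller query coverage both push sc towards 0, and mismatches or clipped
# query ends inflate it. Input values here are invented for illustration.
echo "100 80 2 95" | awk '{ print 3*($1 - $2) + $3 + ($1 - $4) }'   # -> 67
```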

This sped up mpileup from 50s to 9s. It gave 535 indels (vs 474 in the original), of which 400 are shared. Of the 74 missed, 35 were called by my PR and 39 not (so close to half). Of the 135 extra not seen in the original code, 119 match the PR's calls. Given my previous assessments, that appears to imply it works well.

To me this validates the notion that BAQ is just a plain poor choice for evaluating alignment quality. It is enormously slow while not actually being very good at telling good from bad alignment, plus it causes a lot of missed true variants due to rejecting alignments based on them containing repeats. We'd probably want to retrofit some score adjustment technique based on the quality scores too.

I'm being pulled in many directions at the moment and I'm not sure I'd have time to do a full evaluation. Plus I have little desire, given that my previous efforts at improving bcftools indel calling were ignored in the absence of an upfront assertion that changes would be accepted (I spent months improving the indel caller before with no gain). However, I strongly reiterate my suggestion that using BAQ as a proxy for alignment validity is not a good idea; we can do so much better, and so much faster, by using a proper alignment algorithm. (Ksw was just the easiest to pick up and go, but there are any number of alternatives out there.)

@leon945945
Author

@jkbonfield Thanks very much for your work; I get the message. The indel realignment was the rate-limiting step, and there may be better methods for evaluating indels.

@jkbonfield
Contributor

Long term, I think it needs a bigger revamp. What works for small indels differs from what works for long indels, so we probably want two different algorithms. When the code was originally written, only short reads were available, which limited some of what it could do and focused the design in ways that aren't favourable for long-read technologies.

Some of that has been compensated for, such as only aligning portions of reads, but it's still not ideal.

@leon945945
Author

@jkbonfield Hi, can you suggest any efficient tools for calling SNPs/indels from HiFi data? It seems that GATK4 is not optimized for HiFi data either? Thank you.

@jkbonfield
Contributor

The long and the short of it is that Google DeepVariant wipes the floor with us! It's probably even slower though (but I haven't tested it).

For bcftools, we're not so great on HiFi data currently, but I have a work in progress improvement and I think bcftools was already ahead of GATK on HiFi. See #2045 (comment)

You'll see from that though how far ahead DV is! I'm not sure what we're missing here really. Possibly some very cunning filtering and smoothing of qualities?

@leon945945
Author

Yes, I see that DeepVariant performs far better on the 53x PacBio HiFi data.
Actually, I have tried DeepVariant several times on NGS data and compared its results with those of GATK; what surprised me was that DeepVariant called far fewer variants than GATK. That confused me a lot, and some genotypes looked weird. I don't know how to get DeepVariant to perform at its best on variant calling.

@jkbonfield
Contributor

Me neither. The DeepVariant figures in my graph just came from the DV-supplied VCF file that came with HG002. It's also quite possible it was trained on that same data set and performs worse on others, in which case it's hardly a fair test. I've never run it myself.
