
Basecalling Settings

mattloose edited this page Oct 27, 2020 · 4 revisions

Basecalling

The speed of basecalling is crucial for read until. Specifically, basecalling a batch of data must take no longer than it takes to generate the next batch of data to be processed. If it takes longer, lag builds up in the system and processing ultimately falls behind.
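The timing constraint can be illustrated with a minimal sketch. It assumes that data arrives in batches at a constant interval and that batches are basecalled strictly one after another. The function name, the interval and the timings are hypothetical values chosen for illustration, not measurements from readfish or guppy.

```python
def accumulated_lag(basecall_times, batch_interval):
    """Return the lag in seconds after each batch has been basecalled.

    basecall_times: seconds spent basecalling each batch (hypothetical values).
    batch_interval: seconds needed to generate one batch (assumed constant).
    """
    lag = 0.0
    history = []
    for t in basecall_times:
        # Lag grows when basecalling is slower than data generation and
        # shrinks (never below zero) when it is faster.
        lag = max(0.0, lag + t - batch_interval)
        history.append(lag)
    return history


# Keeping up: every batch is basecalled within the interval.
print(accumulated_lag([0.3, 0.4, 0.35], batch_interval=0.4))
# Falling behind: every batch takes longer than the interval, so lag grows.
print(accumulated_lag([0.5, 0.5, 0.5], batch_interval=0.4))
```

In the second call the lag increases with every batch, which is the situation a slower basecalling model can create.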

One of the benefits of readfish is that we can use a variety of different basecallers by switching the configuration in our toml file.

At the time of writing we are using the fast model, so the suggested toml file is configured as:

[caller_settings]
config_name = "dna_r9.4.1_450bps_fast"
host = "127.0.0.1"
port = 5555
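Before starting a run it can be useful to confirm that something is listening at the configured host and port. The sketch below uses only the Python standard library and assumes the basecall server accepts plain TCP connections at that address. The helper name is hypothetical and is not part of readfish. A successful connection does not prove that the listener is guppy or that the named config is loaded.

```python
import socket


def server_is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Host and port copied from the [caller_settings] block above.
    print(server_is_listening("127.0.0.1", 5555))
```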

With guppy version 4.0.11, the fast model gives the following performance on our GridION mk1 after 30 minutes of running. Recall that we are selecting chromosomes 21 and 22 from a background of long reads (all tables were generated using readfish summary).

Guppy 4.0.11 FAST model 30 minutes

contig  number      sum   min     max    std   mean  median    N50
  chr1    2045  8031506   220  318254  15566   3927    1476  26513
 chr10    1109  4723969   263  261207  14559   4260    1592  27313
 chr11    1232  4754809   213  304465  16228   3859    1314  38943
 chr12    1050  4526674   261  166256  12536   4311    1508  23582
 chr13     684  3126069   184  299397  18358   4570    1573  35034
 chr14     796  4263462   242  249680  18806   5356    1502  37446
 chr15     994  5240288   245  187955  17111   5272    1489  48036
 chr16     429  2702573   233  180260  16343   6300    1841  33347
 chr17     574  3453521   271  388105  23709   6017    1482  69464
 chr18     538  3873005   349  274407  24659   7199    1424  70263
 chr19     483  2625211   248  163416  16557   5435    1564  43457
  chr2    1402  8174215   220  303798  19553   5830    1526  42543
 chr20     342  2214472   225  209686  20661   6475    1456  55394
 chr21      57  1758058   347  254718  46708  30843    9409  83729
 chr22      69   851125   447   77401  15509  12335    5952  25811
  chr3    2119  7585521   197  325017  14512   3580    1412  25708
  chr4    1367  8772764   211  307864  23260   6418    1605  64709
  chr5    1527  6629025   221  223762  15298   4341    1385  40421
  chr6    1450  6101223   236  260918  15773   4208    1514  28634
  chr7    1291  5812463   155  350180  16907   4502    1540  34863
  chr8    1001  4849272   214  317181  19480   4844    1420  50186
  chr9    1113  5505104   219  485498  21692   4946    1504  40112
  chrM      84   298104   276   16409   4132   3549    1694   9137
  chrX     941  5713488   213  320496  22251   6072    1409  65532
  chrY       5   138365  2043   87799  35701  27673   14315  87799
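The N50 column may be less familiar than the mean or median. The sketch below implements the usual definition of N50; that readfish summary computes it in exactly this way is an assumption. The read lengths in the example are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths.

    N50 is the length L such that reads of length >= L together
    contain at least half of all sequenced bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() requires at least one read length")


# Hypothetical read lengths: total 32 bases, and 8 + 8 = 16 reaches half.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # prints 8
```

Under this definition a single very long read can set the N50 on its own. In the chrY row above, the longest read (87799 bases) exceeds half of the 138365 total, which is consistent with the reported N50 of 87799.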

There are many caveats to this experiment, in particular that we are running in playback mode, which places the system under more strain.

However, switching to the latest high accuracy model (in Guppy 4.0.11) gives significantly worse performance for read until:

Guppy 4.0.11 HAC model 30 minutes

contig  number      sum   min     max    std   mean  median    N50
  chr1     733  8285664   220  406507  25659  11304    3180  42161
 chr10     481  4678846   267  244576  20463   9727    3683  27964
 chr11     551  4802221   240  304389  22116   8715    2629  40606
 chr12     453  4592356   262   98353  16855  10138    3606  35292
 chr13     187  3172222   229  299446  34624  16964    4877  61316
 chr14     309  4297006   273  249680  27665  13906    4391  39615
 chr15     499  5235844   238  187959  21503  10493    3502  44867
 chr16     167  2716865   233  180217  26613  16269    4395  48020
 chr17     325  3481102   258  164804  21736  10711    3388  42301
 chr18     263  3798711   282  248363  30222  14444    3656  53406
 chr19     265  2637241   197   92922  17029   9952    2941  43151
  chr2     620  8242404   220  303798  26675  13294    4113  43128
 chr20     197  2054304   225  120132  17918  10428    3924  36266
 chr21      62  1886326   347  254678  43852  30425   11500  77745
 chr22      76   867581   258   77474  15408  11416    4956  23706
  chr3     811  7560099   192  278385  21312   9322    3227  38050
  chr4     586  8735471   211  307864  32165  14907    3635  59877
  chr5     647  6704169   221  301482  22082  10362    3133  40230
  chr6     506  6125346   213  252947  25217  12105    3764  43742
  chr7     482  5865795   155  151215  21287  12170    4046  40745
  chr8     446  4878406   232  334993  26734  10938    2792  46069
  chr9     488  5584655   219  246513  27296  11444    3592  43346
  chrM      53   300841   589   16444   4908   5676    3389  11734
  chrX     467  5773783   192  184122  24322  12364    3060  53785
  chrY       6   144364  2535   66157  23474  24061   19538  31673

This compares unfavourably with our previous experiments using guppy 3.4.5, which gave better performance as measured by the median read lengths:

Guppy 3.4.5 HAC model approx 30 minutes

contig  number      sum   min     max    std   mean  median     N50
  chr1    1326  4187614   142  224402  14007   3158     795   48026
 chr10     804  2843010   275  248168  15930   3536     842   47764
 chr11     672  2510741   184  310591  18572   3736     841   73473
 chr12     871  2317742   292  116848   9929   2661     825   37159
 chr13     391  1090012   227  189103  12690   2788     781   41292
 chr14     469  2323329   275  251029  20107   4954     830   68887
 chr15     753  2189326   180  154830  12371   2907     812   40686
 chr16     522  1673329   218  166941  12741   3206     862   39258
 chr17     484  1609208   191  169651  15777   3325     816   73019
 chr18     483  1525953   230  252901  14414   3159     813   40090
 chr19     664  1898289   249  171742  13181   2859     820   46271
  chr2    1474  4279420   234  222310  13090   2903     820   43618
 chr20     489  1622910   229  171322  13223   3319     887   33669
 chr21      32  1221224  1053  223477  56923  38163   13238  112200
 chr22      47   724863   244  184049  28113  15423    6781   33464
  chr3    1142  3554814   243  247771  15173   3113     760   62683
  chr4    1224  4402210   210  221084  15769   3597     820   66686
  chr5    1371  4495150   205  330821  16699   3279     801   65394
  chr6     978  2725891   246  146169  10995   2787     791   37791
  chr7    1039  3027136   166  263043  14705   2914     798   56567
  chr8     848  2581406   238  229150  15618   3044     772   44498
  chr9     893  3028224   259  247975  16011   3391     802   54953
  chrM     144   216047   215   20731   2562   1500     864    1391
  chrX     868  3124552   238  192451  15594   3600     832   49047
  chrY       8    47071   510   31654  10743   5884    1382   31654    

Changes made to guppy between versions 3.4.5 and 4.0.11 include the introduction of a method that automatically determines the scaling of read signal, based on the assumption that a read begins with the adapter sequence. Whilst this assumption is true for a live run, it is not the case for a playback run. We therefore deactivated the adapter detection and tested playback on both FAST and HAC models.

Guppy 4.0.11 FAST model 30 minutes No Adapter Scaling

contig  number      sum  min     max    std   mean  median     N50
  chr1    3053  7849544  207  316449  14106   2571     653   55618
 chr10    1658  4471927  154  255793  14379   2697     666   67711
 chr11    1907  4613111  204  282876  13041   2419     656   54966
 chr12    2120  4334063  248  171080   9222   2044     665   36669
 chr13     901  3026257  229  282446  18711   3359     643  111894
 chr14    1364  4308992  243  196186  16013   3159     660   89564
 chr15    2305  5079309  206  247184  11871   2204     645   48634
 chr16    1102  2603298  169  231980  13117   2362     649   54676
 chr17    1073  3381299  193  380211  19403   3151     658   94107
 chr18    1323  3671357  250  270706  15743   2775     651   71862
 chr19    1174  2512558  198  218797  11756   2140     653   52515
  chr2    3243  7913548  211  302877  13731   2440     648   54667
 chr20     897  2120284  223  209421  12436   2364     665   57498
 chr21      53  1600159  352  254804  50881  30192    9346   83729
 chr22      78   905753  259   77481  15722  11612    5048   23706
  chr3    2907  7328374  116  307554  14944   2521     652   71390
  chr4    3011  8590821  211  318854  16496   2853     653   71679
  chr5    2650  6477399  164  293380  13622   2444     642   62821
  chr6    2429  5855306  179  228499  13485   2411     654   69407
  chr7    2394  5601925  239  406765  15404   2340     652   61445
  chr8    2245  4745428  129  363992  13443   2114     658   38961
  chr9    1699  5401114  158  481317  20247   3179     644   82342
  chrM     267   273714  263   12193   1498   1025     693     845
  chrX    2204  5554044  213  315880  14960   2520     654   71840
  chrY      11    51956  608   31673   9629   4723     880   31673

Guppy 4.0.11 HAC model 30 minutes No Adapter Scaling

contig  number      sum   min     max    std   mean  median    N50
  chr1    3013  7962927   197  271757   6733   2643    2299   2655
 chr10    1641  4396876   242  131971   4557   2679    2410   2698
 chr11    1876  4727392   270  109019   3702   2520    2310   2636
 chr12    1831  4506902   251   55746   1806   2461    2351   2591
 chr13    1036  3095878   229  193423   8486   2988    2373   2791
 chr14    1556  4184954   232  180901   5438   2690    2390   2677
 chr15    1958  5245017   243  146558   5834   2679    2271   2666
 chr16    1066  2678811   226  100585   3180   2513    2364   2585
 chr17    1300  3434603   248  100618   4581   2642    2314   2666
 chr18    1456  3718745   313  153856   4601   2554    2312   2603
 chr19    1044  2543057   250  147737   5795   2436    2101   2482
  chr2    3323  8078612   202   98317   3159   2431    2248   2570
 chr20     771  2140444   225  165325   6099   2776    2522   2816
 chr21      55  1753684   347  254718  51230  31885    9409  84350
 chr22      72   880197   257   77474  16022  12225    5094  31103
  chr3    2919  7487377   197  166069   5487   2565    2315   2658
  chr4    3217  8748614   211  258923   6589   2719    2342   2696
  chr5    2589  6745847   221  289687   6848   2606    2280   2629
  chr6    2395  6001052   257  110848   3591   2506    2308   2611
  chr7    2273  5755082   242   75645   2756   2532    2392   2637
  chr8    1845  4830634   235  369405   8924   2618    2300   2686
  chr9    2055  5477756   219  220080   6462   2666    2290   2666
  chrM     133   296723   283    5449    939   2231    2290   2604
  chrX    2235  5652742   192  150517   4362   2529    2313   2595
  chrY       9    47385  1144   31673   9925   5265    1967  31673

As can be seen, deactivating adapter scaling leads to a significant reduction in the median rejected read length. The reduction in basecalling quality is minimal, so the resulting basecalls are still more than sufficient to provide an accurate mapping of the read.
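The comparison can be made explicit by pulling the median column out of the summary tables. The sketch below is illustrative only: it uses three rows copied from each of the two Guppy 4.0.11 FAST tables above, and the helper name is hypothetical. It labels chr21 and chr22 as targets and everything else as rejected.

```python
TARGETS = {"chr21", "chr22"}

# Rows copied from the two Guppy 4.0.11 FAST tables above
# (columns: contig number sum min max std mean median N50).
WITH_SCALING = """\
chr1    2045  8031506   220  318254  15566   3927    1476  26513
chr21     57  1758058   347  254718  46708  30843    9409  83729
chr22     69   851125   447   77401  15509  12335    5952  25811
"""
NO_SCALING = """\
chr1    3053  7849544  207  316449  14106   2571     653   55618
chr21     53  1600159  352  254804  50881  30192    9346   83729
chr22     78   905753  259   77481  15722  11612    5048   23706
"""


def median_by_contig(table_text):
    """Map contig name -> median read length (the 8th column)."""
    medians = {}
    for line in table_text.splitlines():
        fields = line.split()
        if len(fields) == 9:  # skip blank or malformed lines
            medians[fields[0]] = int(fields[7])
    return medians


for label, text in [("adapter scaling", WITH_SCALING),
                    ("no adapter scaling", NO_SCALING)]:
    for contig, median in median_by_contig(text).items():
        kind = "target" if contig in TARGETS else "rejected"
        print(f"{label:>18}  {contig:<5}  {kind:<8}  median={median}")
```

For these rows the chr1 median falls from 1476 to 653 when adapter scaling is removed, while the chr21 median is almost unchanged (9409 against 9346).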

At this time we recommend that users choose the FAST model for running readfish. Users may wish to experiment with deactivating adapter scaling. To do this, users can edit the appropriate config files in the guppy data directory. We strongly recommend saving these edited files under a new name. This allows you to run different configs for readfish basecalling without interrupting normal MinKNOW/guppy basecalling.

To do this, edit the config file so that the line:

as_model_file                      = adapter_scaling_dna_r9.4.1_min.jsn

becomes:

#as_model_file                      = adapter_scaling_dna_r9.4.1_min.jsn

This will remove the adapter scaling.
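The edit can also be scripted so that the original config is never modified. The following is a minimal sketch under stated assumptions: both function names are hypothetical, the file names in the usage note are examples only, and the location of the guppy data directory depends on the installation.

```python
import re
from pathlib import Path


def disable_adapter_scaling(cfg_text):
    """Comment out any active as_model_file line in guppy config text."""
    # Prefix '#' to lines that start with as_model_file (optionally
    # indented). Lines that are already commented are left unchanged.
    return re.sub(r"^([ \t]*)(as_model_file\b)", r"\1#\2",
                  cfg_text, flags=re.MULTILINE)


def write_edited_copy(src, dst):
    """Write an edited copy of src to dst, leaving src untouched."""
    src, dst = Path(src), Path(dst)
    if dst.exists():
        raise FileExistsError(f"refusing to overwrite {dst}")
    dst.write_text(disable_adapter_scaling(src.read_text()))


example = "as_model_file                      = adapter_scaling_dna_r9.4.1_min.jsn\n"
print(disable_adapter_scaling(example), end="")
```

A call such as write_edited_copy("dna_r9.4.1_450bps_fast.cfg", "dna_r9.4.1_450bps_fast_noscale.cfg") would then produce the renamed copy; both file names here are assumptions. The config_name entry in the readfish toml would presumably need to refer to the new file, which is also an assumption rather than something stated on this page.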