Basecalling Settings
The speed of basecalling is crucial for read until. Specifically, basecalling a batch of data must take no longer than it takes to generate the next batch of data to be processed. If it takes longer, lag builds up in the system and our processing will ultimately fall behind.
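The timing constraint above can be illustrated with a small sketch. It is not part of readfish; the function name, the batch interval and the basecall times are all hypothetical values chosen for illustration. It only shows how lag accumulates once basecalling a batch takes longer than the interval between batches.

```python
def simulate_lag(batch_interval_s, basecall_times_s):
    """Return the accumulated lag (in seconds) after each batch.

    A new batch is assumed to arrive every `batch_interval_s` seconds.
    Lag grows when a batch takes longer than that to basecall and
    shrinks (never below zero) when a batch is basecalled faster.
    """
    lag = 0.0
    history = []
    for basecall_time in basecall_times_s:
        lag = max(0.0, lag + basecall_time - batch_interval_s)
        history.append(lag)
    return history


# Hypothetical numbers: batches arrive every 1.0 s.
keeping_up = simulate_lag(1.0, [0.5, 0.75, 0.5])
falling_behind = simulate_lag(1.0, [1.5, 1.5, 1.5])
print(keeping_up)
print(falling_behind)
```

In the second case every batch adds half a second of delay, so decisions about reads arrive later and later the longer the run continues.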
One of the benefits of readfish is that we can use a variety of different basecallers by switching the configuration in our toml file.
At the time of writing, we are using the fast model, so the suggested toml file is configured as:
[caller_settings]
config_name = "dna_r9.4.1_450bps_fast"
host = "127.0.0.1"
port = 5555
With guppy version 4.0.11, the fast model gives the following performance on our GridION mk1 after 30 minutes of running. Recall that we are selecting chromosomes 21 and 22 from a background of long reads (all tables were generated using readfish summary).
| contig | number | sum | min | max | std | mean | median | N50 |
|---|---|---|---|---|---|---|---|---|
| chr1 | 2045 | 8031506 | 220 | 318254 | 15566 | 3927 | 1476 | 26513 |
| chr10 | 1109 | 4723969 | 263 | 261207 | 14559 | 4260 | 1592 | 27313 |
| chr11 | 1232 | 4754809 | 213 | 304465 | 16228 | 3859 | 1314 | 38943 |
| chr12 | 1050 | 4526674 | 261 | 166256 | 12536 | 4311 | 1508 | 23582 |
| chr13 | 684 | 3126069 | 184 | 299397 | 18358 | 4570 | 1573 | 35034 |
| chr14 | 796 | 4263462 | 242 | 249680 | 18806 | 5356 | 1502 | 37446 |
| chr15 | 994 | 5240288 | 245 | 187955 | 17111 | 5272 | 1489 | 48036 |
| chr16 | 429 | 2702573 | 233 | 180260 | 16343 | 6300 | 1841 | 33347 |
| chr17 | 574 | 3453521 | 271 | 388105 | 23709 | 6017 | 1482 | 69464 |
| chr18 | 538 | 3873005 | 349 | 274407 | 24659 | 7199 | 1424 | 70263 |
| chr19 | 483 | 2625211 | 248 | 163416 | 16557 | 5435 | 1564 | 43457 |
| chr2 | 1402 | 8174215 | 220 | 303798 | 19553 | 5830 | 1526 | 42543 |
| chr20 | 342 | 2214472 | 225 | 209686 | 20661 | 6475 | 1456 | 55394 |
| chr21 | 57 | 1758058 | 347 | 254718 | 46708 | 30843 | 9409 | 83729 |
| chr22 | 69 | 851125 | 447 | 77401 | 15509 | 12335 | 5952 | 25811 |
| chr3 | 2119 | 7585521 | 197 | 325017 | 14512 | 3580 | 1412 | 25708 |
| chr4 | 1367 | 8772764 | 211 | 307864 | 23260 | 6418 | 1605 | 64709 |
| chr5 | 1527 | 6629025 | 221 | 223762 | 15298 | 4341 | 1385 | 40421 |
| chr6 | 1450 | 6101223 | 236 | 260918 | 15773 | 4208 | 1514 | 28634 |
| chr7 | 1291 | 5812463 | 155 | 350180 | 16907 | 4502 | 1540 | 34863 |
| chr8 | 1001 | 4849272 | 214 | 317181 | 19480 | 4844 | 1420 | 50186 |
| chr9 | 1113 | 5505104 | 219 | 485498 | 21692 | 4946 | 1504 | 40112 |
| chrM | 84 | 298104 | 276 | 16409 | 4132 | 3549 | 1694 | 9137 |
| chrX | 941 | 5713488 | 213 | 320496 | 22251 | 6072 | 1409 | 65532 |
| chrY | 5 | 138365 | 2043 | 87799 | 35701 | 27673 | 14315 | 87799 |
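For readers unfamiliar with the columns, the sketch below shows one way such per-contig read length statistics could be computed from a list of read lengths. It is not taken from readfish: the function names are hypothetical, and the N50 definition used (the read length at which reads of that length or longer contain at least half of all bases) is the common one, assumed rather than confirmed to match readfish summary exactly. The std column is left out because the exact estimator used is not stated here.

```python
from statistics import mean, median


def n50(lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no read lengths supplied")


def summarise(lengths):
    """Summarise a list of read lengths for a single contig."""
    return {
        "number": len(lengths),
        "sum": sum(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "N50": n50(lengths),
    }


# Hypothetical read lengths, not data from the experiments above.
print(summarise([400, 600, 1500, 2500, 90000]))
```

For selective sequencing, the median is the most telling column: on rejected contigs it approximates how much of a typical read was sequenced before it was unblocked, which is why shorter medians on off-target chromosomes indicate faster decisions.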
There are many caveats to this experiment, in particular that we are running in playback mode, which places the system under more strain.
However, switching to the latest high accuracy model (in Guppy 4.0.11) gives significantly worse performance for read until:
| contig | number | sum | min | max | std | mean | median | N50 |
|---|---|---|---|---|---|---|---|---|
| chr1 | 733 | 8285664 | 220 | 406507 | 25659 | 11304 | 3180 | 42161 |
| chr10 | 481 | 4678846 | 267 | 244576 | 20463 | 9727 | 3683 | 27964 |
| chr11 | 551 | 4802221 | 240 | 304389 | 22116 | 8715 | 2629 | 40606 |
| chr12 | 453 | 4592356 | 262 | 98353 | 16855 | 10138 | 3606 | 35292 |
| chr13 | 187 | 3172222 | 229 | 299446 | 34624 | 16964 | 4877 | 61316 |
| chr14 | 309 | 4297006 | 273 | 249680 | 27665 | 13906 | 4391 | 39615 |
| chr15 | 499 | 5235844 | 238 | 187959 | 21503 | 10493 | 3502 | 44867 |
| chr16 | 167 | 2716865 | 233 | 180217 | 26613 | 16269 | 4395 | 48020 |
| chr17 | 325 | 3481102 | 258 | 164804 | 21736 | 10711 | 3388 | 42301 |
| chr18 | 263 | 3798711 | 282 | 248363 | 30222 | 14444 | 3656 | 53406 |
| chr19 | 265 | 2637241 | 197 | 92922 | 17029 | 9952 | 2941 | 43151 |
| chr2 | 620 | 8242404 | 220 | 303798 | 26675 | 13294 | 4113 | 43128 |
| chr20 | 197 | 2054304 | 225 | 120132 | 17918 | 10428 | 3924 | 36266 |
| chr21 | 62 | 1886326 | 347 | 254678 | 43852 | 30425 | 11500 | 77745 |
| chr22 | 76 | 867581 | 258 | 77474 | 15408 | 11416 | 4956 | 23706 |
| chr3 | 811 | 7560099 | 192 | 278385 | 21312 | 9322 | 3227 | 38050 |
| chr4 | 586 | 8735471 | 211 | 307864 | 32165 | 14907 | 3635 | 59877 |
| chr5 | 647 | 6704169 | 221 | 301482 | 22082 | 10362 | 3133 | 40230 |
| chr6 | 506 | 6125346 | 213 | 252947 | 25217 | 12105 | 3764 | 43742 |
| chr7 | 482 | 5865795 | 155 | 151215 | 21287 | 12170 | 4046 | 40745 |
| chr8 | 446 | 4878406 | 232 | 334993 | 26734 | 10938 | 2792 | 46069 |
| chr9 | 488 | 5584655 | 219 | 246513 | 27296 | 11444 | 3592 | 43346 |
| chrM | 53 | 300841 | 589 | 16444 | 4908 | 5676 | 3389 | 11734 |
| chrX | 467 | 5773783 | 192 | 184122 | 24322 | 12364 | 3060 | 53785 |
| chrY | 6 | 144364 | 2535 | 66157 | 23474 | 24061 | 19538 | 31673 |
This compares unfavourably with our previous experiments using guppy 3.4.5, which gave better performance as measured by the median read lengths:
| contig | number | sum | min | max | std | mean | median | N50 |
|---|---|---|---|---|---|---|---|---|
| chr1 | 1326 | 4187614 | 142 | 224402 | 14007 | 3158 | 795 | 48026 |
| chr10 | 804 | 2843010 | 275 | 248168 | 15930 | 3536 | 842 | 47764 |
| chr11 | 672 | 2510741 | 184 | 310591 | 18572 | 3736 | 841 | 73473 |
| chr12 | 871 | 2317742 | 292 | 116848 | 9929 | 2661 | 825 | 37159 |
| chr13 | 391 | 1090012 | 227 | 189103 | 12690 | 2788 | 781 | 41292 |
| chr14 | 469 | 2323329 | 275 | 251029 | 20107 | 4954 | 830 | 68887 |
| chr15 | 753 | 2189326 | 180 | 154830 | 12371 | 2907 | 812 | 40686 |
| chr16 | 522 | 1673329 | 218 | 166941 | 12741 | 3206 | 862 | 39258 |
| chr17 | 484 | 1609208 | 191 | 169651 | 15777 | 3325 | 816 | 73019 |
| chr18 | 483 | 1525953 | 230 | 252901 | 14414 | 3159 | 813 | 40090 |
| chr19 | 664 | 1898289 | 249 | 171742 | 13181 | 2859 | 820 | 46271 |
| chr2 | 1474 | 4279420 | 234 | 222310 | 13090 | 2903 | 820 | 43618 |
| chr20 | 489 | 1622910 | 229 | 171322 | 13223 | 3319 | 887 | 33669 |
| chr21 | 32 | 1221224 | 1053 | 223477 | 56923 | 38163 | 13238 | 112200 |
| chr22 | 47 | 724863 | 244 | 184049 | 28113 | 15423 | 6781 | 33464 |
| chr3 | 1142 | 3554814 | 243 | 247771 | 15173 | 3113 | 760 | 62683 |
| chr4 | 1224 | 4402210 | 210 | 221084 | 15769 | 3597 | 820 | 66686 |
| chr5 | 1371 | 4495150 | 205 | 330821 | 16699 | 3279 | 801 | 65394 |
| chr6 | 978 | 2725891 | 246 | 146169 | 10995 | 2787 | 791 | 37791 |
| chr7 | 1039 | 3027136 | 166 | 263043 | 14705 | 2914 | 798 | 56567 |
| chr8 | 848 | 2581406 | 238 | 229150 | 15618 | 3044 | 772 | 44498 |
| chr9 | 893 | 3028224 | 259 | 247975 | 16011 | 3391 | 802 | 54953 |
| chrM | 144 | 216047 | 215 | 20731 | 2562 | 1500 | 864 | 1391 |
| chrX | 868 | 3124552 | 238 | 192451 | 15594 | 3600 | 832 | 49047 |
| chrY | 8 | 47071 | 510 | 31654 | 10743 | 5884 | 1382 | 31654 |
The changes made to guppy between versions 3.4.5 and 4.0.11 include the introduction of a method that automatically determines the scaling of the read signal, based on the assumption that a read begins with the adapter sequence. Whilst this assumption is true for a live run, it is not the case for a playback run. We therefore deactivated the adapter detection and tested playback on both FAST and HAC models.
| contig | number | sum | min | max | std | mean | median | N50 |
|---|---|---|---|---|---|---|---|---|
| chr1 | 3053 | 7849544 | 207 | 316449 | 14106 | 2571 | 653 | 55618 |
| chr10 | 1658 | 4471927 | 154 | 255793 | 14379 | 2697 | 666 | 67711 |
| chr11 | 1907 | 4613111 | 204 | 282876 | 13041 | 2419 | 656 | 54966 |
| chr12 | 2120 | 4334063 | 248 | 171080 | 9222 | 2044 | 665 | 36669 |
| chr13 | 901 | 3026257 | 229 | 282446 | 18711 | 3359 | 643 | 111894 |
| chr14 | 1364 | 4308992 | 243 | 196186 | 16013 | 3159 | 660 | 89564 |
| chr15 | 2305 | 5079309 | 206 | 247184 | 11871 | 2204 | 645 | 48634 |
| chr16 | 1102 | 2603298 | 169 | 231980 | 13117 | 2362 | 649 | 54676 |
| chr17 | 1073 | 3381299 | 193 | 380211 | 19403 | 3151 | 658 | 94107 |
| chr18 | 1323 | 3671357 | 250 | 270706 | 15743 | 2775 | 651 | 71862 |
| chr19 | 1174 | 2512558 | 198 | 218797 | 11756 | 2140 | 653 | 52515 |
| chr2 | 3243 | 7913548 | 211 | 302877 | 13731 | 2440 | 648 | 54667 |
| chr20 | 897 | 2120284 | 223 | 209421 | 12436 | 2364 | 665 | 57498 |
| chr21 | 53 | 1600159 | 352 | 254804 | 50881 | 30192 | 9346 | 83729 |
| chr22 | 78 | 905753 | 259 | 77481 | 15722 | 11612 | 5048 | 23706 |
| chr3 | 2907 | 7328374 | 116 | 307554 | 14944 | 2521 | 652 | 71390 |
| chr4 | 3011 | 8590821 | 211 | 318854 | 16496 | 2853 | 653 | 71679 |
| chr5 | 2650 | 6477399 | 164 | 293380 | 13622 | 2444 | 642 | 62821 |
| chr6 | 2429 | 5855306 | 179 | 228499 | 13485 | 2411 | 654 | 69407 |
| chr7 | 2394 | 5601925 | 239 | 406765 | 15404 | 2340 | 652 | 61445 |
| chr8 | 2245 | 4745428 | 129 | 363992 | 13443 | 2114 | 658 | 38961 |
| chr9 | 1699 | 5401114 | 158 | 481317 | 20247 | 3179 | 644 | 82342 |
| chrM | 267 | 273714 | 263 | 12193 | 1498 | 1025 | 693 | 845 |
| chrX | 2204 | 5554044 | 213 | 315880 | 14960 | 2520 | 654 | 71840 |
| chrY | 11 | 51956 | 608 | 31673 | 9629 | 4723 | 880 | 31673 |

| contig | number | sum | min | max | std | mean | median | N50 |
|---|---|---|---|---|---|---|---|---|
| chr1 | 3013 | 7962927 | 197 | 271757 | 6733 | 2643 | 2299 | 2655 |
| chr10 | 1641 | 4396876 | 242 | 131971 | 4557 | 2679 | 2410 | 2698 |
| chr11 | 1876 | 4727392 | 270 | 109019 | 3702 | 2520 | 2310 | 2636 |
| chr12 | 1831 | 4506902 | 251 | 55746 | 1806 | 2461 | 2351 | 2591 |
| chr13 | 1036 | 3095878 | 229 | 193423 | 8486 | 2988 | 2373 | 2791 |
| chr14 | 1556 | 4184954 | 232 | 180901 | 5438 | 2690 | 2390 | 2677 |
| chr15 | 1958 | 5245017 | 243 | 146558 | 5834 | 2679 | 2271 | 2666 |
| chr16 | 1066 | 2678811 | 226 | 100585 | 3180 | 2513 | 2364 | 2585 |
| chr17 | 1300 | 3434603 | 248 | 100618 | 4581 | 2642 | 2314 | 2666 |
| chr18 | 1456 | 3718745 | 313 | 153856 | 4601 | 2554 | 2312 | 2603 |
| chr19 | 1044 | 2543057 | 250 | 147737 | 5795 | 2436 | 2101 | 2482 |
| chr2 | 3323 | 8078612 | 202 | 98317 | 3159 | 2431 | 2248 | 2570 |
| chr20 | 771 | 2140444 | 225 | 165325 | 6099 | 2776 | 2522 | 2816 |
| chr21 | 55 | 1753684 | 347 | 254718 | 51230 | 31885 | 9409 | 84350 |
| chr22 | 72 | 880197 | 257 | 77474 | 16022 | 12225 | 5094 | 31103 |
| chr3 | 2919 | 7487377 | 197 | 166069 | 5487 | 2565 | 2315 | 2658 |
| chr4 | 3217 | 8748614 | 211 | 258923 | 6589 | 2719 | 2342 | 2696 |
| chr5 | 2589 | 6745847 | 221 | 289687 | 6848 | 2606 | 2280 | 2629 |
| chr6 | 2395 | 6001052 | 257 | 110848 | 3591 | 2506 | 2308 | 2611 |
| chr7 | 2273 | 5755082 | 242 | 75645 | 2756 | 2532 | 2392 | 2637 |
| chr8 | 1845 | 4830634 | 235 | 369405 | 8924 | 2618 | 2300 | 2686 |
| chr9 | 2055 | 5477756 | 219 | 220080 | 6462 | 2666 | 2290 | 2666 |
| chrM | 133 | 296723 | 283 | 5449 | 939 | 2231 | 2290 | 2604 |
| chrX | 2235 | 5652742 | 192 | 150517 | 4362 | 2529 | 2313 | 2595 |
| chrY | 9 | 47385 | 1144 | 31673 | 9925 | 5265 | 1967 | 31673 |
As can be seen, deactivating the adapter scaling leads to a significant reduction in the median rejected read length. The reduction in basecalling quality is minimal, so the resultant basecalls are still more than sufficient to provide an accurate mapping of the read.
At this time we recommend users choose the FAST model for running readfish. Users may wish to experiment with deactivating adapter scaling. To do this, users can edit the appropriate config files in the guppy data directory. We strongly recommend saving these edited files with a new name. This allows you to run different configs for readfish basecalling without interrupting normal MinKNOW/GUPPY basecalling.
To do this, edit the config file so that the line:
as_model_file = adapter_scaling_dna_r9.4.1_min.jsn
becomes:
#as_model_file = adapter_scaling_dna_r9.4.1_min.jsn
This will remove the adapter scaling.
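The edit can also be scripted so that the original config is never modified. The following is a minimal sketch, not an official guppy or readfish tool: the function names are hypothetical, the file names in the usage comment are placeholders for files in your own guppy data directory, and it assumes the setting sits on a single line that starts with as_model_file, as shown above.

```python
from pathlib import Path


def comment_out_adapter_scaling(cfg_text):
    """Return cfg_text with any active as_model_file line commented out.

    Lines that are already commented (starting with '#') are left unchanged.
    """
    out = []
    for line in cfg_text.splitlines(keepends=True):
        if line.lstrip().startswith("as_model_file"):
            line = "#" + line
        out.append(line)
    return "".join(out)


def write_readfish_config(source_cfg, target_cfg):
    """Write an edited copy of source_cfg to target_cfg, refusing to overwrite."""
    source_cfg, target_cfg = Path(source_cfg), Path(target_cfg)
    if target_cfg.exists():
        raise FileExistsError(f"refusing to overwrite {target_cfg}")
    target_cfg.write_text(comment_out_adapter_scaling(source_cfg.read_text()))


# Hypothetical usage; both paths are placeholders:
# write_readfish_config("dna_r9.4.1_450bps_fast.cfg",
#                       "dna_r9.4.1_450bps_fast_noscale.cfg")
```

The new config name (without its extension) would then presumably be what goes in config_name in the readfish toml file, leaving the stock config untouched for normal basecalling.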