We are training our own models for non-model organisms using R9.4.1 data, focusing on data associated with protein-coding genes. When we basecall this data with bonito, using either our own trained model or the default models, the output reads are often only about half as long as the reads produced by guppy from the same FAST5 files. We also get many more reads, especially when using our own model. While the base-level quality of the reads is improving, we are concerned about the short resulting read lengths.
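To make the length discrepancy concrete, a minimal sketch for comparing read-length distributions between the two basecallers' FASTQ outputs (the file names here are placeholders, and the parser assumes plain 4-line FASTQ records):

```python
from statistics import median

def read_lengths(fastq_lines):
    """Return the length of each read, given FASTQ lines (4 lines per record)."""
    return [len(line.strip())
            for i, line in enumerate(fastq_lines) if i % 4 == 1]

# Hypothetical usage with the two basecallers' outputs (paths are placeholders):
# with open("bonito_calls.fastq") as fh:
#     bonito_lens = read_lengths(fh)
# with open("guppy_calls.fastq") as fh:
#     guppy_lens = read_lengths(fh)

# Tiny inline demonstration with two fake records:
demo = ["@r1", "ACGTACGT", "+", "IIIIIIII",
        "@r2", "ACGT", "+", "IIII"]
lens = read_lengths(demo)
print(len(lens), median(lens))  # 2 6.0
```

Comparing read count and median length side by side distinguishes "reads are being split or truncated" from "short reads are being added or filtered differently".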
We wonder whether the algorithm applies some form of clipping, e.g. basecalling proceeds only until the local quality score has dropped by some amount relative to the overall quality of the read, at which point calling terminates.
We have also noticed that older versions of bonito had default settings seemingly better tuned to R9 data, while newer versions appear adapted to R10 data. Perhaps this indirectly mirrors internal algorithmic changes as well, such that the current implementation of bonito no longer fits legacy R9.4.1 data as well as it once did.
For the legacy R9.4.1 data we are currently exploring, would you recommend using an older version of bonito?
The output of Bonito and Dorado (the successor to guppy) should be almost identical for the same released model, with marginal differences due to implementation details and optimisations. While R9 and R10 are different chemistries, there shouldn't be any algorithmic differences in basecalling between them and certainly there shouldn't be any regression in later versions of software.
Basecalling with any of our tools produces a sequence for the entire signal, and there is no mechanism for early termination of the output sequence. If the output q-score of a read is below a threshold quality then it will be omitted from the results, which may change the number of reads you see in your output.
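The filtering described above can be sketched as follows. This is an illustrative reimplementation, not the tools' actual code: it assumes the mean per-read q-score is computed in probability space (convert each Phred score to an error probability, average, convert back), which is how ONT tools typically report it, and the threshold value is a placeholder (dorado exposes a similar cutoff via `--min-qscore`):

```python
import math

def mean_qscore(quality_string, phred_offset=33):
    """Mean per-read q-score computed in probability space:
    -10 * log10(mean error probability) over all bases."""
    probs = [10 ** (-(ord(c) - phred_offset) / 10) for c in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))

MIN_QSCORE = 9  # illustrative threshold, not the actual default

# A read is emitted whole or not at all; it is never truncated.
qual = "IIIII###"  # mix of Q40 ('I') and Q2 ('#') base qualities
keep = mean_qscore(qual) >= MIN_QSCORE
print(round(mean_qscore(qual), 2), keep)
```

Note that a few low-quality bases dominate the probability-space mean, so a read can be dropped entirely even when most of its bases are high quality, which changes read counts but never read lengths.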
Perhaps you can share some further details of your experiment and what models you are trying to train? Do you have the results from a subset of basecalls with latest bonito/dorado that you could share?