`> no suitable ctc data to write` when basecalling with `save-ctc` flag #308

HanielF · 2022-10-01T15:37:23Z

Hi~ I am trying to train a bonito model from scratch.

To obtain my training data, i basecalled the reads with the command as below:

data_path="~/codes/bonito/data"
key="Acinetobacter_baumannii_AYP-A2"

bonito basecaller "[email protected]" "${data_path}/train_data/${key}" \
   --save-ctc \
   --reference "${data_path}/train_data/${key}/read_references.fasta" \
   --batchsize 100 \
   --recursive \
   > "${data_path}/train_data/${key}/calls.sam"

It will raise an error, > no suitable ctc data to write, in bonito.io.CTCWriter.run(). It means that none of the reads can pass the checks.
To improve the accuracy, I replaced the fast model with [email protected]. Several hundred reads passed the checks this time.

Also, it's weird that only 413 reads were saved to CTC data, while a total of 100960 reads were input.

I checked the code in bonito.io.CTCWriter and found that most of the reads are filtered by self.min_accuracy, which is set to 0.99 by default.
Here is the statistical results for ctc-data:

......
> ctc results filtering: : 100960it [03:49, 440.28it/s]
Continue! self.min_accuracy: 0.99, acc: 0.8582375478927203 
Continue! self.min_accuracy: 0.99, acc: 0.9616935483870968 
Continue! self.min_accuracy: 0.99, acc: 0.960446247464503 
Continue! self.min_accuracy: 0.99, acc: 0.943952802359882 
Continue! self.min_accuracy: 0.99, acc: 0.9434968017057569 
Continue! self.min_accuracy: 0.99, acc: 0.967741935483871 
Continue! self.min_accuracy: 0.99, acc: 0.9537401574803149 
=== Error analysis ===
seq_err: 0, mapping_err: 38, refseq_err: 0, acc_err: 100501, cov_err: 0
> written ctc training data
  - chunks.npy with shape (413,10000)
  - references.npy with shape (413,1053)
  - reference_lengths.npy shape (413)
> completed reads: 100960
> duration: 0:03:49

Obviously, the CTC training data is not enough even though that is only one of 50 genomes for the whole training set.
Do I need to lower the value of the min_accuracy parameter?

The text was updated successfully, but these errors were encountered:

andreaswallberg · 2023-12-11T20:37:48Z

Dear @HanielF and @iiSeymour

We appear to be in a similar position. Several millions of reads as input, but only tens of thousands of reads appear to be used. My expectation was that much more of the raw data could be used.

Any suggestions on how to overcome this?

mark-sor2 · 2024-05-23T09:07:34Z

Hi @iiSeymour,

I have same issue. I have Pod5 which containes multiplexed reads.

In a first experiment I generated susbet pod5 where each one contains reads related to one unique reference sequence (I used the read id after demultiplexing to perform this operation, although some reads were not found in the pod5 files).

bonito basecaller [email protected] --min-accuracy-save-ctc 0.7 --save-ctc --reference ./Fasta/myRef.mmi ./Pod5/myRefPod5/ > ./ctcData/myRef.sam

The basecalling stop after few seconds and doesn't seems to work, I got next result :

reading pod5
outputting aligned sam
loading model [email protected]
model basecaller params: {'batchsize': 32, 'chunksize': 9996, 'overlap': 492, 'quantize': None}
loading reference
read scaling: {'strategy': 'pa'}
no suitable ctc data to write
completed reads: 61
duration: 0:00:05
samples per second 1.3E+05
done

In a second experiment I used all the pod5 (all the multiplexed reads) and provided one reference. I assumed that only the reads related to that reference will remain after the alignment operation. In this case the basecalling seems to works for a few minutes but then stops (at ~15% of the operaiton) and then it says "no suitable ctc data to write". below is the command I used and the resulting output :

bonito basecaller [email protected] --min-accuracy-save-ctc 0.7 --save-ctc --reference ./Fasta/myRef.mmi ./Pod5/allPod5/ > ./ctcData/myRef.sam

reading pod5
outputting aligned sam
loading model [email protected]
model basecaller params: {'batchsize': 96, 'chunksize': 9996, 'overlap': 492, 'quantize': None}
loading reference
read scaling: {'strategy': 'pa'}
no suitable ctc data to write
completed reads: 78814
duration: 0:06:59
samples per second 1.9E+06
done

My reference sequence contain around 400 bases, and the '.mmi' file was generated using ">minimap2 -x map-ont -d ./Fasta/myRef.mmi ./Fasta/myRef.fasta" command. I do not understand why I cannot perform the basecallinf using --save-ctc flag. Note that the basecalling performs well when not using this flag.

thanks in advance for your answer

sosoJuly822 · 2025-01-06T08:48:16Z

I have the same issue. I tried two datasets of different quality, but even with the slightly higher-quality dataset, the output .bam file was still empty after waiting for 4 hours.

I carefully reviewed the basecaller.py file and found that the process of generating labels involves segmenting the electrical signals based on the chunksize, performing basecalling for each chunk, and finally saving the results based on the score. I'm not sure what went wrong during this process.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`> no suitable ctc data to write` when basecalling with `save-ctc` flag #308

`> no suitable ctc data to write` when basecalling with `save-ctc` flag #308

HanielF commented Oct 1, 2022 •

edited

Loading

andreaswallberg commented Dec 11, 2023

mark-sor2 commented May 23, 2024

sosoJuly822 commented Jan 6, 2025

> no suitable ctc data to write when basecalling with save-ctc flag #308

> no suitable ctc data to write when basecalling with save-ctc flag #308

Comments

HanielF commented Oct 1, 2022 • edited Loading

andreaswallberg commented Dec 11, 2023

mark-sor2 commented May 23, 2024

sosoJuly822 commented Jan 6, 2025

`> no suitable ctc data to write` when basecalling with `save-ctc` flag #308

`> no suitable ctc data to write` when basecalling with `save-ctc` flag #308

HanielF commented Oct 1, 2022 •

edited

Loading