> no suitable ctc data to write when basecalling with save-ctc flag #308

Open
HanielF opened this issue Oct 1, 2022 · 3 comments

Comments

@HanielF

HanielF commented Oct 1, 2022

Hi~ I am trying to train a bonito model from scratch.

To obtain my training data, I basecalled the reads with the command below:

data_path="$HOME/codes/bonito/data"  # a quoted ~ is not expanded by the shell, so use $HOME
key="Acinetobacter_baumannii_AYP-A2"

bonito basecaller "[email protected]" "${data_path}/train_data/${key}" \
   --save-ctc \
   --reference "${data_path}/train_data/${key}/read_references.fasta" \
   --batchsize 100 \
   --recursive \
   > "${data_path}/train_data/${key}/calls.sam"

This raises the error "> no suitable ctc data to write" in bonito.io.CTCWriter.run(), which means that none of the reads passed the checks.
To improve the accuracy, I replaced the fast model with [email protected]. Several hundred reads passed the checks this time.

It is also odd that only 413 reads were saved as CTC data, out of a total of 100960 input reads.

I checked the code in bonito.io.CTCWriter and found that most of the reads are filtered out by self.min_accuracy, which is set to 0.99 by default.
Here are the filtering statistics for the CTC data:

......
> ctc results filtering: : 100960it [03:49, 440.28it/s]
Continue! self.min_accuracy: 0.99, acc: 0.8582375478927203 
Continue! self.min_accuracy: 0.99, acc: 0.9616935483870968 
Continue! self.min_accuracy: 0.99, acc: 0.960446247464503 
Continue! self.min_accuracy: 0.99, acc: 0.943952802359882 
Continue! self.min_accuracy: 0.99, acc: 0.9434968017057569 
Continue! self.min_accuracy: 0.99, acc: 0.967741935483871 
Continue! self.min_accuracy: 0.99, acc: 0.9537401574803149 
=== Error analysis ===
seq_err: 0, mapping_err: 38, refseq_err: 0, acc_err: 100501, cov_err: 0
> written ctc training data
  - chunks.npy with shape (413,10000)
  - references.npy with shape (413,1053)
  - reference_lengths.npy shape (413)
> completed reads: 100960
> duration: 0:03:49
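
As a rough way to read the error summary above, the counters can be turned into shares of the input. This is only a minimal sketch: the "=== Error analysis ===" line appears to be a debug print added by the poster rather than stock bonito output, so the field names come from this log alone, and `rejection_shares` is a hypothetical helper.

```python
import re

def rejection_shares(summary_line, total_reads):
    """Parse 'name: count' pairs and express each count as a share of total_reads."""
    counts = {
        name: int(value)
        for name, value in re.findall(r"(\w+):\s*(\d+)", summary_line)
    }
    return {name: count / total_reads for name, count in counts.items()}

# Summary line and read total copied from the log above.
summary = "seq_err: 0, mapping_err: 38, refseq_err: 0, acc_err: 100501, cov_err: 0"
shares = rejection_shares(summary, total_reads=100960)

for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

On these numbers the accuracy check alone accounts for more than 99% of the input reads, while mapping failures are negligible.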

Clearly this is not enough CTC training data, even though this is only one of the 50 genomes in the whole training set.
Do I need to lower the value of the min_accuracy parameter?
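
Before lowering the threshold, it may help to estimate how many reads each candidate value would let through. The sketch below is an assumption-laden illustration: it parses the "Continue! ..." debug lines shown in the log above (which look like prints added by the poster, not bonito output), it assumes a read passes when its accuracy is at least the threshold, and `parse_accuracies` and `pass_counts` are hypothetical helpers. The sample contains only the seven rejected reads visible in the log, so it is not representative of the full run.

```python
import re

# Matches lines such as: "Continue! self.min_accuracy: 0.99, acc: 0.9616935483870968"
ACC_LINE = re.compile(r"self\.min_accuracy:\s*([0-9.]+),\s*acc:\s*([0-9.]+)")

def parse_accuracies(log_text):
    """Return the per-read accuracies found in the filtering log."""
    return [float(match.group(2)) for match in ACC_LINE.finditer(log_text)]

def pass_counts(accuracies, thresholds):
    """For each threshold, count reads whose accuracy is >= that threshold."""
    return {t: sum(acc >= t for acc in accuracies) for t in thresholds}

# The seven debug lines visible in the log above.
sample_log = """
Continue! self.min_accuracy: 0.99, acc: 0.8582375478927203
Continue! self.min_accuracy: 0.99, acc: 0.9616935483870968
Continue! self.min_accuracy: 0.99, acc: 0.960446247464503
Continue! self.min_accuracy: 0.99, acc: 0.943952802359882
Continue! self.min_accuracy: 0.99, acc: 0.9434968017057569
Continue! self.min_accuracy: 0.99, acc: 0.967741935483871
Continue! self.min_accuracy: 0.99, acc: 0.9537401574803149
"""

accuracies = parse_accuracies(sample_log)
counts = pass_counts(accuracies, [0.99, 0.95, 0.90, 0.85])
for threshold, passed in counts.items():
    print(f"min_accuracy={threshold}: {passed} of {len(accuracies)} reads pass")
```

Running the same tally over the full log would show how sharply the yield depends on the threshold. A later comment in this thread passes the value on the command line with --min-accuracy-save-ctc.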

@andreaswallberg

Dear @HanielF and @iiSeymour

We appear to be in a similar position. We have several million reads as input, but only tens of thousands appear to be used. My expectation was that much more of the raw data could be used.

Any suggestions on how to overcome this?

@mark-sor2

Hi @iiSeymour,

I have the same issue. I have Pod5 files which contain multiplexed reads.

  1. In a first experiment I generated subset pod5 files, each containing the reads for one unique reference sequence (I used the read IDs after demultiplexing to do this, although some reads were not found in the pod5 files).

bonito basecaller [email protected] --min-accuracy-save-ctc 0.7 --save-ctc --reference ./Fasta/myRef.mmi ./Pod5/myRefPod5/ > ./ctcData/myRef.sam

The basecalling stops after a few seconds and doesn't seem to work. I got the following result:

reading pod5
outputting aligned sam
loading model [email protected]
model basecaller params: {'batchsize': 32, 'chunksize': 9996, 'overlap': 492, 'quantize': None}
loading reference
read scaling: {'strategy': 'pa'}
no suitable ctc data to write
completed reads: 61
duration: 0:00:05
samples per second 1.3E+05
done

  2. In a second experiment I used all the pod5 files (all the multiplexed reads) and provided one reference. I assumed that only the reads matching that reference would remain after alignment. In this case the basecalling seems to work for a few minutes but then stops (at ~15% of the operation) and reports "no suitable ctc data to write". Below are the command I used and the resulting output:

bonito basecaller [email protected] --min-accuracy-save-ctc 0.7 --save-ctc --reference ./Fasta/myRef.mmi ./Pod5/allPod5/ > ./ctcData/myRef.sam

reading pod5
outputting aligned sam
loading model [email protected]
model basecaller params: {'batchsize': 96, 'chunksize': 9996, 'overlap': 492, 'quantize': None}
loading reference
read scaling: {'strategy': 'pa'}
no suitable ctc data to write
completed reads: 78814
duration: 0:06:59
samples per second 1.9E+06
done

My reference sequence contains around 400 bases, and the .mmi file was generated with the command "minimap2 -x map-ont -d ./Fasta/myRef.mmi ./Fasta/myRef.fasta". I do not understand why I cannot perform the basecalling with the --save-ctc flag. Note that basecalling works well without this flag.

Thanks in advance for your answer.

@sosoJuly822

I have the same issue. I tried two datasets of different quality, but even with the slightly higher-quality dataset, the output .bam file was still empty after waiting for 4 hours.

I carefully reviewed the basecaller.py file and found that the process of generating labels involves segmenting the electrical signals based on the chunksize, performing basecalling for each chunk, and finally saving the results based on the score. I'm not sure what went wrong during this process.
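
The chunking step described above can be sketched as follows. This is an illustrative model, not bonito's implementation: `chunk_starts` and `expected_samples` are hypothetical helpers, the default chunksize and overlap are taken from the "model basecaller params" line logged earlier in this thread, and the translocation speed and sampling rate in the example are assumed values that the thread does not state. It also assumes that only full-length chunks are kept, which the thread does not confirm.

```python
def expected_samples(bases, bases_per_second, sample_rate_hz):
    """Approximate number of signal samples produced by a read of `bases` bases."""
    return round(bases / bases_per_second * sample_rate_hz)

def chunk_starts(signal_length, chunksize=9996, overlap=492):
    """Start offsets of full-length, overlapping windows over a signal.

    Returns an empty list when the signal is shorter than one chunk
    (assumption: partial chunks are discarded rather than padded).
    """
    if signal_length < chunksize:
        return []
    step = chunksize - overlap
    return list(range(0, signal_length - chunksize + 1, step))

# Assumed, illustrative values: 400 bases at 400 bases/s sampled at 5000 Hz.
samples = expected_samples(bases=400, bases_per_second=400, sample_rate_hz=5000)
short_read_chunks = chunk_starts(samples)
long_read_chunks = chunk_starts(20000)

print(f"short read: {samples} samples, {len(short_read_chunks)} chunks")
print(f"long read: 20000 samples, chunk starts at {long_read_chunks}")
```

Under those assumptions, a read covering a reference of about 400 bases would be shorter than a single chunk and so would contribute no chunks at all. It may be worth comparing the actual signal lengths in the pod5 files against the model's chunksize.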
