Chunks.npy and Dataset.py not being generated #385

Open
Sgreenfield9 opened this issue Apr 1, 2024 · 3 comments

@Sgreenfield9

Hello, I'm trying to train a basecaller using DNA that has been run through an RNA pore. When I run the following code:

bonito basecaller [email protected] --min-accuracy-save-ctc 0 --reference /home/remote /data/minknow/PolyA_DNA_SG/PolyA_DNA_SG/20240320_1335_P2S-01618-B_PAU71604_94a542e0/fast5_pass > /home/remote/basecalls.sam

I receive the following output:

`> calling: 100%|###########################################9| 8969/8979 [15:08<0
> completed reads: 8979
> duration: 0:15:13
> samples per second 1.6E+06
> done`

No errors are thrown, so I assumed everything was going fine. The issue arises when I try to run the subsequent bonito train command:

bonito train --epochs 1 --lr 5e-4 --pretrained [email protected] --directory /home/remote /home/remote/fine-tuned-model
`[loading model]
[using pretrained model [email protected]]
[loading data]
Traceback (most recent call last):
File "/home/remote/.local/lib/python3.8/site-packages/bonito/cli/train.py", line 58, in main
train_loader_kwargs, valid_loader_kwargs = load_numpy(
File "/home/remote/.local/lib/python3.8/site-packages/bonito/data.py", line 40, in load_numpy
train_data = load_numpy_datasets(limit=limit, directory=directory)
File "/home/remote/.local/lib/python3.8/site-packages/bonito/data.py", line 66, in load_numpy_datasets
chunks = np.load(os.path.join(directory, "chunks.npy"), mmap_mode='r')
File "/home/remote/.local/lib/python3.8/site-packages/numpy/lib/npyio.py", line 405, in load
fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/home/remote/chunks.npy'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/remote/.local/bin/bonito", line 8, in <module>
sys.exit(main())
File "/home/remote/.local/lib/python3.8/site-packages/bonito/__init__.py", line 34, in main
args.func(args)
File "/home/remote/.local/lib/python3.8/site-packages/bonito/cli/train.py", line 62, in main
train_loader_kwargs, valid_loader_kwargs = load_script(
File "/home/remote/.local/lib/python3.8/site-packages/bonito/data.py", line 31, in load_script
spec.loader.exec_module(module)
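File "<frozen importlib._bootstrap_external>", line 844, in exec_module
File "<frozen importlib._bootstrap_external>", line 980, in get_code
File "<frozen importlib._bootstrap_external>", line 1037, in get_data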
FileNotFoundError: [Errno 2] No such file or directory: '/home/remote/dataset.py'`

When I look at /home/remote, the directory I wrote my files to, I find that only a .sam file has been generated and chunks.npy has not. Is chunks.npy not being written, or is it being written to another location? Any help would be greatly appreciated.
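
A quick way to check which training files are present is sketched below. The three file names come from the `--save-ctc` log output quoted later in this thread; the helper name `missing_training_files` is hypothetical and not part of bonito.

```python
from pathlib import Path

# File names as reported in the --save-ctc log output quoted in this thread.
EXPECTED_FILES = ("chunks.npy", "references.npy", "reference_lengths.npy")


def missing_training_files(directory):
    """Return the expected training files that are absent from `directory`.

    Hypothetical helper for illustration; it only checks for file existence.
    """
    root = Path(directory)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_training_files("/home/remote")` returns an empty list only when all three files exist in that directory.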

@lkwhite

lkwhite commented Apr 1, 2024

Hi all, I am also trying to troubleshoot this issue with Sam, and it's a bit unclear what files should be generated during this step.

bonito basecaller dna_r9.4.1 --save-ctc --reference reference.mmi /data/reads > /data/training/ctc-data/basecalls.sam

Is there a test dataset available that users can work through from the beginning to see what the expected outputs of calling bonito basecaller should be? I see that you have a pre-prepared dataset users can try if they don't have their own reads, but in this case we are trying to prepare our own reads and understand what these errors mean. It's a bit confusing since I do not see any reference to dataset.py in the source code itself.
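
For what it's worth, the traceback above suggests where dataset.py comes from: `bonito train` first tries to load chunks.npy from `--directory` and, when that fails, falls back to importing a dataset.py script from the same directory. The sketch below mimics that order of calls using only the function names visible in the traceback. It is an illustration under that assumption, not bonito's actual code, and `load_training_data` is a hypothetical name.

```python
import importlib.util
import os

import numpy as np


def load_training_data(directory):
    """Illustrative fallback: try chunks.npy first, then a dataset.py script.

    Mirrors the call order seen in the traceback; not bonito's implementation.
    """
    try:
        # First attempt: pre-built numpy arrays in the training directory.
        return np.load(os.path.join(directory, "chunks.npy"), mmap_mode="r")
    except FileNotFoundError:
        # Second attempt: import a user-supplied dataset.py from the same directory.
        script = os.path.join(directory, "dataset.py")
        spec = importlib.util.spec_from_file_location("dataset", script)
        module = importlib.util.module_from_spec(spec)
        # Raises FileNotFoundError if dataset.py does not exist either.
        spec.loader.exec_module(module)
        return module
```

With neither file present, the second `FileNotFoundError` is raised while handling the first, which produces the chained "During handling of the above exception" traceback shown in the original report.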

@iiSeymour
Member

@Sgreenfield9 your reads need to map to your reference for any training data to be created.

Can you confirm your reads map? You can check in the basecalls_summary.csv that is created.

Also, note that you seem to be passing /home/remote as --reference which looks like a mistake.
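
As an alternative to reading the summary file, mapping can also be checked directly in the SAM output. The sketch below counts mapped and unmapped records using the FLAG column, where bit 0x4 marks an unmapped read in the SAM format. The function name `count_mapped_reads` is hypothetical, and the counts are per record, so secondary or supplementary alignments would be counted separately.

```python
def count_mapped_reads(sam_path):
    """Count mapped and unmapped records in a SAM file.

    Header lines start with '@'. FLAG is the second tab-separated column,
    and bit 0x4 marks an unmapped read.
    """
    mapped = unmapped = 0
    with open(sam_path) as handle:
        for line in handle:
            # Skip header lines and blank lines.
            if line.startswith("@") or not line.strip():
                continue
            flag = int(line.split("\t")[1])
            if flag & 0x4:
                unmapped += 1
            else:
                mapped += 1
    return mapped, unmapped
```

If `count_mapped_reads("/home/remote/basecalls.sam")` reports zero mapped records, no training data can be expected, which is consistent with the point above about reads needing to map to the reference.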

@sparkcyf

I found that enabling the --rna parameter during basecalling is required to produce chunks, because bonito needs to reverse the reference sequence during chunk creation in RNA basecalling.

bonito/bonito/io.py

Lines 583 to 587 in 0c7fcce

# RNA basecall already reversed. Flip back to signal-oriented for training.
targets.append(target[::-1] if self.rna else target)
chunks.append(read.signal)
lengths.append(len(target))
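
To make the effect of that `target[::-1] if self.rna else target` line concrete, here is a minimal standalone sketch. The function name `orient_target` and the integer-encoded example target are hypothetical; only the conditional reversal itself is taken from the excerpt.

```python
def orient_target(target, rna):
    """Return the target reversed when `rna` is True, unchanged otherwise.

    Same conditional slice as in the io.py excerpt; works on any sequence
    that supports slicing (lists, strings, NumPy arrays).
    """
    return target[::-1] if rna else target


if __name__ == "__main__":
    example = [1, 2, 3, 4]  # hypothetical integer-encoded bases
    print(orient_target(example, rna=True))
    print(orient_target(example, rna=False))
```

With `rna=True` the example target `[1, 2, 3, 4]` comes back as `[4, 3, 2, 1]`; with `rna=False` it is returned unchanged.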

with --rna:

bonito basecaller [email protected] \
--device cuda:0 \
--save-ctc \
--reference hg38.fa \
--min-accuracy-save-ctc 0.65 --chunksize 3600 --rna \
fast5/ > bonito_rna004_130bps_fast.sam
DeprecationWarning: fast5 support will be deprecated in a future bonito version. Please use pod5
> reading fast5
> outputting aligned sam
> loading model [email protected]
> loading reference
> Chunks rejected from training data:                                                               
 - no_mapping: 22464
 - low_coverage0.90: 1812
 - N_in_sequence: 5
> written ctc training data to sam
  - chunks.npy with shape (8351,3600)
  - references.npy with shape (8351,146)
  - reference_lengths.npy shape (8351)
> completed reads: 32766
> duration: 0:00:19
> samples per second 6.1E+06
> done

without --rna:

> reading fast5
> outputting aligned sam
> loading model [email protected]
> loading reference
> no suitable ctc data to write                                                                     
> completed reads: 32766
> duration: 0:00:18
> samples per second 6.6E+06
> done
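
Once the files are written, a small sanity check along these lines can confirm that the three arrays agree on the number of chunks, as they do in the log above (8351 in each). This is a sketch assuming the files sit together in one directory; `check_ctc_shapes` is a hypothetical helper, not a bonito function.

```python
import os

import numpy as np


def check_ctc_shapes(directory):
    """Load the three training arrays and verify they share a first dimension.

    Returns the three shapes; raises ValueError if the chunk counts differ.
    """
    chunks = np.load(os.path.join(directory, "chunks.npy"), mmap_mode="r")
    references = np.load(os.path.join(directory, "references.npy"), mmap_mode="r")
    lengths = np.load(os.path.join(directory, "reference_lengths.npy"), mmap_mode="r")

    n_chunks = chunks.shape[0]
    if references.shape[0] != n_chunks or lengths.shape[0] != n_chunks:
        raise ValueError(
            f"inconsistent chunk counts: chunks={chunks.shape}, "
            f"references={references.shape}, reference_lengths={lengths.shape}"
        )
    return chunks.shape, references.shape, lengths.shape
```

For the run shown above, the expected shapes would be (8351, 3600), (8351, 146) and (8351,).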
