
Trimming CTC data #253

Open
mbhall88 opened this issue May 7, 2022 · 3 comments
mbhall88 commented May 7, 2022

Hi,

I am having some issues relating to demultiplexing. I have trained a custom model that performs extremely well on the DNA of my species of interest, but it falls over when it comes to demultiplexing: I am losing more than half of my data to the dreaded "none" bin.

In #26 (comment) it was suggested that trimming the signal could improve this, which makes a lot of sense. However, the example there assumes trimming happens while chunkifying an HDF5 file with taiyaki.

I have the chunk data already (from basecalling with --save-ctc) and would like to trim this to achieve the same result as trimming the signal at the starts and the ends by some offset. (I basically want to get rid of signal that relates to the barcode.)

What I am struggling with is how best to do this, as I don't know what each of the chunkify output files actually contains.

For example, the reference_lengths.npy file has shape (35691,), references.npy has shape (35691, 482), and chunks.npy has shape (35691, 4000). How do each of these files relate to each other?

Let's say I want to trim 100 signal samples from the start and end of each read; how would I do this? (I am open to suggestions for offset sizes; 100 was an arbitrary number.)
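To make the question concrete, here is a minimal sketch of what I imagine the trimming would look like. It rests entirely on assumptions: that `chunks.npy` holds raw signal (one row per chunk), that `references.npy` holds zero-padded integer-encoded reference bases, that `reference_lengths.npy` gives each row's valid length, and that samples map to bases at a roughly uniform rate within a chunk (synthetic stand-in arrays are used below in place of `np.load`):

```python
import numpy as np

TRIM = 100  # signal samples to drop from each end (arbitrary choice)

# In practice these would come from np.load("chunks.npy") etc.;
# synthetic stand-ins with the shapes from this issue are used here.
rng = np.random.default_rng(0)
n_chunks, chunk_len, max_ref_len = 8, 4000, 482
chunks = rng.normal(size=(n_chunks, chunk_len)).astype(np.float32)
reference_lengths = rng.integers(200, max_ref_len, size=n_chunks)
references = np.zeros((n_chunks, max_ref_len), dtype=np.int8)
for i, L in enumerate(reference_lengths):
    references[i, :L] = rng.integers(1, 5, size=L)  # 1..4 = bases, 0 = pad

# 1) Trim the signal itself.
trimmed_chunks = chunks[:, TRIM:chunk_len - TRIM]

# 2) Estimate how many reference bases the trimmed signal spanned,
#    assuming a uniform samples-per-base rate within each chunk.
samples_per_base = chunk_len / reference_lengths
bases_to_trim = np.ceil(TRIM / samples_per_base).astype(int)

# 3) Drop that many bases from each end of the reference, re-padding with zeros.
trimmed_refs = np.zeros_like(references)
trimmed_lengths = reference_lengths - 2 * bases_to_trim
for i, (b, L) in enumerate(zip(bases_to_trim, reference_lengths)):
    kept = references[i, b:L - b]
    trimmed_refs[i, :kept.size] = kept

print(trimmed_chunks.shape)  # signal rows are now 2 * TRIM samples shorter
```

Whether step 2 is even approximately valid is exactly what I'm unsure about, since the true signal-to-base alignment within each chunk isn't available in these files (as far as I can tell).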


touala commented Nov 29, 2023

Any development on this issue? Thanks in advance.

mbhall88 (Author) commented

Sadly, no @touala. I never got any response here. I ended up having to abandon my project because of this; I tried many different ways of trimming the data but unfortunately couldn't fix the demultiplexing problem.


touala commented Jan 8, 2024

Thanks for the response @mbhall88. I'm currently doing the demultiplexing with the ONT model and then redoing the basecalling with my custom model. Not great, but it seems OK. I'll revisit this soon, as I need to update my whole workflow; hopefully things have improved since I last tried.
