Wiki on experimental strategies for training models with extended alphabet #242

Open
Kirk3gaard opened this issue Mar 18, 2022 · 2 comments

Comments

@Kirk3gaard

Hi

It would be great to have a wiki section covering the entire approach to training new models, including the wet-lab part. For example, what would be the best experimental design for training models that predict the incorporation of alternative nucleotides?

  1. Sequence PCR products with one nucleotide fully substituted
  2. Sequence PCR products with "normal" nucleotides
  3. Sequence PCR products with a mix of normal and substituted?

Best regards
Rasmus

@mauriciolp

Hey Rasmus,

I am currently facing the same questions.
It would be useful if ONT could give some tips on this.
For the wet-lab part, though, you will probably need to refer to what has been published, for example:

  • Kimoto, Michiko, Si Hui Gabriella Soh, and Ichiro Hirao. 2020. “Sanger Gap Sequencing for Genetic Alphabet Expansion of DNA.” Chembiochem: A European Journal of Chemical Biology 21 (16): 2287–96.
  • Yamashige, Rie, Michiko Kimoto, Yusuke Takezawa, Akira Sato, Tsuneo Mitsui, Shigeyuki Yokoyama, and Ichiro Hirao. 2012. “Highly Specific Unnatural Base Pair Systems as a Third Base Pair for PCR Amplification.” Nucleic Acids Research 40 (6): 2793–2806.

In my case, I have sequencing data from PCR of a DNA sample with an extended alphabet, and I am tweaking Bonito to train on this data.
I found that some adjustments to the code were necessary to make it work.
Hopefully I can share more once this work progresses.

@mauriciolp

It took me some time, but I have just uploaded a paper about this to bioRxiv and created a repository for it here.

In our work, we show how to achieve high-throughput sequencing of DNA containing Unnatural Bases (UBs), a.k.a. Non-Canonical Bases (NCBs), using Nanopore and de novo basecalling enabled by spliced-based data-augmentation. The code here contains a basecaller architecture modified to also basecall 1 or 2 additional UBs, and includes real-time data-augmentation that generates training data with UBs in all possible sequencing contexts.
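
To give a rough idea of what splicing a UB into canonical training data can look like, here is a minimal Python sketch. It is not the code from the repository: the function names (`encode`, `splice_ub`, `random_splice`), the `"X"` symbol for the UB, the label layout with index 0 reserved for a blank, and the assumption that per-base signal boundaries are already available are all hypothetical and chosen for illustration only.

```python
import random

# Hypothetical label alphabet: index 0 is reserved for a blank/padding symbol,
# and "X" stands in for a single unnatural base (both are assumptions).
ALPHABET = "NACGTX"


def encode(seq):
    """Map a base string to integer labels using ALPHABET."""
    return [ALPHABET.index(base) for base in seq]


def splice_ub(signal, bounds, seq, ub_signal, ub_seq, pos):
    """Insert a UB-containing signal chunk in front of base number `pos`.

    signal    : list of raw current samples from a canonical read
    bounds    : bounds[i] is the first sample of base i, and
                bounds[len(seq)] == len(signal) (assumed to be known)
    ub_signal : list of samples taken from a read that contains the UB
    ub_seq    : the bases covered by ub_signal, e.g. "X"
    """
    cut = bounds[pos]
    new_signal = signal[:cut] + ub_signal + signal[cut:]
    new_seq = seq[:pos] + ub_seq + seq[pos:]
    return new_signal, new_seq


def random_splice(signal, bounds, seq, ub_signal, ub_seq, rng=random):
    """Pick a random insertion point so the UB is seen in varied contexts."""
    pos = rng.randrange(len(seq) + 1)
    return splice_ub(signal, bounds, seq, ub_signal, ub_seq, pos)


if __name__ == "__main__":
    # Toy read: 4 bases, 3 samples per base.
    signal = list(range(12))
    bounds = [0, 3, 6, 9, 12]
    new_signal, new_seq = random_splice(signal, bounds, "ACGT", [-1] * 4, "X")
    print(new_seq, encode(new_seq), len(new_signal))
```

Because the nanopore current at a given position depends on several neighbouring bases, a realistic version would probably splice chunks that include some flanking context around the UB rather than the UB alone. The sketch only shows the bookkeeping: the signal and the label sequence have to be edited together so that they stay consistent.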
