Output DNA sequence does not code for correct input protein #57
Comments
Hi @annb-lab thanks for opening this. I've been playing around with things trying to figure this one out. Quick question: you said
Can you verify this for me? I am finding that the AD sequence is the incorrect one. I'm also finding that setting both of the weights to 2 does not solve the problem. Thanks!
Both sequences are wrong for me. AD: ATGGGATCTAGTTCTAGTCATCATCATCATCATCATTCTAGTTCTAGTGGTTTAGTTCCAAGGGGTTCTAGTCATATGGGTTCTAGTATGGCAGCACCATCTAGTGATGGTTTTAAACCAAGGGAAAGGTCTAGTGGTGGTGAACAGGCACAGGATTGGGATGCATTACCACCAAAAAGGCCAAGGTTAGGTGCAGGTAATAAAATTGGTGGTAGGAGGTTAATTGTTGTTTTAGAAGGTGCATCTAGTTTAGAAACTGTTAAAGTTGGAAAAACTTATGAATTATTAAATTGTGATAAACATAAATCTAGTATTTTATTAAAAAATGGTAGGGATCCAGGAGAAGCAAGGCCAGATATTACTCATCAGTCTAGTTTATTAATGTTAATGGATTCTAGTCCATTAAATAGGGCAGGTTTATTACAGGTTTATATTCATACTCAGAAAAATGTTTTAATTGAAGTTAATCCACAGACTAGGATTCCAAGGACTTTTGATAGGTTTTGTGGTTTAATGGTTCAGTTATTACATAAATTATCTAGTGTTAGGGCAGCAGATGGACCACAGAAATTATTAAAAGTTATTAAAAATCCAGTTTCTAGTGATCATTTTCCAGTTGGTTGTATGAAAGTTGGTACTTCTAGTTTTTCTAGTATTCCAGTTGTTTCTAGTGATGTTAGGGAATTAGTTCCATCTAGTTCTAGTGATCCAATTGTTTTTGTTGTTGGAGCATTCGCACATGGTAAAGTTTCTAGTGTTGAATATACTGAAAAAATGGTTTCTAGTATTTCTAGTAATTATCCATTATCTAGTGCAGCATTAACTTGTGCAAAATTAACTACTGCATTTGAAGAAGTTTGGGGTGTTATTTAA SD: I found a reduced example: And yes, both putting the weights to 2 in the reduced example also doesn't fix the problem. I might have made a mistake on that debugging but I was playing with a lot of things. |
I am using the web interface, optimizing for both E. coli and Yeast, equally weighted at 1. I am getting the same issue, with the S sometimes being doubled. Both the AD and SD translate incorrectly; however, the protein sequence is correct. The issue repeats itself with different sequences and occurs whether starting from a protein or a DNA input sequence. Thanks so much for your effort in putting this together. This is such a useful and powerful tool. Have a great day!
Hello! Thank you for this information! I'm working hard to wrap up a project and hope to focus my efforts on this very soon. It can be tough since the code is >5 years old and was only half-written by me 😭 Promise to keep looking into it, and I appreciate any information you can provide!
Thanks again! I wish you good luck on your project! Yeah, it sounds difficult. Not sure if this is helpful information or not. Yesterday I tried clearing the cache and restarting Microsoft Edge, to no avail. I tried again today and it seems to be working correctly. I cannot see a difference. I will let you know if it happens again. Have a great afternoon, Thomas
Just an update as promised. I know this is not terribly helpful, but it seems time dependent. It was working great, and then a sequence had all of its S's doubled. I waited 15 minutes, and it worked just fine. I did several other sequences just fine. Now, half an hour later, it is doubling every S again for every sequence. I still do not see a correlation to anything useful. Sorry about that. For now, I will just time it or manually remove the extra S's. Still a great tool. Thanks! Enjoy your day!
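Until the bug is fixed, the manual cleanup mentioned above could be scripted. The following is only a minimal sketch, assuming the only defect is extra serine codons inserted next to a real serine; the function name `drop_extra_serines` is hypothetical and is not part of the tool. It walks the codons against the intended protein and drops any serine codon that does not line up with it.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def drop_extra_serines(expected: str, dna: str) -> str:
    """Return dna with surplus serine codons removed so it encodes expected.

    Hypothetical helper: assumes the only errors are extra serine codons
    placed directly after a serine that was kept.
    """
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("DNA length is not a multiple of 3")
    kept = []
    pos = 0  # index of the next residue we still need from `expected`
    for i in range(0, len(dna), 3):
        codon = dna[i:i + 3]
        aa = CODON_TABLE[codon]
        if pos < len(expected) and aa == expected[pos]:
            kept.append(codon)
            pos += 1
        elif aa == "S" and kept and CODON_TABLE[kept[-1]] == "S":
            continue  # surplus serine codon: skip it
        elif aa == "*" and pos == len(expected):
            kept.append(codon)  # keep the trailing stop codon
        else:
            raise ValueError(f"unexpected codon {codon} at nucleotide {i}")
    if pos != len(expected):
        raise ValueError("DNA ends before the expected protein is complete")
    return "".join(kept)


# First seven codons of the reported SD output plus a stop codon,
# checked against the start of the input protein.
fixed = drop_extra_serines("MGSSH", "ATGGGTTCTAGTTCTAGTCATTAA")
print(fixed)
```

Note that this keeps the first serine codon of each run, which is not necessarily the codon the optimizer would have chosen, so the result should be treated as a stopgap.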
I have another observation which might help. The tool does not seem to be optimizing properly when optimizing for both Yeast and E. coli. I noticed that some amino acids were always using the same codon even when there was no obvious reason. I pulled the S. cerevisiae and E. coli codon tables and did a weighted average, zeroing everything <10%. The attached image shows a comparison of the codon distribution for 15 optimized genes (a total of 5,270 codons) using the web tool. As you can see, the distribution is not even. Thanks again for this tool. Not trying to critique, just offering feedback to help identify the possible error. Have a great day, Thomas
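The weighted-average comparison described above can be sketched roughly as follows. This is a minimal sketch, not the tool's algorithm: the function name `combine_codon_tables`, the choice to apply the 10% cutoff after averaging, and the renormalization step are all assumptions, and the isoleucine fractions are made-up placeholders, not real codon usage values.

```python
def combine_codon_tables(tables, weights, threshold=0.10):
    """Weighted average of per-amino-acid codon fractions across species.

    tables:  list of {amino_acid: {codon: fraction}} dicts, one per species
    weights: one weight per table
    Codons whose averaged fraction falls below `threshold` are zeroed and the
    remaining fractions are renormalized to sum to 1 (an assumption).
    """
    total_weight = sum(weights)
    combined = {}
    for aa, codons in tables[0].items():
        averaged = {
            codon: sum(w * t[aa][codon] for t, w in zip(tables, weights)) / total_weight
            for codon in codons
        }
        kept = {c: (f if f >= threshold else 0.0) for c, f in averaged.items()}
        norm = sum(kept.values())
        combined[aa] = {c: f / norm for c, f in kept.items()} if norm else kept
    return combined


# Placeholder fractions for isoleucine only (NOT real usage data).
species_a = {"I": {"ATT": 0.50, "ATC": 0.45, "ATA": 0.05}}
species_b = {"I": {"ATT": 0.49, "ATC": 0.40, "ATA": 0.11}}

expected = combine_codon_tables([species_a, species_b], weights=[1, 1])
for codon, fraction in expected["I"].items():
    print(codon, round(fraction, 3))
```

Comparing a table like `expected` with the codon counts observed in the optimized genes is one way to quantify how far the output deviates from an even, weighted distribution.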
Super interesting! Thanks so much for this data. A little lore/background: the original algorithm was written back in 2019 by my colleague Caleigh. She did an amazing job documenting it all. However, that implementation has remained largely untouched for 5 years... I've gone through it all myself, and there isn't much that appears incorrect. But it is 5-year-old Python code without type annotations, and it can be quite dense at times (three-layer nested for loops everywhere). I'm currently trying to go through our paper and reimplement the algorithm in Rust; slowly but surely. The new Rust implementation should be safer, faster, and enable the optimization to be done in-browser. Anyway, the fact that there is such an enrichment for a specific codon for amino acids like arginine or isoleucine makes me think that the table is being calculated incorrectly.
Output DNA sequence does not always translate to the correct input protein to be optimized.
Steps to reproduce:
Click Protein
Select E. coli, and select Yeast
Paste sequence:
MGSSHHHHHHSSGLVPRGSHMGSMAAPSDGFKPRERSGGEQAQDWDALPPKRPRLGAGNKIGGRRLIVVLEGASLETVKVGKTYELLNCDKHKSILLKNGRDPGEARPDITHQSLLMLMDSPLNRAGLLQVYIHTQKNVLIEVNPQTRIPRTFDRFCGLMVQLLHKLSVRAADGPQKLLKVIKNPVSDHFPVGCMKVGTSFSIPVVSDVRELVPSSDPIVFVVGAFAHGKVSVEYTEKMVSISNYPLSAALTCAKLTTAFEEVWGVI
Optimize!
The optimized sequence SD is incorrect:
ATGGGTTCTAGTTCTAGTCATCATCATCATCATCATTCTAGTTCTAGTGGATTAGTTCCAAGGGGTAGTCATATGGGTTCTAGTATGGCAGCACCATCTAGTGATGGTTTTAAACCAAGGGAAAGGTCTAGTGGAGGTGAACAGGCACAGGATTGGGATGCATTACCACCAAAAAGGCCAAGGTTAGGAGCAGGAAATAAAATTGGTGGTAGGAGGTTAATTGTTGTTTTAGAAGGAGCATCTAGTTTAGAAACTGTTAAAGTTGGTAAAACTTATGAATTATTAAATTGTGATAAACATAAATCTAGTATTTTATTAAAAAATGGTAGGGATCCAGGAGAAGCAAGGCCAGATATTACTCATCAGTCTAGTTTATTAATGTTAATGGATTCTAGTCCATTAAATAGGGCAGGTTTATTACAGGTTTATATTCATACACAGAAAAATGTTTTAATTGAAGTTAATCCACAAACTAGGATTCCAAGGACTTTTGATAGGTTTTGTGGTTTAATGGTTCAATTATTACATAAATTATCTAGTGTTAGGGCAGCAGATGGTCCACAGAAATTATTAAAAGTTATTAAAAATCCAGTTTCTAGTGATCATTTTCCAGTTGGTTGTATGAAAGTTGGAACTTCTAGTTTTTCTAGTATTCCAGTTGTTTCTAGTGATGTTAGGGAATTAGTTCCATCTAGTTCTAGTGATCCAATTGTTTTCGTTGTTGGTGCATTTGCACATGGTAAAGTTTCTAGTGTTGAATATACTGAAAAAATGGTTTCTAGTATTTCTAGTAATTATCCATTATCTAGTGCAGCATTAACTTGTGCAAAATTAACTACTGCATTTGAAGAAGTTTGGGGTGTTATTTAA
Which translates to:
MGSSSSHHHHHHSSSSGLVPRGSHMGSSMAAPSSDGFKPRERSSGGEQAQDWDALPPKRPRLGAGNKIGGRRLIVVLEGASSLETVKVGKTYELLNCDKHKSSILLKNGRDPGEARPDITHQSSLLMLMDSSPLNRAGLLQVYIHTQKNVLIEVNPQTRIPRTFDRFCGLMVQLLHKLSSVRAADGPQKLLKVIKNPVSSDHFPVGCMKVGTSSFSSIPVVSSDVRELVPSSSSDPIVFVVGAFAHGKVSSVEYTEKMVSSISSNYPLSSAALTCAKLTTAFEEVWGVI
Hopefully this helps you debug the program!
A few notes. Changing the 'weight' to 2 for each species yields the correct sequence. Choosing E. coli and Human yields the correct sequence.
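A check like the one in this report can be automated by translating the optimized DNA and comparing it with the input protein. The snippet below is a minimal sketch using the standard genetic code; the helper names `translate` and `first_mismatch` are illustrative and are not taken from the tool's code base. The example input is the first seven codons of the SD output above, where the serine codons appear as TCT AGT TCT AGT.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA string codon by codon, dropping trailing stop codons."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("DNA length is not a multiple of 3")
    protein = "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))
    return protein.rstrip("*")


def first_mismatch(expected: str, dna: str):
    """Return the 0-based index of the first differing residue, or None."""
    observed = translate(dna)
    for i, (e, o) in enumerate(zip(expected, observed)):
        if e != o:
            return i
    if len(expected) != len(observed):
        return min(len(expected), len(observed))
    return None


snippet = "ATGGGTTCTAGTTCTAGTCAT"  # start of the SD sequence from this report
print(translate(snippet))                 # MGSSSSH
print(first_mismatch("MGSSH", snippet))   # 4
```

Running such a check on every optimizer output would flag the doubled-serine cases immediately, whichever species combination or weights were chosen.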