Output DNA sequence does not code for correct input protein #57

Open
annb-lab opened this issue Sep 27, 2023 · 8 comments

@annb-lab

The output DNA sequence does not always translate back to the input protein that was submitted for optimization.

Reproducible steps:

Click Protein

Select E. coli, and select Yeast

Paste sequence:
MGSSHHHHHHSSGLVPRGSHMGSMAAPSDGFKPRERSGGEQAQDWDALPPKRPRLGAGNKIGGRRLIVVLEGASLETVKVGKTYELLNCDKHKSILLKNGRDPGEARPDITHQSLLMLMDSPLNRAGLLQVYIHTQKNVLIEVNPQTRIPRTFDRFCGLMVQLLHKLSVRAADGPQKLLKVIKNPVSDHFPVGCMKVGTSFSIPVVSDVRELVPSSDPIVFVVGAFAHGKVSVEYTEKMVSISNYPLSAALTCAKLTTAFEEVWGVI

Optimize!

The optimized sequence SD is incorrect:

ATGGGTTCTAGTTCTAGTCATCATCATCATCATCATTCTAGTTCTAGTGGATTAGTTCCAAGGGGTAGTCATATGGGTTCTAGTATGGCAGCACCATCTAGTGATGGTTTTAAACCAAGGGAAAGGTCTAGTGGAGGTGAACAGGCACAGGATTGGGATGCATTACCACCAAAAAGGCCAAGGTTAGGAGCAGGAAATAAAATTGGTGGTAGGAGGTTAATTGTTGTTTTAGAAGGAGCATCTAGTTTAGAAACTGTTAAAGTTGGTAAAACTTATGAATTATTAAATTGTGATAAACATAAATCTAGTATTTTATTAAAAAATGGTAGGGATCCAGGAGAAGCAAGGCCAGATATTACTCATCAGTCTAGTTTATTAATGTTAATGGATTCTAGTCCATTAAATAGGGCAGGTTTATTACAGGTTTATATTCATACACAGAAAAATGTTTTAATTGAAGTTAATCCACAAACTAGGATTCCAAGGACTTTTGATAGGTTTTGTGGTTTAATGGTTCAATTATTACATAAATTATCTAGTGTTAGGGCAGCAGATGGTCCACAGAAATTATTAAAAGTTATTAAAAATCCAGTTTCTAGTGATCATTTTCCAGTTGGTTGTATGAAAGTTGGAACTTCTAGTTTTTCTAGTATTCCAGTTGTTTCTAGTGATGTTAGGGAATTAGTTCCATCTAGTTCTAGTGATCCAATTGTTTTCGTTGTTGGTGCATTTGCACATGGTAAAGTTTCTAGTGTTGAATATACTGAAAAAATGGTTTCTAGTATTTCTAGTAATTATCCATTATCTAGTGCAGCATTAACTTGTGCAAAATTAACTACTGCATTTGAAGAAGTTTGGGGTGTTATTTAA

Which translates to:

MGSSSSHHHHHHSSSSGLVPRGSHMGSSMAAPSSDGFKPRERSSGGEQAQDWDALPPKRPRLGAGNKIGGRRLIVVLEGASSLETVKVGKTYELLNCDKHKSSILLKNGRDPGEARPDITHQSSLLMLMDSSPLNRAGLLQVYIHTQKNVLIEVNPQTRIPRTFDRFCGLMVQLLHKLSSVRAADGPQKLLKVIKNPVSSDHFPVGCMKVGTSSFSSIPVVSSDVRELVPSSSSDPIVFVVGAFAHGKVSSVEYTEKMVSSISSNYPLSSAALTCAKLTTAFEEVWGVI

Hopefully this helps you debug the program!

A few notes. Changing the 'weight' to 2 for each species yields the correct sequence. Choosing E. coli and Human yields the correct sequence.
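For anyone who wants to reproduce the check outside the web tool, here is a minimal sketch of the round trip: translate the optimized DNA back to protein and list which residues are surplus. It assumes the standard genetic code; `translate` and `extra_residues` are hypothetical helper names, not functions from this project.

```python
from collections import Counter

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate a DNA string codon by codon, stopping at the first stop codon."""
    dna = dna.strip().upper()
    if len(dna) % 3 != 0:
        raise ValueError("DNA length is not a multiple of 3")
    protein = []
    for i in range(0, len(dna), 3):
        residue = CODON_TABLE[dna[i:i + 3]]
        if residue == "*":  # stop codon ends translation
            break
        protein.append(residue)
    return "".join(protein)


def extra_residues(expected: str, observed: str) -> Counter:
    """Count residues that occur more often in `observed` than in `expected`."""
    return Counter(observed) - Counter(expected)
```

Calling `extra_residues(input_protein, translate(optimized_dna))` on the sequences above should show whether serine is the only residue being duplicated.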

@nleroy917
Owner

nleroy917 commented Sep 29, 2023

Hi @annb-lab thanks for opening this. I've been playing around with things trying to figure this one out. Quick question: you said

The optimized sequence SD is incorrect

Can you verify this for me? I am finding that the AD sequence is the incorrect one. I'm also finding that setting both of the weights to 2 does not solve the problem. Thanks!

@annb-lab
Author

annb-lab commented Sep 30, 2023

Both sequences are wrong for me.

AD: ATGGGATCTAGTTCTAGTCATCATCATCATCATCATTCTAGTTCTAGTGGTTTAGTTCCAAGGGGTTCTAGTCATATGGGTTCTAGTATGGCAGCACCATCTAGTGATGGTTTTAAACCAAGGGAAAGGTCTAGTGGTGGTGAACAGGCACAGGATTGGGATGCATTACCACCAAAAAGGCCAAGGTTAGGTGCAGGTAATAAAATTGGTGGTAGGAGGTTAATTGTTGTTTTAGAAGGTGCATCTAGTTTAGAAACTGTTAAAGTTGGAAAAACTTATGAATTATTAAATTGTGATAAACATAAATCTAGTATTTTATTAAAAAATGGTAGGGATCCAGGAGAAGCAAGGCCAGATATTACTCATCAGTCTAGTTTATTAATGTTAATGGATTCTAGTCCATTAAATAGGGCAGGTTTATTACAGGTTTATATTCATACTCAGAAAAATGTTTTAATTGAAGTTAATCCACAGACTAGGATTCCAAGGACTTTTGATAGGTTTTGTGGTTTAATGGTTCAGTTATTACATAAATTATCTAGTGTTAGGGCAGCAGATGGACCACAGAAATTATTAAAAGTTATTAAAAATCCAGTTTCTAGTGATCATTTTCCAGTTGGTTGTATGAAAGTTGGTACTTCTAGTTTTTCTAGTATTCCAGTTGTTTCTAGTGATGTTAGGGAATTAGTTCCATCTAGTTCTAGTGATCCAATTGTTTTTGTTGTTGGAGCATTCGCACATGGTAAAGTTTCTAGTGTTGAATATACTGAAAAAATGGTTTCTAGTATTTCTAGTAATTATCCATTATCTAGTGCAGCATTAACTTGTGCAAAATTAACTACTGCATTTGAAGAAGTTTGGGGTGTTATTTAA

SD:
ATGGGATCTAGTTCTAGTCATCATCATCATCATCATTCTAGTTCTAGTGGATTAGTTCCAAGGGGTTCTAGTCATATGGGATCTAGTATGGCAGCACCATCTAGTGATGGTTTTAAACCAAGGGAAAGGTCTAGTGGAGGTGAACAGGCACAGGATTGGGATGCATTACCACCAAAAAGGCCAAGGTTAGGTGCAGGTAATAAAATTGGAGGTAGGAGGTTAATTGTTGTTTTAGAAGGTGCATCTAGTTTAGAAACTGTTAAAGTTGGTAAAACTTATGAATTATTAAATTGTGATAAACATAAATCTAGTATTTTATTAAAAAATGGAAGGGATCCAGGTGAAGCAAGGCCAGATATTACTCATCAGTCTAGTTTATTAATGTTAATGGATTCTAGTCCATTAAATAGGGCAGGTTTATTACAGGTTTATATTCATACTCAGAAAAATGTTTTAATTGAAGTTAATCCACAGACTAGGATTCCAAGGACTTTTGATAGGTTTTGTGGTTTAATGGTTCAGTTATTACATAAATTATCTAGTGTTAGGGCAGCAGATGGTCCACAGAAATTATTAAAAGTTATTAAAAATCCAGTTTCTAGTGATCATTTTCCAGTTGGTTGTATGAAAGTTGGTACTTCTAGTTTTTCTAGTATTCCAGTTGTTTCTAGTGATGTTAGGGAATTAGTTCCATCTAGTTCTAGTGATCCAATTGTTTTCGTTGTTGGAGCATTTGCACATGGAAAAGTTTCTAGTGTTGAATATACTGAAAAAATGGTTTCTAGTATTTCTAGTAATTATCCATTATCTAGTGCAGCATTAACTTGTGCAAAATTAACTACTGCATTCGAAGAAGTTTGGGGAGTTATTTAA

I found a reduced example:
Select E. coli and Yeast as before
Use S as the sequence
Both AD and SD give: TCTAGTTAA

And yes, setting both weights to 2 in the reduced example doesn't fix the problem either. I may have made a mistake in that debugging step, since I was changing a lot of things at once.
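The reduced example can be checked without a codon table at all, just by counting codons. This is a small sketch that assumes the output ends with exactly one stop codon; `split_codons` and `surplus_codons` are hypothetical names used only for illustration.

```python
def split_codons(dna: str) -> list[str]:
    """Split a DNA string into consecutive three-base codons."""
    dna = dna.strip().upper()
    if len(dna) % 3 != 0:
        raise ValueError("DNA length is not a multiple of 3")
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]


def surplus_codons(protein: str, dna: str) -> int:
    """Number of codons beyond one per residue plus one stop codon (assumed)."""
    return len(split_codons(dna)) - (len(protein) + 1)


# The reduced example from this comment: input protein "S", output TCTAGTTAA.
print(split_codons("TCTAGTTAA"))
print(surplus_codons("S", "TCTAGTTAA"))
```

A non-zero surplus means the output encodes more residues than the input protein has.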

@malott3

malott3 commented Jul 1, 2024

I am using the web interface, optimizing for both E. coli and Yeast, equally weighted at 1. I am seeing the same issue, with the S sometimes being doubled. Both the AD and SD outputs translate incorrectly; however, the protein sequence is correct. The issue repeats with different sequences and occurs whether I start from a protein or a DNA input sequence.

Thanks so much for your effort in putting this together. This is such a useful and powerful tool.

Have a great day!

@nleroy917
Owner

@malott3

Hello! Thank you for this information! I'm working hard to wrap up a project and hope to focus my efforts on this very soon. It can be tough since the code is >5 years old and was only half-written by me 😭

Promise to keep looking into it, and I appreciate any information you can provide!

@malott3

malott3 commented Jul 2, 2024

@nleroy917

Thanks again! I wish you good luck on your project! Yeah, it sounds difficult.

Not sure if this is helpful information or not. Yesterday I tried clearing the cache and restarting Microsoft Edge, to no avail. I tried again today and it seems to be working correctly. I cannot see what changed. I will let you know if the problem comes back.

Have a great afternoon,

Thomas

@malott3

malott3 commented Jul 2, 2024

Just an update as promised. I know this is not terribly helpful, but the problem seems time-dependent. It was working great, and then one sequence had all of its S's doubled. I waited 15 minutes, and it worked just fine. I ran several other sequences without problems. Now, half an hour later, it is doubling every S again for every sequence. I still do not see a correlation with anything useful. Sorry about that. For now, I will either wait until it works or manually remove the extra S's. Still a great tool. Thanks!

Enjoy your day!

@malott3

malott3 commented Jul 8, 2024

I have another observation that might help. The tool does not seem to optimize properly when optimizing for both Yeast and E. coli. I noticed that some amino acids were always using the same codon, even when there was no obvious reason. I pulled the S. cerevisiae and E. coli codon tables and did a weighted average, zeroing everything below 10%. The attached image shows a comparison of the codon distribution for 15 genes optimized with the web tool (5,270 codons in total). As you can see, the distribution is not even.

[Image: Codon distribution]
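The weighted average described above can be sketched as follows. This is one reading of the procedure, not the tool's actual algorithm: it assumes each table maps the synonymous codons of a single amino acid to usage fractions, that the 10% cutoff is applied after averaging, and that the surviving fractions are renormalized to sum to 1. The function name and the numbers in the example are placeholders, not real codon usage data.

```python
def combine_tables(tables, weights, threshold=0.10):
    """Weighted average of per-codon usage fractions for one amino acid.

    Assumption: averaged fractions below `threshold` are zeroed and the
    rest are renormalized so they sum to 1.
    """
    total_weight = sum(weights)
    codons = set().union(*tables)
    averaged = {
        codon: sum(w * t.get(codon, 0.0) for t, w in zip(tables, weights)) / total_weight
        for codon in codons
    }
    # Zero out rare codons, then renormalize what is left.
    kept = {codon: (f if f >= threshold else 0.0) for codon, f in averaged.items()}
    norm = sum(kept.values())
    if norm == 0:
        raise ValueError("every codon fell below the threshold")
    return {codon: f / norm for codon, f in kept.items()}


# Placeholder fractions for three serine codons (illustrative only).
species_a = {"TCT": 0.5, "AGT": 0.4, "TCC": 0.1}
species_b = {"TCT": 0.5, "AGT": 0.5, "TCC": 0.0}
combined = combine_tables([species_a, species_b], weights=[1, 1])
```

Comparing a table like `combined` against the observed codon counts from the tool's output is one way to quantify how far the observed distribution is from the expected one.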

Thanks again for this tool. Not trying to critique, just offering feedback to help identify the possible error.

Have a great day,

Thomas

@nleroy917
Owner

Super interesting! Thanks so much for this data.

A little lore/background: the original algorithm was written back in 2019 by my colleague Caleigh. She did an amazing job documenting it all. However, that implementation has remained largely untouched for 5 years... I've gone through it all myself, and there isn't much that appears incorrect. But it is 5-year-old Python code without type annotations, and it can be quite dense at times (three-layer nested for loops everywhere).

I'm currently trying to go through our paper and reimplement the algorithm in Rust; slowly but surely. The new Rust implementation should be safer, faster, and enable the optimization to be done in-browser.

Anyways, the fact that there is such an enrichment for a specific codon for amino acids like arginine or isoleucine makes me think that the table is being calculated incorrectly.
