Output DNA sequence does not code for correct input protein #57

annb-lab · 2023-09-27T22:15:39Z

Output DNA sequence does not always translate to the correct input protein to be optimized.

Reproducible steps:

Click Protein

Select E. coli, and select Yeast

Paste sequence:
MGSSHHHHHHSSGLVPRGSHMGSMAAPSDGFKPRERSGGEQAQDWDALPPKRPRLGAGNKIGGRRLIVVLEGASLETVKVGKTYELLNCDKHKSILLKNGRDPGEARPDITHQSLLMLMDSPLNRAGLLQVYIHTQKNVLIEVNPQTRIPRTFDRFCGLMVQLLHKLSVRAADGPQKLLKVIKNPVSDHFPVGCMKVGTSFSIPVVSDVRELVPSSDPIVFVVGAFAHGKVSVEYTEKMVSISNYPLSAALTCAKLTTAFEEVWGVI

Optimize!

The optimized sequence SD is incorrect:

ATGGGTTCTAGTTCTAGTCATCATCATCATCATCATTCTAGTTCTAGTGGATTAGTTCCAAGGGGTAGTCATATGGGTTCTAGTATGGCAGCACCATCTAGTGATGGTTTTAAACCAAGGGAAAGGTCTAGTGGAGGTGAACAGGCACAGGATTGGGATGCATTACCACCAAAAAGGCCAAGGTTAGGAGCAGGAAATAAAATTGGTGGTAGGAGGTTAATTGTTGTTTTAGAAGGAGCATCTAGTTTAGAAACTGTTAAAGTTGGTAAAACTTATGAATTATTAAATTGTGATAAACATAAATCTAGTATTTTATTAAAAAATGGTAGGGATCCAGGAGAAGCAAGGCCAGATATTACTCATCAGTCTAGTTTATTAATGTTAATGGATTCTAGTCCATTAAATAGGGCAGGTTTATTACAGGTTTATATTCATACACAGAAAAATGTTTTAATTGAAGTTAATCCACAAACTAGGATTCCAAGGACTTTTGATAGGTTTTGTGGTTTAATGGTTCAATTATTACATAAATTATCTAGTGTTAGGGCAGCAGATGGTCCACAGAAATTATTAAAAGTTATTAAAAATCCAGTTTCTAGTGATCATTTTCCAGTTGGTTGTATGAAAGTTGGAACTTCTAGTTTTTCTAGTATTCCAGTTGTTTCTAGTGATGTTAGGGAATTAGTTCCATCTAGTTCTAGTGATCCAATTGTTTTCGTTGTTGGTGCATTTGCACATGGTAAAGTTTCTAGTGTTGAATATACTGAAAAAATGGTTTCTAGTATTTCTAGTAATTATCCATTATCTAGTGCAGCATTAACTTGTGCAAAATTAACTACTGCATTTGAAGAAGTTTGGGGTGTTATTTAA

Which translates to:

MGSSSSHHHHHHSSSSGLVPRGSHMGSSMAAPSSDGFKPRERSSGGEQAQDWDALPPKRPRLGAGNKIGGRRLIVVLEGASSLETVKVGKTYELLNCDKHKSSILLKNGRDPGEARPDITHQSSLLMLMDSSPLNRAGLLQVYIHTQKNVLIEVNPQTRIPRTFDRFCGLMVQLLHKLSSVRAADGPQKLLKVIKNPVSSDHFPVGCMKVGTSSFSSIPVVSSDVRELVPSSSSDPIVFVVGAFAHGKVSSVEYTEKMVSSISSNYPLSSAALTCAKLTTAFEEVWGVI

Hopefully this helps you debug the program!

A few notes. Changing the 'weight' to 2 for each species yields the correct sequence. Choosing E. coli and Human yields the correct sequence.

nleroy917 · 2023-09-29T21:32:22Z

Hi @annb-lab thanks for opening this. I've been playing around with things trying to figure this one out. Quick question: you said

The optimized sequence SD is incorrect

Can you verify this for me? I am finding that the AD sequence is the incorrect one. I'm also finding that setting both of the weights to 2 does not solve the problem. Thanks!

annb-lab · 2023-09-30T00:07:08Z

Both sequences are wrong for me.

AD: ATGGGATCTAGTTCTAGTCATCATCATCATCATCATTCTAGTTCTAGTGGTTTAGTTCCAAGGGGTTCTAGTCATATGGGTTCTAGTATGGCAGCACCATCTAGTGATGGTTTTAAACCAAGGGAAAGGTCTAGTGGTGGTGAACAGGCACAGGATTGGGATGCATTACCACCAAAAAGGCCAAGGTTAGGTGCAGGTAATAAAATTGGTGGTAGGAGGTTAATTGTTGTTTTAGAAGGTGCATCTAGTTTAGAAACTGTTAAAGTTGGAAAAACTTATGAATTATTAAATTGTGATAAACATAAATCTAGTATTTTATTAAAAAATGGTAGGGATCCAGGAGAAGCAAGGCCAGATATTACTCATCAGTCTAGTTTATTAATGTTAATGGATTCTAGTCCATTAAATAGGGCAGGTTTATTACAGGTTTATATTCATACTCAGAAAAATGTTTTAATTGAAGTTAATCCACAGACTAGGATTCCAAGGACTTTTGATAGGTTTTGTGGTTTAATGGTTCAGTTATTACATAAATTATCTAGTGTTAGGGCAGCAGATGGACCACAGAAATTATTAAAAGTTATTAAAAATCCAGTTTCTAGTGATCATTTTCCAGTTGGTTGTATGAAAGTTGGTACTTCTAGTTTTTCTAGTATTCCAGTTGTTTCTAGTGATGTTAGGGAATTAGTTCCATCTAGTTCTAGTGATCCAATTGTTTTTGTTGTTGGAGCATTCGCACATGGTAAAGTTTCTAGTGTTGAATATACTGAAAAAATGGTTTCTAGTATTTCTAGTAATTATCCATTATCTAGTGCAGCATTAACTTGTGCAAAATTAACTACTGCATTTGAAGAAGTTTGGGGTGTTATTTAA

SD:
ATGGGATCTAGTTCTAGTCATCATCATCATCATCATTCTAGTTCTAGTGGATTAGTTCCAAGGGGTTCTAGTCATATGGGATCTAGTATGGCAGCACCATCTAGTGATGGTTTTAAACCAAGGGAAAGGTCTAGTGGAGGTGAACAGGCACAGGATTGGGATGCATTACCACCAAAAAGGCCAAGGTTAGGTGCAGGTAATAAAATTGGAGGTAGGAGGTTAATTGTTGTTTTAGAAGGTGCATCTAGTTTAGAAACTGTTAAAGTTGGTAAAACTTATGAATTATTAAATTGTGATAAACATAAATCTAGTATTTTATTAAAAAATGGAAGGGATCCAGGTGAAGCAAGGCCAGATATTACTCATCAGTCTAGTTTATTAATGTTAATGGATTCTAGTCCATTAAATAGGGCAGGTTTATTACAGGTTTATATTCATACTCAGAAAAATGTTTTAATTGAAGTTAATCCACAGACTAGGATTCCAAGGACTTTTGATAGGTTTTGTGGTTTAATGGTTCAGTTATTACATAAATTATCTAGTGTTAGGGCAGCAGATGGTCCACAGAAATTATTAAAAGTTATTAAAAATCCAGTTTCTAGTGATCATTTTCCAGTTGGTTGTATGAAAGTTGGTACTTCTAGTTTTTCTAGTATTCCAGTTGTTTCTAGTGATGTTAGGGAATTAGTTCCATCTAGTTCTAGTGATCCAATTGTTTTCGTTGTTGGAGCATTTGCACATGGAAAAGTTTCTAGTGTTGAATATACTGAAAAAATGGTTTCTAGTATTTCTAGTAATTATCCATTATCTAGTGCAGCATTAACTTGTGCAAAATTAACTACTGCATTCGAAGAAGTTTGGGGAGTTATTTAA

I found a reduced example:
Select E. coli and Yeast as before
Use S as the sequence
Both AD and SD give: TCTAGTTAA

And yes, setting both weights to 2 in the reduced example also doesn't fix the problem. I may have made a mistake in that earlier debugging; I was playing with a lot of things.
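The reduced example can be checked with a short script. This is a minimal sketch assuming the standard genetic code; `translate` is a hypothetical helper written for this check and is not part of the tool.

```python
from itertools import product

# Standard genetic code, laid out in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA string codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


input_protein = "S"
output_dna = "TCTAGTTAA"  # AD and SD output reported for the reduced example

translated = translate(output_dna)
print(translated, translated == input_protein)
```

For the reduced example this shows two serine codons (TCT and AGT) followed by a stop codon, so the output encodes two residues where the input had one.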

malott3 · 2024-07-01T21:09:57Z

I am using the web interface, optimizing for both E. coli and Yeast, equally weighted at 1. I am getting the same issue, with the S sometimes being doubled. Both the AD and SD sequences translate incorrectly; however, the protein sequence is correct. The issue repeats itself with different sequences and occurs whether starting from a protein or a DNA input sequence.

Thanks so much for your effort in putting this together. This is such a useful and powerful tool.

Have a great day!

nleroy917 · 2024-07-02T13:10:03Z

@malott3

Hello! Thank you for this information! I'm working hard to wrap up a project and hope to focus my efforts on this very soon. It can be tough since the code is >5 years old and was only half-written by me 😭

Promise to keep looking into it, and I appreciate any information you can provide!

malott3 · 2024-07-02T21:32:58Z

@nleroy917

Thanks again! I wish you good luck on your project! Yeah, it sounds difficult.

Not sure if this is helpful information or not. Yesterday I tried clearing the cache and restarting Microsoft Edge, to no avail. I tried again today and it seems to be working correctly. I cannot see a difference. I will let you know if it happens again.

Have a great afternoon,

Thomas

malott3 · 2024-07-02T22:32:06Z

Just an update as promised. I know this is not terribly helpful, but it seems time dependent. It was working great, and then a sequence had all of the S's double. I waited 15 minutes, and it worked just fine. I did several other sequences just fine. Now a half hour later it is doubling every S again for every sequence. I still do not see a correlation to anything useful. Sorry about that. For now, I will just time it or manually remove the extra S's. Still a great tool. Thanks!
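Until the cause is found, the extra residues can at least be located automatically instead of by eye. The sketch below uses Python's standard `difflib` to compare the intended protein with the translation of the tool's output; `find_differences` is a hypothetical helper, and the two strings are the first residues of the input and translated sequences from the original report.

```python
from difflib import SequenceMatcher


def find_differences(expected: str, observed: str):
    """Return (tag, position_in_expected, expected_part, observed_part)
    for every stretch where the two protein sequences disagree."""
    # autojunk=False keeps difflib from ignoring frequent residues
    # in sequences of 200 or more characters.
    matcher = SequenceMatcher(None, expected, observed, autojunk=False)
    return [
        (tag, i1, expected[i1:i2], observed[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


expected = "MGSSHHHHHH"    # start of the input protein
observed = "MGSSSSHHHHHH"  # start of the translated output

for diff in find_differences(expected, observed):
    print(diff)
```

An `insert` entry whose observed part consists only of `S` would correspond to the doubling described above; any other kind of entry would point to a different problem.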

Enjoy your day!

malott3 · 2024-07-08T17:15:06Z

I have another observation that might help. The tool does not seem to optimize properly when optimizing for both Yeast and E. coli. I noticed that some amino acids were always using the same codon, even when there was no obvious reason. I pulled the S. cerevisiae and E. coli codon tables and did a weighted average, zeroing everything <10%. The attached image shows a comparison of the codon distribution for 15 optimized genes (a total of 5,270 codons) using the web tool. As you can see, the distribution is not even.
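The weighted-average calculation described above can be sketched as follows. This is only an illustration under stated assumptions: the threshold is applied after averaging, the surviving codons are renormalized to sum to 1, and the two serine tables hold made-up placeholder fractions rather than real S. cerevisiae or E. coli values. `combine_codon_tables` is a hypothetical name, not a function from the tool.

```python
def combine_codon_tables(tables, weights, threshold=0.10):
    """Weighted average of per-amino-acid codon fractions across species.

    tables  : list of {codon: fraction} dicts for ONE amino acid
    weights : one weight per table
    Codons whose averaged fraction falls below `threshold` are set to 0,
    and the remaining fractions are renormalized (an assumption).
    """
    total_weight = sum(weights)
    codons = sorted(set().union(*tables))
    averaged = {
        c: sum(w * t.get(c, 0.0) for t, w in zip(tables, weights)) / total_weight
        for c in codons
    }
    kept = {c: (f if f >= threshold else 0.0) for c, f in averaged.items()}
    norm = sum(kept.values())
    return {c: (f / norm if norm else 0.0) for c, f in kept.items()}


# Placeholder serine fractions for two hypothetical species (NOT real data).
species_a = {"TCT": 0.30, "TCC": 0.10, "TCA": 0.05, "TCG": 0.05, "AGT": 0.30, "AGC": 0.20}
species_b = {"TCT": 0.20, "TCC": 0.20, "TCA": 0.10, "TCG": 0.10, "AGT": 0.30, "AGC": 0.10}

combined = combine_codon_tables([species_a, species_b], weights=[1, 1])
for codon, fraction in combined.items():
    print(codon, round(fraction, 3))
```

With real codon tables substituted for the placeholders, the resulting fractions give the expected distribution to compare against what the web tool actually emits.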

Thanks again for this tool. Not trying to critique, just offering feedback to help identify the possible error.

Have a great day,

Thomas

nleroy917 · 2024-07-09T14:45:02Z

Super interesting! Thanks so much for this data.

A little lore/background: the original algorithm was written back in 2019 by my colleague Caleigh. She did an amazing job documenting it all. However, that implementation has remained largely untouched for 5 years... I've gone through it all myself, and there isn't much that appears incorrect. But it is 5-year-old Python without type annotations, and it can be quite dense at times (three-layer nested for loops everywhere).

I'm currently trying to go through our paper and reimplement the algorithm in Rust; slowly but surely. The new Rust implementation should be safer, faster, and enable the optimization to be done in-browser.

Anyways, the fact that there is such an enrichment for a specific codon for amino acids like arginine or isoleucine makes me think that the table is being calculated incorrectly.
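One way to quantify that enrichment is to tally which codons the output uses for each amino acid. The following is a rough sketch assuming the standard genetic code; `codon_usage` is a hypothetical helper, and the example DNA is the first 12 codons of the AD sequence posted earlier in this thread.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, laid out in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_usage(dna: str):
    """Return {amino_acid: {codon: fraction}} observed in a coding sequence."""
    counts = defaultdict(Counter)
    for i in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[i:i + 3].upper()
        counts[CODON_TABLE[codon]][codon] += 1
    return {
        aa: {codon: n / sum(c.values()) for codon, n in c.items()}
        for aa, c in counts.items()
    }


# First 12 codons of the AD output posted above.
ad_prefix = "ATGGGATCTAGTTCTAGTCATCATCATCATCATCAT"

for aa, fractions in codon_usage(ad_prefix).items():
    print(aa, fractions)
```

Run over many optimized genes, the per-amino-acid fractions can be set against the combined codon table to see which amino acids collapse onto a single codon.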

nleroy917 mentioned this issue May 28, 2024

The API add Unexpected "S" in sequence #64

Open
