Inconsistencies between genbank and gff files generated and conventions #6

sivico26 · 2021-11-30T16:37:10Z

Dear Ian,

I have detected a couple of inconsistencies when using the Chloe web portal. Specifically, when I upload the chloroplast genome of Arabidopsis thaliana, I get the following outputs: the genbank file and the gff3 file.

I have been assessing some annotators and expected chloe to work the best (given the results in Zhong's Thesis), but those expectations have not been met so far, which I found puzzling. Recently, I noticed that the gff and the genbank outputs are not consistent with each other. I have been using the genbank file for convenience, but the gff might be a better representation of chloe's output.

Specifically, if you look at multiexonic genes located in the IRs (inverted repeats) in the genbank file, such as ndhB, you will notice that each CDS is annotated by taking one exon from IRA and another from IRB; the other "copy" of the gene is annotated in a similar way, but with the remaining exons of each IR. This happens with the proteins in the IRs, but also with the multiexonic tRNAs and rRNAs.
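
A rough way to screen an annotation for this kind of mix-up is to look at how far apart the parts of one feature are. The sketch below is only a heuristic and is not something Chloë is known to do: it flags a multi-exon feature whose parts sit on different strands or whose overall span exceeds a cutoff. The function name, the 10 kb cutoff and the coordinates are all assumptions made up for illustration.

```python
from typing import List, Tuple

# (start, end, strand) with 1-based inclusive coordinates; strand is +1 or -1.
Exon = Tuple[int, int, int]


def looks_mixed(exons: List[Exon], max_span: int = 10_000) -> bool:
    """Return True if the exons are unlikely to belong to one gene copy."""
    strands = {strand for _, _, strand in exons}
    if len(strands) > 1:
        # Parts of one gene copy are normally on the same strand
        # (trans-spliced genes such as rps12 are a known exception).
        return True
    lowest = min(start for start, _, _ in exons)
    highest = max(end for _, end, _ in exons)
    return (highest - lowest + 1) > max_span


# Hypothetical coordinates, not taken from any real annotation.
exons_ok = [(95_000, 95_700, -1), (96_400, 97_200, -1)]
exons_mixed = [(95_000, 95_700, -1), (142_000, 142_800, 1)]

print(looks_mixed(exons_ok))     # False
print(looks_mixed(exons_mixed))  # True
```

The cutoff would need tuning per genome, and known trans-spliced genes would have to be whitelisted before trusting the result.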

I have not thoroughly assessed the gff3 file, but it does not seem to have these problems. Hence, I guess this annotation should be preferred. Still, I think the output of the annotation should be consistent between both formats.
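
The consistency between the two formats can be checked mechanically by comparing CDS intervals. The following is a minimal sketch using only the standard library; the helper names and the example records are assumptions, and the location parser is deliberately simplified (it ignores partial-location markers such as `<` and `>` and treats a leading `complement(` as applying to the whole location).

```python
import re
from typing import List, Set, Tuple

Interval = Tuple[int, int, str]  # (start, end, strand), 1-based inclusive


def gff3_cds(lines: List[str]) -> Set[Interval]:
    """Collect CDS intervals from GFF3 text lines."""
    found = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "CDS":
            found.add((int(cols[3]), int(cols[4]), cols[6]))
    return found


def genbank_cds(location: str) -> Set[Interval]:
    """Split a location such as complement(join(1..5,9..12)) into intervals."""
    strand = "-" if location.startswith("complement(") else "+"
    pairs = re.findall(r"(\d+)\.\.(\d+)", location)
    return {(int(a), int(b), strand) for a, b in pairs}


# Made-up records: the GenBank join takes its second exon from elsewhere.
gff = [
    "##gff-version 3",
    "chr\texample\tCDS\t100\t249\t.\t-\t0\tID=cds1",
    "chr\texample\tCDS\t400\t519\t.\t-\t0\tID=cds1",
]
gb = genbank_cds("complement(join(100..249,5400..5519))")

print(sorted(gff3_cds(gff) - gb))  # [(400, 519, '-')]
print(sorted(gb - gff3_cds(gff)))  # [(5400, 5519, '-')]
```

Two empty differences would mean both files describe the same CDS parts.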

Finally, although these are not inconsistencies, I would like to ask you another two questions:

* I noticed that chloe does not include the stop codon of the proteins in the CDS (this seems consistent in the genbank and the gff3). Why is that? The annotators that I have looked at so far and the references in RefSeq usually include them, so I think the convention is to include them. Also, some translation software (e.g. Biopython Seq.translate()) looks for the stop codon to check whether the CDS is okay (that's how I noticed the feature, actually).
* The gff3 file is nicely formatted and includes the gene and mRNA features. Why are these not included in the Genbank file? Although you should be able to infer them from the CDS, it is also a convention for them to be included in the Genbank file.
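
Whether an annotated CDS includes its stop codon can be tested directly on the sequence. The sketch below is a standard-library illustration with made-up sequences; the function names are assumptions, and the stop codons are those of NCBI translation table 11. Biopython's `Seq.translate(cds=True)` performs a stricter check along the same lines and raises an error when the final codon is not a stop.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stops in NCBI translation table 11
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def stop_codon_status(genome: str, start: int, end: int, strand: str) -> str:
    """Classify a CDS given 1-based inclusive coordinates.

    "included": the last codon of the CDS is a stop codon.
    "excluded": the codon immediately downstream is a stop codon.
    "missing":  neither of the above.
    """
    if strand == "+":
        last = genome[end - 3:end]
        downstream = genome[end:end + 3]
    else:
        # On the minus strand the 3' end of the CDS is at `start`.
        last = revcomp(genome[start - 1:start + 2])
        downstream = revcomp(genome[max(start - 4, 0):start - 1])
    if last in STOP_CODONS:
        return "included"
    if downstream in STOP_CODONS:
        return "excluded"
    return "missing"


plus = "ATGGCCTAAGG"   # ATG GCC TAA on the forward strand
minus = "CCTTAGGCCAT"  # the same ORF on the reverse strand, at 3..11

print(stop_codon_status(plus, 1, 6, "+"))    # excluded
print(stop_codon_status(plus, 1, 9, "+"))    # included
print(stop_codon_status(minus, 6, 11, "-"))  # excluded
```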

Thanks again for developing chloe
Cheers

ian-small · 2021-12-09T09:34:53Z

Hi Simón,

Thanks very much for your feedback. I’m sorry that you’re running into problems, especially as the issues have mostly been fixed in the more recent development version of Chloë, which you don’t have access to. We’re running our test server on the Pawsey Supercomputer Centre infrastructure here: https://chloe.pagekite.me/ Can you access that? I’m not sure whether you need an account on the virtual instance to be able to use it. I will definitely make the latest version available to you before the end of the month one way or another, both the web server and the command-line code. I apologise that it’s taking longer than I’d hoped; I’m juggling too many projects at the moment.

If you have found any genomes that you think Chloe is doing a poor job of annotating, or any other systematic issues, then of course I’d be very interested in hearing about them.

Cheers,
Ian

P.S. Chloë’s internal annotation format (which you can see in the .sff output) doesn’t include the stop codon in the CDS range, but the .gff output should (and I believe does in the current version). The .sff output is the most faithful representation of what Chloë has found; all the other formats are generated from that. In the version you’re using, the .gff and .gb files are generated by the web interface completely outside the Chloë code, hence some ‘bugs’. In the latest version the .gff output is generated within Chloë and is thus more accurate. But the longer-term intention is to move to an abstract internal representation of the annotation that can be converted to any supported format in the BioJulia framework (which currently at least includes GFF3 and GenBank).
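
To illustrate why a single internal representation helps, here is a minimal Python sketch (not Chloë's Julia code, and not the .sff format) in which one record stores the CDS without its stop codon and both output formats are derived from it through the same stop-codon extension. The class, its fields, the `sketch` source label and the coordinates are all hypothetical; the GFF3 phase column is left as `.` for brevity, and circular or edge-of-sequence cases are ignored.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Cds:
    seqid: str
    name: str
    strand: str                   # "+" or "-"
    exons: List[Tuple[int, int]]  # 1-based inclusive, ascending, stop excluded

    def with_stop(self) -> List[Tuple[int, int]]:
        """Extend the 3' end of the coding region by one codon."""
        exons = list(self.exons)
        if self.strand == "+":
            start, end = exons[-1]
            exons[-1] = (start, end + 3)
        else:
            # On the minus strand the 3' end is the lowest coordinate.
            start, end = exons[0]
            exons[0] = (start - 3, end)
        return exons

    def to_gff3(self) -> List[str]:
        # A real writer must also compute the phase column for each part.
        return [
            f"{self.seqid}\tsketch\tCDS\t{s}\t{e}\t.\t{self.strand}\t.\tID={self.name}"
            for s, e in self.with_stop()
        ]

    def to_genbank_location(self) -> str:
        parts = ",".join(f"{s}..{e}" for s, e in self.with_stop())
        loc = f"join({parts})" if len(self.exons) > 1 else parts
        return f"complement({loc})" if self.strand == "-" else loc


gene = Cds("chr", "geneX", "-", [(100, 249), (400, 519)])
print(gene.to_genbank_location())  # complement(join(97..249,400..519))
for line in gene.to_gff3():
    print(line)
```

Because both writers call `with_stop()`, the two outputs cannot disagree about whether the stop codon is part of the CDS.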

sivico26 · 2021-12-10T14:53:48Z

Dear Ian,

Thank you for your response. It is great to know that most of the issues are solved in the newest version. I understand the delays as that is quite common in our jobs.

When I follow the link you provided, it goes to a "Not found" page, so I guess you are right that one needs a specific account to access it.

If I find any other weirdness, I will let you know. I look forward to hearing about any Chloe updates.

Cheers

sivico26 · 2022-02-24T20:51:51Z

Hi @ian-small,

I wonder if there have been updates on this front.

Cheers,
Simón

ian-small · 2022-02-25T06:10:55Z

Development of the command-line version has switched to the chloe_biojulia repo (private at the moment); the code is nearly ready to go public. The web version of this is public and running at https://chloe.plastid.org/annotate.html

We still need to fix a few issues with the GFF3 and GenBank output; the .sff files generated should be correct.

ink-blot · 2024-08-23T23:23:07Z

It seems that the gff generated by https://chloe.plastid.org/ now has the stop codon included. Genbank files still do not have the stop codon included.

ian-small added a commit that referenced this issue Feb 14, 2024

Merge pull request #6 from arabidopsis/main

0037c51

Syncing new web code
