Inconsistencies between genbank and gff files generated and conventions #6
Comments
Hi Simón,
Thanks very much for your feedback. I’m sorry that you’re running into problems, especially as the issues have mostly been fixed in the more recent development of Chloë which you don’t have access to.
We’re running our test server on the Pawsey Supercomputer Centre infrastructure here: https://chloe.pagekite.me/
Can you access that? I’m not sure whether you need an account on the virtual instance for you to be able to use it.
I will definitely make the latest version available to you before the end of the month one way or another, both the web server and the command-line code.
I apologise that it’s taking longer than I’d hoped; I’m juggling too many projects at the moment.
If you have found any genomes that you think Chloe is doing a poor job of annotating, or any other systematic issues, then of course I’d be very interested in hearing about them.
Cheers
Ian
P.S. Chloë’s internal annotation format (which you can see in the .sff output) doesn’t include the stop codon in the CDS range, but the .gff output should (and I believe does in the current version). The .sff output is the most faithful representation of what Chloë has found; all the other formats are generated from it. In the version you’re using, the .gff and .gb files are generated by the web interface completely outside the Chloë code, hence some ‘bugs’. In the latest version the .gff output is generated within Chloë and is thus more accurate. The longer-term intention is to move to an abstract internal representation of the annotation that can be converted to any format supported in the BioJulia framework (which currently includes at least GFF3 and GenBank).
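As an illustration of the coordinate difference described in the P.S., here is a minimal Python sketch of how a CDS range that excludes the stop codon could be widened by three nucleotides. It is not Chloë's code: the function name `extend_cds_to_stop`, the 1-based inclusive `(start, end)` exon tuples, and the assumption that the stop codon immediately follows the last coding base on a linear coordinate system (no wrap-around, no trans-splicing) are all hypothetical.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # 1-based, inclusive (start, end); an assumed convention


def extend_cds_to_stop(exons: List[Exon], strand: str, seq_len: int) -> List[Exon]:
    """Return the exons with the 3'-terminal exon widened by 3 nt.

    Assumes the stop codon directly follows the annotated CDS and that the
    feature does not wrap around the origin of a circular genome.
    """
    if not exons:
        raise ValueError("a CDS needs at least one exon")
    ordered = sorted(exons)  # ascending genomic order
    if strand == "+":
        # On the forward strand the 3' end is the largest coordinate.
        start, end = ordered[-1]
        if end + 3 > seq_len:
            raise ValueError("stop codon would run past the end of the sequence")
        ordered[-1] = (start, end + 3)
    elif strand == "-":
        # On the reverse strand the 3' end is the smallest coordinate.
        start, end = ordered[0]
        if start - 3 < 1:
            raise ValueError("stop codon would run past the start of the sequence")
        ordered[0] = (start - 3, end)
    else:
        raise ValueError("strand must be '+' or '-'")
    return ordered
```

Whether a converter should apply such a shift depends on which convention the target format expects, so this is only a sketch of the arithmetic involved.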
From: Simón Villanueva Corrales ***@***.***>
Date: Wednesday, 1 December 2021 at 12:37 am
To: ian-small/chloe ***@***.***>
Cc: Subscribed ***@***.***>
Subject: [ian-small/chloe] Inconsistencies between genbank and gff files generated and conventions (Issue #6)
Dear Ian,
I have detected a couple of inconsistencies when using the Chloe web portal. Specifically, when I upload the chloroplast genome of Arabidopsis thaliana <https://github.com/ian-small/chloe/files/7627080/at_cp.fasta.txt>, I get as a result the following outputs: the genbank <https://github.com/ian-small/chloe/files/7627095/NC_000932.1.gbff.txt> file and the gff3 <https://github.com/ian-small/chloe/files/7627102/NC_000932.1.gff3.txt> file.
I have been assessing some annotators and expected chloe to work the best (given the results in Zhong's Thesis), but those expectations have not been met so far, which I found puzzling. Recently, I noticed that the gff and the genbank outputs are not consistent with each other. I have been using the genbank file for convenience, but the gff might be a better representation of chloe's output.
Specifically, if you look at multiexonic genes located in the IRs in the genbank file, such as ndhB, you will notice that the CDS is annotated by taking one exon from IRA and another from IRB; the other "copy" of the gene is annotated in a similar way, but with the remaining exons of each IR. This happens with the proteins in the IRs, but also with the multiexonic tRNAs and rRNAs.
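One way to spot this kind of mixed-IR annotation is to flag multi-exon features whose exons lie implausibly far apart. The sketch below is only an illustration: the function names, the dictionary layout, the example coordinates, and the `max_span` threshold are hypothetical and not taken from the attached files. Note that a genuinely trans-spliced gene such as rps12 would also be flagged, so any hits need manual review.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # 1-based, inclusive (start, end); an assumed convention


def exon_span(exons: List[Exon]) -> int:
    """Genomic distance covered from the first exon start to the last exon end."""
    return max(end for _, end in exons) - min(start for start, _ in exons) + 1


def flag_wide_features(features: Dict[str, List[Exon]], max_span: int) -> List[str]:
    """Return names of multi-exon features spanning more than max_span bases."""
    return [
        name
        for name, exons in features.items()
        if len(exons) > 1 and exon_span(exons) > max_span
    ]


# Made-up coordinates: gene_a has two nearby exons, gene_b joins exons
# that sit far apart, as if taken from different repeat copies.
example = {
    "gene_a": [(1000, 1700), (2400, 3100)],
    "gene_b": [(1000, 1700), (150000, 150700)],
}
suspicious = flag_wide_features(example, max_span=10000)
```

With these made-up inputs only `gene_b` exceeds the threshold; a sensible value for `max_span` would have to be chosen per genome.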
I have not thoroughly assessed the gff3 file, but it does not seem to present these problems. Hence, I guess this annotation should be preferred. Still, I think the output of the annotation should be consistent between both formats.
Finally, although these are not inconsistencies, I would like to ask you another two questions:
* I noticed that chloe does not include the stop codon of the proteins in the CDS (this seems consistent in the genbank and the gff3). Why is that? The annotators that I have looked at so far and the references in RefSeq usually include them, so I think the convention is to include them. Also, some translation software (e.g. Biopython Seq.translate()) looks for the stop codon to check if the CDS is okay (that's how I noticed the feature, actually).
* The gff3 file is nicely formatted and includes the gene and mRNA features. Why are these not included in the Genbank file? Although you should be able to infer them from the CDS, it is also a convention for them to be included in the Genbank file.
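The stop-codon check mentioned in the first question can be sketched in plain Python as follows. This is an illustration, not Biopython's implementation: the helper names are hypothetical, coordinates are assumed to be 1-based and inclusive, and the stop codons are those of the standard and bacterial/plastid genetic codes (TAA, TAG, TGA). RNA editing and trans-splicing are ignored.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # 1-based, inclusive (start, end); an assumed convention

STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (unambiguous bases only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def cds_sequence(genome: str, exons: List[Exon], strand: str) -> str:
    """Join the exons in genomic order, then flip the result for '-' features."""
    joined = "".join(genome[start - 1:end] for start, end in sorted(exons))
    return reverse_complement(joined) if strand == "-" else joined


def ends_with_stop(cds: str) -> bool:
    """True if the CDS is a whole number of codons and ends in a stop codon."""
    cds = cds.upper()
    return len(cds) >= 3 and len(cds) % 3 == 0 and cds[-3:] in STOP_CODONS


# Toy sequence, not real data: the same ORF annotated with and without its stop.
genome = "CCATGAAATAAGG"
with_stop = ends_with_stop(cds_sequence(genome, [(3, 11)], "+"))
without_stop = ends_with_stop(cds_sequence(genome, [(3, 8)], "+"))
```

Running a check like this over both the GenBank and GFF3 coordinates for the same genome would show which of the two outputs follows the stop-inclusive convention.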
Thanks again for developing chloe
Cheers
Dear Ian,
Thank you for your response. It is great to know that most of the issues are solved in the newest version. I understand the delays, as that is quite common in our jobs. When I follow the link you provided it goes to a "Not found" page, so I guess you are right that one needs a specific account to access it. If I find other weirdness I will let you know. I look forward to hearing about any
Cheers
Hi @ian-small, I wonder if there have been updates on this front. Cheers,
It seems that the gff generated by https://chloe.plastid.org/ now has the stop codon included. Genbank files still do not have the stop codon included.