Inconsistencies between the generated genbank and gff files, and conventions #6

Open
sivico26 opened this issue Nov 30, 2021 · 5 comments

@sivico26

Dear Ian,

I have detected a couple of inconsistencies when using the Chloe web portal. Specifically, when I upload the chloroplast genome of Arabidopsis thaliana, I get the following outputs: the genbank file and the gff3 file.

I have been assessing some annotators and expected chloe to work the best (given the results in Zhong's thesis), but those expectations have not been met so far, which I found puzzling. Recently, I noticed that the gff and genbank outputs are not consistent with each other. I have been using the genbank file for convenience, but the gff might be a better representation of chloe's output.

Specifically, if you look in the genbank file at multiexonic genes located in the IRs, such as ndhB, you will notice that the CDS is annotated by taking one exon from IRA and another from IRB; the other "copy" of the gene is annotated in a similar way, but with the remaining exons of each IR. This happens with the proteins in the IRs, but also with the multiexonic tRNAs and rRNAs.
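A minimal sketch of how this kind of mixed-copy annotation could be flagged automatically. The IR boundaries and exon coordinates below are made-up placeholders, not values taken from the Chloe output, and the helper names `region_of` and `spans_both_irs` are hypothetical.

```python
# Illustrative only: all coordinates here are invented placeholders,
# not values from the Chloe genbank or gff3 files.

def region_of(start, end, regions):
    """Return the name of the region that fully contains [start, end], or None."""
    for name, (r_start, r_end) in regions.items():
        if r_start <= start and end <= r_end:
            return name
    return None


def spans_both_irs(exons, regions):
    """True if the exons of a single feature are spread over both IR copies."""
    hit = {region_of(start, end, regions) for start, end in exons}
    return {"IRA", "IRB"} <= hit


# Hypothetical IR boundaries (1-based, inclusive).
regions = {"IRB": (84_001, 110_000), "IRA": (128_001, 154_000)}

mixed = [(95_000, 95_700), (140_000, 140_750)]  # one exon per IR copy
clean = [(95_000, 95_700), (96_400, 97_150)]    # both exons in the same copy

print(spans_both_irs(mixed, regions))  # True
print(spans_both_irs(clean, regions))  # False
```

In practice the exon lists would come from the `join(...)` locations in the genbank file, and the IR boundaries from the repeat features of the same record.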

I have not thoroughly assessed the gff3 file, but it does not seem to have these problems. Hence, I guess this annotation should be preferred. Still, I think the annotation output should be consistent between both formats.

Finally, although these are not inconsistencies, I would like to ask you another two questions:

  • I noticed that chloe does not include the stop codon of the proteins in the CDS (this seems consistent between the genbank and the gff3). Why is that? The annotators that I have looked at so far and the references in RefSeq usually include it, so I think the convention is to include it. Also, some translation software (e.g. Biopython Seq.translate()) looks for the stop codon to check whether the CDS is okay (that is actually how I noticed this).
  • The gff3 file is nicely formatted and includes the gene and mRNA features. Why are these not included in the genbank file? Although you should be able to infer them from the CDS, it is also a convention to include them in the genbank file.
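As a rough illustration of the stop codon point, here is a small pure-Python check along the lines of what Biopython does when `translate` is called with `cds=True`. The function name `cds_status` and the toy sequences are hypothetical, and the stop codons are those of NCBI translation table 11 (the bacterial, archaeal and plant plastid code).

```python
# Stop codons of NCBI translation table 11.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_status(cds):
    """Classify a CDS nucleotide string by its length and terminal codon."""
    seq = cds.upper()
    if len(seq) % 3 != 0:
        return "length not a multiple of three"
    if seq[-3:] in STOP_CODONS:
        return "ends with stop codon"
    return "no terminal stop codon"


# Toy sequences, not taken from any real annotation.
with_stop = "ATGGCTTCTTAA"
without_stop = "ATGGCTTCT"

print(cds_status(with_stop))     # ends with stop codon
print(cds_status(without_stop))  # no terminal stop codon
```

A CDS extracted from an annotation that leaves out the stop codon would fall into the last category, which is why a strict CDS check rejects it even when the rest of the reading frame is fine.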

Thanks again for developing chloe
Cheers

@ian-small
Owner

ian-small commented Dec 9, 2021 via email

@sivico26
Author

Dear Ian,

Thank you for your response. It is great to know that most of the issues are solved in the newest version. I understand the delays as that is quite common in our jobs.

When I follow the link you provided, it goes to a "Not found" page, so I guess you are right that one needs a specific account to access it.

If I find any other weirdness, I will let you know. I look forward to hearing about any Chloe updates.

Cheers

@sivico26
Author

Hi @ian-small,

I wonder if there have been updates on this front.

Cheers,
Simón

@ian-small
Owner

ian-small commented Feb 25, 2022 via email

ian-small added a commit that referenced this issue Feb 14, 2024
@ink-blot

It seems that the gff generated by https://chloe.plastid.org/ now includes the stop codon. The genbank files still do not include it.
